WO2022228179A1 - Video processing method and apparatus, electronic device and storage medium - Google Patents

Video processing method and apparatus, electronic device and storage medium

Info

Publication number
WO2022228179A1
WO2022228179A1 (PCT/CN2022/087381 · CN2022087381W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
subtitle
dubbing
speech rate
duration
Prior art date
Application number
PCT/CN2022/087381
Other languages
English (en)
Chinese (zh)
Inventor
杜育璋
刘坚
李磊
王明轩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022228179A1


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/4355Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • the present disclosure relates to the field of information technology, and in particular, to a video processing method, apparatus, electronic device, and storage medium.
  • terminals have become an indispensable device in people's lives. For example, users can watch videos through the terminal.
  • Some current videos may be videos in other languages, and users may not understand the audio content in the videos.
  • the existing technology is to display subtitles that the user can read in the video, but in some cases, the speed at which the user browses the subtitles may not match the display speed of the subtitles, thereby reducing the user experience.
  • embodiments of the present disclosure provide a video processing method, apparatus, electronic device, and storage medium.
  • An embodiment of the present disclosure provides a video processing method, and the method includes:
  • the first subtitle in the original video is acquired;
  • the first subtitle is translated to obtain the second subtitle;
  • the target speech rate of the dubbing is determined;
  • the dubbing audio corresponding to the second subtitle is generated according to the target speech rate of the dubbing.
  • Embodiments of the present disclosure also provide a video processing apparatus, including:
  • an acquisition module for acquiring the first subtitle in the original video
  • a translation module for translating the first subtitle to obtain a second subtitle;
  • a determination module for determining the target speech rate of the dubbing;
  • a dubbing module configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
  • Embodiments of the present disclosure also provide an electronic device, the electronic device comprising:
  • one or more processors;
  • a storage device for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method as described above.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the video processing method described above is implemented.
  • Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a processor, implement the video processing method as described above.
  • the technical solution provided by the embodiments of the present disclosure has at least the following advantages: the first subtitle in the original video is obtained; the first subtitle is translated to obtain the second subtitle; the target speech rate of the dubbing is determined; and dubbing audio corresponding to the second subtitle is generated according to that target speech rate. This makes it possible to generate dubbing audio that video viewers can understand, which helps users reduce the difficulty of understanding the video content and improves the user experience.
  • By determining the display duration of the target subtitle and the target speech rate of the dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle are adjusted so that the duration of the dubbing audio is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that, for the same meaning, sentences expressed in different languages may differ in length, causing a mismatch between the dubbing duration and the subtitle display duration, and improves the user experience.
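  • For illustration only, the end-to-end flow just described can be sketched as follows; the three callables are hypothetical stand-ins for an ASR engine, a machine-translation engine, and a TTS engine, none of which are named in this disclosure.

```python
from typing import Callable

def dub_segment(
    audio_segment: bytes,
    target_language: str,
    target_speech_rate: float,
    recognize: Callable[[bytes], str],          # ASR: audio segment -> first subtitle
    translate: Callable[[str, str], str],       # MT: first subtitle -> second subtitle
    synthesize: Callable[[str, float], bytes],  # TTS: second subtitle + rate -> dubbing audio
) -> bytes:
    """Dub one audio segment: recognize, translate, then synthesize at the target rate."""
    first_subtitle = recognize(audio_segment)
    second_subtitle = translate(first_subtitle, target_language)
    return synthesize(second_subtitle, target_speech_rate)
```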
  • FIG. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure.
  • FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • This embodiment is applicable to the case of dubbing a video in a client.
  • the method can be executed by a video processing apparatus, and the apparatus can be implemented by software and/or hardware; the apparatus can be configured in an electronic device, such as a terminal, specifically including but not limited to smart phones, PDAs, tablet computers, wearable devices with display screens, desktop computers, notebook computers, all-in-one computers, smart home devices, and the like.
  • this embodiment may also be applicable to the case of dubbing the video in a server; the method may be executed by a video processing apparatus, which may be implemented by means of software and/or hardware and configured in an electronic device, such as a server.
  • the method may specifically include:
  • the first subtitle is in the same language as that used by the characters in the original video. Exemplarily, if the characters in the video speak English, the first subtitle is in English.
  • if the original video includes the first subtitle, the first subtitle can be extracted directly.
  • if the original video does not include the first subtitle, speech recognition is performed on any piece of audio in the original video to obtain the first subtitle.
  • the first subtitle is a subtitle corresponding to any audio segment in the original video.
  • any piece of audio refers to audio information corresponding to any sentence spoken by any character in the video.
  • a video includes an audio stream and a video stream.
  • the video stream includes multiple image frames. Multiple image frames are played in chronological order to form a dynamic image of the video. The characters in this video speak during some periods of time and do not speak during other periods of time.
  • the audio stream is composed of multiple pieces of audio, and each piece of audio corresponds to a sentence spoken by a character in the video. A piece of audio corresponds to multiple image frames.
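  • As a minimal illustration of this structure (the names below are ours, not the disclosure's), each piece of audio can be modeled together with the image frames it spans:

```python
from dataclasses import dataclass, field

@dataclass
class AudioSegment:
    """One continuous utterance by a character, e.g. the t0 or t3 period below."""
    start: float                 # start time within the video, in seconds
    end: float                   # end time within the video, in seconds
    frame_indices: list[int] = field(default_factory=list)  # image frames it covers

    @property
    def duration(self) -> float:
        return self.end - self.start
```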
  • FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure.
  • the character in the video speaks during the t0 time period and the t3 time period, and does not speak during the rest of the time period.
  • the image frames in the time period t0 correspond to one continuous segment of speech, and the continuous speech corresponding to the time period t0 constitutes one piece of audio.
  • the image frames in the time period t3 correspond to another continuous segment of speech, and the continuous speech corresponding to the time period t3 constitutes another piece of audio.
  • In this way, according to the speech pauses in the audio, each continuous sentence of Chinese speech can be recognized as one continuous sentence of Chinese subtitles.
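  • A simplified sketch of such pause-based segmentation is given below; the plain energy threshold stands in for whatever voice-activity detection an actual implementation would use, and all parameter values are illustrative.

```python
def split_on_pauses(samples: list[float], sample_rate: int,
                    min_pause: float = 0.5, energy_threshold: float = 0.01) -> list[tuple[float, float]]:
    """Return (start, end) times of continuous speech runs separated by pauses."""
    segments: list[tuple[float, float]] = []
    seg_start = silent_since = None
    for i, s in enumerate(samples):
        t = i / sample_rate
        if abs(s) >= energy_threshold:            # voiced sample
            if seg_start is None:
                seg_start = t
            silent_since = None
        elif seg_start is not None:               # silence inside a segment
            if silent_since is None:
                silent_since = t
            elif t - silent_since >= min_pause:   # pause long enough: close the segment
                segments.append((seg_start, silent_since))
                seg_start = silent_since = None
    if seg_start is not None:                     # flush a trailing segment
        end = silent_since if silent_since is not None else len(samples) / sample_rate
        segments.append((seg_start, end))
    return segments
```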
  • the second subtitle is in a different language than the first subtitle.
  • the first subtitle is in Chinese
  • the second subtitle is in English.
  • the second subtitle is in a language understood by the viewer of the video. In actual setting, the second subtitle can be set according to the needs of the video viewer.
  • the target speech rate of dubbing refers to the speech rate of dubbing used for reading the second subtitle.
  • a plurality of dubbing timbre data are pre-stored in the database, and different dubbing timbre data have corresponding default speech rate data.
  • a dubbing timbre is selected, and the default speech rate corresponding to the selected timbre, and/or a speech rate obtained by adjusting that default speech rate, is used as the target speech rate of the dubbing.
  • the technical solutions provided by the embodiments of the present disclosure obtain the first subtitle in the original video, translate the first subtitle to obtain the second subtitle, determine the target speech rate of the dubbing, and generate, according to that target speech rate, dubbing audio corresponding to the second subtitle. This makes it possible to generate dubbing audio that video viewers can understand, which helps users reduce the difficulty of understanding the video content and improves the user experience.
  • the dubbed audio obtained by the above technical solution of the present application may be played alone, or the audio in the original video may be replaced with the dubbed audio to obtain a dubbed video. The original video may also be played synchronously with the dubbed audio, so as to achieve a dubbing effect when the video is played.
  • FIG. 3 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • FIG. 3 is a specific example of FIG. 1. Referring to FIG. 3, the method includes:
  • S130 Determine the display duration of the target subtitle and the target speech rate of the dubbing, where the target subtitle includes the first subtitle and/or the second subtitle.
  • When generating dubbing audio based on the second subtitle, the duration of the dubbing audio should be equivalent to the playback duration of the image frames corresponding to the dubbing audio (which can also be understood as the display duration of the first subtitle and/or the second subtitle, that is, the display duration of the target subtitle here). This is because, if the duration of the dubbed audio is greater than the playback duration of the corresponding image frames, the dubbed audio will not yet have ended when the image frames have finished playing; if the duration of the dubbed audio is less than the playback duration of the corresponding image frames, the dubbed audio will have ended while the image frames are still being displayed. Both situations cause the image to be out of sync with the audio, affecting the user experience.
  • To make the dubbing audio duration equal to the playback duration of the image frames corresponding to the dubbing audio, two parameters must first be determined: the dubbing audio duration, and the playback duration of the image frames corresponding to the dubbing audio.
  • the duration of the dubbing audio mainly depends on two quantities: the length of the second subtitle and the speech rate of the dubbing. Since the second subtitle has been obtained in S120, its length is fixed in this step, so the duration of the dubbing audio at this point mainly depends on the speech rate of the dubbing.
  • the essence of this step is to determine an appropriate display duration for the target subtitle and an appropriate target speech rate for the dubbing, so that the duration of the dubbing audio corresponding to the second subtitle is consistent with the display duration of the target subtitle within the allowable error range.
  • the sequence between determining the display duration of the target subtitles and determining the target speech rate for dubbing is not limited.
  • determining the display duration of the target subtitles and determining the target speech rate of dubbing may be two independent processes, or may be interrelated processes.
  • the essence of this step is to read the text information in the second subtitle at the target speech rate of the dubbing determined in S130, thereby generating the dubbing audio corresponding to the second subtitle.
  • Exemplarily, the target video is obtained by replacing the Chinese audio in the original video with the English audio, and adding Chinese subtitles and/or English subtitles to the pictures of the video frames.
  • The essence of the above technical solutions of the present application is that, by determining the display duration of the target subtitle and the target speech rate of the dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle are adjusted so that the dubbing audio duration is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that, for the same meaning, sentences expressed in different languages may differ in length, causing a mismatch between the dubbing duration and the subtitle display duration, and improves the user experience.
  • S150 can also be replaced by: generating an audio file according to the dubbing audio corresponding to each segment of audio in the original video, the audio file including the dubbing audio corresponding to each segment of audio in the original video and time information for each dubbed audio. When the original video is played, each dubbing audio is called and played in sequence according to the current playing progress of the original video and the time information of each dubbing audio.
  • the time information of each dubbed audio includes the start time and/or the end time of the dubbed audio.
  • each audio file includes the start time of the dubbing audio corresponding to the audio file.
  • Exemplarily, the start time of dubbing audio A is the 12th second counted from the playback of the first image frame of the video; when playback reaches the 12th second, dubbing audio A is called and played.
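  • A hedged sketch of this playback scheme: each dubbing clip carries its start time, and a per-tick check triggers the clip once the video's progress reaches that time (the play_clip callback is a stand-in for the actual audio output path).

```python
from typing import Callable

def schedule_dubbing(progress_seconds: float,
                     clips: list[tuple[float, bytes]],
                     played: set[int],
                     play_clip: Callable[[bytes], None]) -> None:
    """Call once per playback tick; clips is a list of (start_time, dubbing_audio) pairs."""
    for i, (start_time, audio) in enumerate(clips):
        # E.g. dubbing audio A with start_time == 12.0 fires at the 12th second.
        if i not in played and progress_seconds >= start_time:
            play_clip(audio)
            played.add(i)
```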
  • the original audio of the original video is eliminated.
  • only the audio of the characters speaking in the original video is eliminated, and the background sound of the original video is retained.
  • a play button or icon corresponding to the audio file may be displayed in the user interface for playing the target video.
  • the audio in the target video is still the audio in the original video, that is, the audio in the original video is not replaced with the corresponding dubbing audio.
  • when the button or icon is turned on, the audio in the original video is replaced with the corresponding dubbing audio, that is, the audio in the target video becomes the dubbing audio.
  • the terminal can play the dubbing audio alone.
  • Case 1: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle.
  • the duration of the dubbed audio and the duration of the audio in the original video are already consistent within the allowable error range. In this case, it can be directly determined that the display duration of the target subtitle is the time corresponding to the audio segment, and that the target speech rate of the dubbing is the default speech rate.
  • a plurality of dubbing timbre data are pre-stored in the database, and different dubbing timbre data have corresponding default speech rate data.
  • the default speech rate of dubbing is the default speech rate corresponding to the selected dubbing timbre, which is obtained based on the dubbing timbre data stored in the database.
  • Case 2: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle.
  • the duration of the dubbed audio and the duration of the audio in the original video differ considerably and cannot be considered consistent within the allowable error range.
  • the display duration of the target subtitle can be determined based on the duration of the audio in the original video; at the same time, the target speech rate of the dubbing can be determined based on the default speech rate of the dubbing.
  • FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the above case one. Referring to Figure 4, the method includes:
  • S1311. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • if the first difference between the duration of the audio segment and the default duration is less than or equal to a first threshold, it is determined that the display duration of the target subtitle is the time corresponding to the audio segment, and that the target speech rate of the dubbing is the default speech rate.
  • the "first difference between the duration of any piece of audio and the default duration” may be the absolute value of the difference between the duration of any piece of audio and the default duration, or may be the ratio of the duration of any piece of audio to the default duration.
  • the essence of this setting is that, if the duration of an audio segment is consistent, within the allowable error range, with the default duration of its corresponding dubbing, the display duration of the target subtitle is directly determined to be the time corresponding to this audio segment, and the target speech rate of the dubbing is the default speech rate.
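  • A minimal sketch of this check, assuming the absolute-difference variant of the first difference and assuming the default duration is the second subtitle's length divided by the default speech rate (a natural reading of S1311, not a formula stated verbatim in the text):

```python
def fits_at_default_rate(audio_duration: float, subtitle_length: float,
                         default_speech_rate: float, first_threshold: float) -> bool:
    """Case 1: dubbing at the default rate already matches the audio segment."""
    default_duration = subtitle_length / default_speech_rate
    first_difference = abs(audio_duration - default_duration)
    return first_difference <= first_threshold

# If True: display duration = time span of the audio segment,
#          target speech rate = default speech rate.
```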
  • FIG. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 5, the method includes:
  • S1321. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • here, the display duration of the target subtitle is the adjusted display duration of the picture corresponding to the audio segment.
  • However, the display duration of the target subtitle cannot be increased without limit. If the increased display duration is too long (exceeding a certain limit), the image frames corresponding to this audio segment will switch too slowly while the image frames corresponding to the other audio segments switch normally, making the video as a whole incoherent and affecting the user experience. Therefore, the maximum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
  • Suppose the duration adjustment parameter is x1 and the adjusted display duration of the subtitle is T2; then T2 = T1/x1, where x1 is a number less than 1, so that T2 is greater than the original display duration T1.
  • the implementation method for increasing the display duration of the target subtitles includes: reducing the display speed of a picture corresponding to any piece of audio.
  • the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.
  • Equivalently, suppose the display speed adjustment parameter is x1 and the adjusted display speed of the picture corresponding to the audio segment is V2; then V2 = V1 × x1, where V1 is the original display speed.
  • Exemplarily, suppose the display speed of the image frames is reduced by 10% and the t0 time period corresponds to 20 image frames in total; the display duration of these 20 image frames then becomes t0/x1. Changing the value of x1 from 1 to 0.9 is equivalent to increasing the display duration of the image frames, that is, increasing the display duration of this group of subtitles.
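  • To make the arithmetic concrete, the two relations T2 = T1 / x1 and V2 = V1 × x1 can be checked with a small example (the concrete values of T1 and V1 are assumed for illustration):

```python
T1 = 10.0     # assumed original display duration of the subtitle picture, seconds
V1 = 2.0      # assumed original image-frame display speed, frames per second
x1 = 0.9      # duration adjustment parameter: display speed reduced by 10%

T2 = T1 / x1  # adjusted display duration: 10.0 / 0.9 ~ 11.11 s (longer)
V2 = V1 * x1  # adjusted display speed:    2.0 * 0.9 = 1.8 fps (slower)
print(f"T2 = {T2:.2f} s, V2 = {V2:.2f} fps")
```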
  • If the target speech rate of the dubbing is increased instead, and the increased target speech rate is too fast (exceeding a certain limit), the user may not hear the dubbing clearly, which affects the user experience. Therefore, the maximum value of the target speech rate can be limited by setting a speech rate adjustment parameter.
  • Suppose the default speech rate corresponding to the dubbing timbre selected for a certain piece of audio in the original video is denoted V3 (V3 is fixed), the speech rate adjustment parameter is x2, and the target speech rate of the dubbing is denoted V4, whose initial value is V3; the target speech rate then satisfies V4 = V3 × x2.
  • the length of a piece of text can be understood as the number of words in the text, the number of syllables in the text, or the like.
  • the product of the display duration of the target subtitle and the target speech rate of dubbing is equal to the length of the second subtitle, which means that the length of readable text at the target speech rate of dubbing within the display duration of the target subtitle is exactly equal to the length of the second subtitle.
  • the time required to read the second subtitle at the target speech rate of dubbing is exactly equal to the display duration of the target subtitle.
  • the second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other; alternatively, it may be the ratio of that product to the length of the second subtitle.
  • the maximum display duration of the target subtitle and the maximum target speech rate need to be limited, which requires the length of the second subtitle to be within a certain range (that is, the "preset range" in S1322).
  • If the length of the second subtitle is within the preset range, then by adjustment the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle can be made less than or equal to the second threshold; if the length of the second subtitle is not within the preset range, the second difference cannot be made less than or equal to the second threshold.
  • the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • the preset range includes an upper limit value (ie, a maximum value), and the upper limit value is related to the default speech rate, the duration of any piece of audio, the minimum value of the duration adjustment parameter, and the maximum value of the speech rate adjustment parameter.
  • the upper limit value is n1 = (V3 × 1.1) × (T1 / 0.9).
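  • Both bounds of the preset range follow from the stated limits on the two adjustment parameters; a small sketch (the lower bound uses the reduction-case limits introduced later in this description):

```python
def preset_range(v3: float, t1: float,
                 x1_min: float = 0.9, x1_max: float = 1.1,
                 x2_min: float = 0.9, x2_max: float = 1.1) -> tuple[float, float]:
    """Shortest and longest second-subtitle length the adjustments can still fit.

    v3: default speech rate of the selected dubbing timbre;
    t1: duration of the audio segment.
    """
    upper = (v3 * x2_max) * (t1 / x1_min)  # n1 = (V3 * 1.1) * (T1 / 0.9)
    lower = (v3 * x2_min) * (t1 / x1_max)  # (V3 * 0.9) * (T1 / 1.1)
    return lower, upper
```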
  • Starting from the default speech rate, the target speech rate of the dubbing is gradually increased; if the target speech rate of the dubbing has reached the maximum value and the second difference is still greater than the second threshold, then, starting from the duration of the audio segment, the display duration of the target subtitle is gradually increased until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the speech rate adjustment parameter x2, whose value gradually increases from 1 to 1.1, for example by stepping through intermediate values in order.
  • Alternatively, starting from the duration of the audio segment, the display duration of the target subtitle is gradually increased; if the display duration of the target subtitle has reached the maximum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, the target speech rate of the dubbing is gradually increased until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the duration adjustment parameter x1, whose value gradually decreases from 1 to 0.9, for example by stepping through intermediate values in order.
  • Alternatively, starting from a display duration equal to the duration of the audio segment and a target speech rate equal to the default speech rate, both can be adjusted together. Other filtering conditions can also be added, such as x1 + x2 being the smallest, 2·x1 + x2 being the smallest, or x1² + x2² being the smallest, to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
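  • A sketch of the speech-rate-first strategy for this increase case; the step size and the absolute-difference form of the second difference are illustrative choices, not values fixed by the text.

```python
def fit_by_increasing(subtitle_length: float, t1: float, v3: float,
                      second_threshold: float, step: float = 0.005) -> tuple[float, float]:
    """Return (display_duration, target_speech_rate) for a too-long second subtitle.

    t1: duration of the audio segment; v3: default speech rate.
    Raises x2 from 1 toward 1.1 first, then lowers x1 from 1 toward 0.9.
    """
    x1, x2 = 1.0, 1.0  # start from the audio duration and the default rate

    def second_difference(x1: float, x2: float) -> float:
        return abs((t1 / x1) * (v3 * x2) - subtitle_length)

    # Priority: gradually increase the target speech rate (x2 up to 1.1).
    while second_difference(x1, x2) > second_threshold and x2 < 1.1:
        x2 = min(1.1, x2 + step)
    # If the rate is maxed out, gradually increase the display duration (x1 down to 0.9).
    while second_difference(x1, x2) > second_threshold and x1 > 0.9:
        x1 = max(0.9, x1 - step)
    return t1 / x1, v3 * x2
```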
  • FIG. 6 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 6, the method includes:
  • S1331. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • here, the display duration of the target subtitle is the adjusted display duration of the picture corresponding to the audio segment.
  • the minimum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
  • Suppose the duration adjustment parameter is x1 and the adjusted display duration of the subtitle is T2; then T2 = T1/x1, where x1 is a number greater than 1, so that T2 is less than the original display duration T1.
  • the implementation method for reducing the display duration of the target subtitles includes: increasing the display speed of a picture corresponding to any piece of audio.
  • the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.
  • Equivalently, suppose the display speed adjustment parameter is x1 and the adjusted display speed of the picture corresponding to the audio segment is V2; then V2 = V1 × x1, where V1 is the original display speed.
  • If the target speech rate of the dubbing is reduced instead, and the reduced target speech rate is too slow (exceeding a certain limit), the speech rate of this audio segment will be too slow while the speech rates corresponding to the other audio segments remain normal, making the video as a whole incoherent and affecting the user experience. Therefore, the minimum value of the target speech rate can be limited by setting the speech rate adjustment parameter.
  • Suppose the default speech rate corresponding to the dubbing timbre selected for a certain piece of audio in the original video is denoted V3 (V3 is fixed), the speech rate adjustment parameter is x2, and the target speech rate of the dubbing is denoted V4, whose initial value is V3; the target speech rate then satisfies V4 = V3 × x2.
  • the length of a piece of text can be understood as the number of words in the text, the number of syllables in the text, or the like.
  • the product of the display duration of the target subtitle and the target speech rate of dubbing is equal to the length of the second subtitle, which means that the length of readable text at the target speech rate of dubbing within the display duration of the target subtitle is exactly equal to the length of the second subtitle.
  • the time required to read the second subtitle at the target speech rate of dubbing is exactly equal to the display duration of the target subtitle.
  • the second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other; alternatively, it may be the ratio of that product to the length of the second subtitle.
  • If the length of the second subtitle is within the preset range, the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle can be made less than or equal to the second threshold; otherwise, it cannot.
  • the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • the preset range includes a lower limit value (ie, a minimum value), and the lower limit value is related to the default speech rate, the duration of any piece of audio, the maximum value of the duration adjustment parameter, and the minimum value of the speech rate adjustment parameter.
  • the lower limit value is n1 = (V3 × 0.9) × (T1 / 1.1).
  • Starting from the default speech rate, the target speech rate of the dubbing is gradually reduced; if the target speech rate of the dubbing has reached the minimum value and the second difference is still greater than the second threshold, then, starting from the duration of the audio segment, the display duration of the target subtitle is gradually reduced until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the speech rate adjustment parameter x2, whose value gradually decreases from 1 to 0.9, for example by stepping through intermediate values in order.
  • Alternatively, starting from the duration of the audio segment, the display duration of the target subtitle is gradually reduced; if the display duration of the target subtitle has reached the minimum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, the target speech rate of the dubbing is gradually reduced until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the duration adjustment parameter x1, whose value gradually increases from 1 to 1.1, for example by stepping through intermediate values in order.
  • Alternatively, starting from a display duration equal to the duration of the audio segment and a target speech rate equal to the default speech rate, both can be adjusted together. Other filtering conditions can also be added, such as x1 + x2 being the smallest, 2·x1 + x2 being the smallest, or x1² + x2² being the smallest, to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
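  • The reduction case mirrors the earlier sketch (x2 steps down toward 0.9 first, then x1 steps up toward 1.1). For the joint variant with a filtering condition, a grid search is one hedged reading; the objective below penalizes deviation of x1 and x2 from 1, which is our interpretation of conditions like "x1² + x2² is the smallest", not a formula fixed by the text.

```python
def best_combination(subtitle_length: float, t1: float, v3: float,
                     second_threshold: float, steps: int = 41) -> tuple[float, float] | None:
    """Jointly search x1, x2 in [0.9, 1.1]; return (display_duration, target_rate)."""
    best, best_cost = None, float("inf")
    for i in range(steps):
        x1 = 0.9 + 0.2 * i / (steps - 1)
        for j in range(steps):
            x2 = 0.9 + 0.2 * j / (steps - 1)
            # Feasible if the second difference is within the second threshold.
            if abs((t1 / x1) * (v3 * x2) - subtitle_length) <= second_threshold:
                cost = (x1 - 1.0) ** 2 + (x2 - 1.0) ** 2  # prefer the least disruption
                if cost < best_cost:
                    best, best_cost = (t1 / x1, v3 * x2), cost
    return best  # None if no pair within the allowed ranges fits
```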
  • FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects.
  • the target object can be understood as a person in the video.
  • the method further includes:
  • dubbing timbre data are stored in the database in advance, and different dubbing timbre data correspond to different character attribute data.
  • the person attribute data includes the age, gender, tone, occupation, and the like of the person.
  • the dubbing timbres corresponding to the same target object are the same, and the dubbing timbres corresponding to different target objects are different.
  • The above technical solution selects, for each target object among the plurality of target objects, the dubbing timbre corresponding to that target object. Correspondingly, this makes it convenient for the user to distinguish different characters by sound after dubbing, which can improve the user experience.
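  • A sketch of per-character timbre selection: each distinct target object keeps one timbre and different objects get different timbres (matching timbres to character attributes such as age and gender is reduced here to a simple ordered lookup).

```python
def assign_timbres(segment_speakers: list[str], timbre_library: list[str]) -> dict[str, str]:
    """Map each distinct speaker to a distinct dubbing timbre, stable across segments."""
    assignment: dict[str, str] = {}
    for speaker in segment_speakers:
        if speaker not in assignment:
            if len(assignment) >= len(timbre_library):
                raise ValueError("more target objects than available dubbing timbres")
            assignment[speaker] = timbre_library[len(assignment)]
    return assignment

# Example: segments spoken by A, B, A -> A and B receive different timbres,
# and both of A's segments are dubbed with the same timbre.
print(assign_timbres(["A", "B", "A"], ["timbre-1", "timbre-2"]))
```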
  • FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure.
  • the video processing apparatus provided by the embodiment of the present disclosure may be configured in a client or may be configured in a server, and the video processing apparatus specifically includes:
  • an obtaining module 310 configured to obtain the first subtitle in the original video
  • a translation module 320 configured to translate the first subtitle to obtain a second subtitle
  • a determination module 330 configured to determine the target speech rate of the dubbing
  • the dubbing module 340 is configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
  • the first subtitle is a subtitle corresponding to any segment of audio in the original video
  • the determining module 330 is further configured to determine the display duration of target subtitles, where the target subtitles include the first subtitle and/or the second subtitle;
  • the apparatus also includes a replacement module 350, which is used to replace the audio in the original video with the dubbing audio to obtain a target video; within the display duration of the target subtitle in the target video, the target subtitle is displayed on the screen.
  • the determination module is used to:
  • determine that the display duration of the target subtitle is the time corresponding to the audio segment, and that the target speech rate of the dubbing is the default speech rate.
  • the device also includes a first adjustment module.
  • the first adjustment module is used to:
  • increase the display duration of the target subtitle, and/or increase the target speech rate of the dubbing, so that the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other, is less than or equal to a second threshold.
  • the first adjustment module is used for:
  • on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing; if the target speech rate of the dubbing has reached the maximum value and the second difference is greater than the second threshold, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  • the first adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of the audio segment, gradually increase the display duration of the target subtitle; if the display duration of the target subtitle has reached the maximum value and the second difference is greater than the second threshold, gradually increase the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
  • the first adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of the audio segment and that the target speech rate of the dubbing is the default speech rate, gradually increase the display duration of the target subtitle and the target speech rate of the dubbing until the second difference is less than or equal to a second threshold.
  • the first adjustment module increases the display duration of the target subtitle by reducing the display speed of the picture corresponding to any piece of audio.
  • the device also includes a second adjustment module.
  • the second adjustment module is used for:
  • if the duration of the audio segment is greater than the default duration, and the first difference between the duration of the audio segment and the default duration is greater than a first threshold, determine whether the length of the second subtitle is within the preset range;
  • reduce the display duration of the target subtitle, and/or reduce the target speech rate of the dubbing, so that the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other, is less than or equal to a second threshold.
  • the second adjustment module is used for:
  • on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing; if the target speech rate of the dubbing has reached the minimum value and the second difference is greater than the second threshold, then, on the basis that the display duration of the target subtitle is the duration of the audio segment, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  • the second adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of the audio segment, gradually reduce the display duration of the target subtitle; if the display duration of the target subtitle has reached the minimum value and the second difference is greater than the second threshold, gradually reduce the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
  • the second adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of the audio segment and that the target speech rate of the dubbing is the default speech rate, gradually reduce the display duration of the target subtitle and the target speech rate of the dubbing until the second difference is less than or equal to a second threshold.
  • the second adjustment module reduces the display duration of the target subtitle by increasing the display speed of the picture corresponding to any piece of audio.
  • the translation module is further used to: if the length of the second subtitle is not within the preset range, retranslate the first subtitle according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects;
  • the device further includes a selection module; the selection module is configured to, for each target object in the plurality of target objects, select a dubbing timbre corresponding to the target object;
  • a dubbing module configured to generate a plurality of dubbing audios corresponding to the multi-segment audios according to the dubbing timbres corresponding to each target object respectively;
  • a replacement module configured to replace the multiple audio segments in the original video with the multiple dubbed audios to obtain a target video.
  • the video processing apparatus provided by the embodiments of the present disclosure can execute the steps performed by the client or the server in the video processing method provided by the method embodiments of the present disclosure, and the execution steps and beneficial effects are not repeated here.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure; specifically, it shows an electronic device 1000 suitable for implementing an embodiment of the present disclosure.
  • the electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (e.g., in-vehicle navigation terminals), and wearable electronic devices, as well as stationary terminals such as digital TVs, desktop computers, and smart home devices.
  • the electronic device shown in FIG. 9 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002, or a program loaded from a storage device 1008 into a random access memory (RAM) 1003, so as to implement the video processing method of the embodiments described in the present disclosure.
  • In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 are also stored.
  • the processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to the bus 1004.
  • the following devices can be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 1007 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 1008 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 1009.
  • The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While FIG. 9 shows the electronic device 1000 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart, thereby implementing the video processing method described above.
  • the computer program may be downloaded and installed from the network via the communication device 1009, or from the storage device 1008, or from the ROM 1002.
  • When the computer program is executed by the processing apparatus 1001, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire the first subtitle in the original video; translate the first subtitle to obtain the second subtitle; determine the target speech rate of the dubbing; and generate the dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
  • the electronic device may also perform other steps described in the above embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or in hardware; in some cases, the name of a unit does not constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the present disclosure provides an electronic device, comprising:
  • one or more processors;
  • memory for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method as provided in any one of the embodiments of the present disclosure.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the video processing method according to any one of the embodiments of the present disclosure is implemented.
  • Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a processor, implement the video processing method as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The present invention relates to a video processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring first subtitles in an original video; translating the first subtitles to obtain second subtitles; determining a target speech rate for dubbing; and generating, according to the target speech rate for dubbing, dubbing audio corresponding to the second subtitles.
PCT/CN2022/087381 2021-04-29 2022-04-18 Video processing method and apparatus, electronic device and storage medium WO2022228179A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110472124.X 2021-04-29
CN202110472124.XA CN113207044A (zh) Video processing method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2022228179A1 (fr)

Family

ID=77029350

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087381 WO2022228179A1 (fr) 2021-04-29 2022-04-18 Video processing method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113207044A (fr)
WO (1) WO2022228179A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113207044A (zh) Video processing method and apparatus, electronic device and storage medium
CN114025236A (zh) Video content understanding method and apparatus, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160021334A1 (en) * 2013-03-11 2016-01-21 Video Dubber Ltd. Method, Apparatus and System For Regenerating Voice Intonation In Automatically Dubbed Videos
CN109119063A (zh) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubbing generation method, apparatus, device and storage medium
CN111683266A (zh) * 2020-05-06 2020-09-18 厦门盈趣科技股份有限公司 Method and terminal for configuring subtitles through simultaneous video translation
US20200404386A1 (en) * 2018-02-26 2020-12-24 Google Llc Automated voice translation dubbing for prerecorded video
CN112562721A (zh) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, apparatus and storage medium
CN113207044A (zh) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Video processing method and apparatus, electronic device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195386A1 (en) * 2005-05-31 2008-08-14 Koninklijke Philips Electronics, N.V. Method and a Device For Performing an Automatic Dubbing on a Multimedia Signal
CN109218629B (zh) * 2018-09-14 2021-02-05 三星电子(中国)研发中心 Video generation method, storage medium and apparatus
US11159597B2 (en) * 2019-02-01 2021-10-26 Vidubly Ltd Systems and methods for artificial dubbing


Also Published As

Publication number Publication date
CN113207044A (zh) 2021-08-03

Similar Documents

Publication Publication Date Title
WO2022228179A1 (fr) Video processing method and apparatus, electronic device and storage medium
US20220239882A1 (en) Interactive information processing method, device and medium
US11972770B2 (en) Systems and methods for intelligent playback
WO2020098115A1 (fr) Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
WO2023011142A1 (fr) Video processing method and apparatus, electronic device and storage medium
KR20220103110A (ko) Video generation apparatus and method, electronic device, and computer-readable medium
US20230011395A1 (en) Video page display method and apparatus, electronic device and computer-readable medium
CN110418183B (zh) Audio and video synchronization method and apparatus, electronic device, and readable medium
CN113257218B (zh) Speech synthesis method and apparatus, electronic device, and storage medium
WO2023051293A1 (fr) Audio processing method and apparatus, electronic device and storage medium
CN113507637A (zh) Media file processing method, apparatus, device, readable storage medium, and product
CN112908292A (zh) Text speech synthesis method and apparatus, electronic device, and storage medium
WO2023165371A1 (fr) Audio playing method and apparatus, electronic device and storage medium
CN113992926B (zh) Interface display method and apparatus, electronic device, and storage medium
WO2022012390A1 (fr) Video recording method and apparatus, electronic device and storage medium
CN114845212A (zh) Volume optimization method and apparatus, electronic device, and readable storage medium
CN114554238A (zh) Live-streaming speech simultaneous interpretation method and apparatus, medium, and electronic device
WO2024037480A1 (fr) Interaction method and apparatus, electronic device and storage medium
WO2022257777A1 (fr) Multimedia processing method and apparatus, device and medium
CN115171645A (zh) Dubbing method and apparatus, electronic device, and storage medium
CN112530472B (zh) Audio and text synchronization method and apparatus, readable medium, and electronic device
CN115967833A (zh) Video generation method, apparatus, device, and storage medium
JP2022095689A (ja) Audio data noise reduction method, apparatus, device, storage medium, and program
CN111652002A (zh) Text segmentation method, apparatus, device, and computer-readable medium
US11792494B1 (en) Processing method and apparatus, electronic device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22794650

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22794650

Country of ref document: EP

Kind code of ref document: A1