WO2022017459A1 - Subtitle generation method, apparatus, device and storage medium - Google Patents

Subtitle generation method, apparatus, device and storage medium

Info

Publication number
WO2022017459A1
WO2022017459A1 (PCT/CN2021/107845, CN2021107845W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
display time
audio
fragment
text fragment
Prior art date
2020-07-23
Application number
PCT/CN2021/107845
Other languages
English (en)
French (fr)
Inventor
曾衍
常为益
付平非
郑起凡
林兆钦
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2021-07-22
Publication date
2022-01-27
Application filed by 北京字节跳动网络技术有限公司 filed Critical 北京字节跳动网络技术有限公司
Priority to EP21845741.4A priority Critical patent/EP4171018A4/en
Publication of WO2022017459A1 publication Critical patent/WO2022017459A1/zh
Priority to US18/087,631 priority patent/US11837234B2/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/278 Subtitling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4396 Processing of audio elementary streams by muting the audio signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/47205 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H04N 21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages

Definitions

  • The present disclosure relates to the field of data processing and, in particular, to a subtitle generation method, apparatus, device and storage medium.
  • Generating subtitles for an audio/video file refers to performing speech recognition on the file and using the recognition result as the file's subtitles.
  • At present, the subtitles of an audio/video file are the recognition result obtained by performing overall speech recognition on the audio data on all audio tracks in the file. The audio data on the individual audio tracks may affect each other; for example, several audio tracks may all carry audio data in the same time period. Recognizing the audio data on all audio tracks in that period as a whole may therefore produce inaccurate recognition results, which in turn makes the subtitles generated for the file inaccurate.
  • The present disclosure provides a subtitle generation method, apparatus, device and storage medium that can improve the accuracy of subtitles generated for audio/video files.
  • The present disclosure provides a subtitle generation method, the method comprising:
  • in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, performing speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
  • generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  • In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time, and generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track includes: sorting all the text fragments based on the start display time of each text fragment; determining whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment; if so, performing display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment; and merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
  • In an optional implementation, before merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file, the method further includes: determining, among all the text fragments, at least one text fragment with the same start display time, and determining, from the at least one text fragment, the text fragment with the latest end display time; and deleting, from the at least one text fragment, the text fragments other than the one with the latest end display time.
  • In an optional implementation, the method further includes: updating the subtitles in response to an adjustment operation for the subtitles, where the adjustment operation includes an addition operation, a deletion operation or a modification operation.
  • In an optional implementation, the method further includes: performing display time compression on the subtitles of the target audio/video file based on the variable speed playback multiple of the target audio/video file.
  • The present disclosure further provides a subtitle generation apparatus, the apparatus comprising:
  • a recognition module configured to, in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
  • a generation module configured to generate the subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  • In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time, and the generation module includes:
  • a sorting submodule configured to sort all the text fragments based on the start display time of each of the text fragments;
  • a judgment submodule configured to determine whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
  • a compression submodule configured to, when the end display time of the former text fragment is later than the start display time of the latter text fragment, perform display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
  • a generation submodule configured to merge all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
  • In an optional implementation, the apparatus further includes:
  • a determination module configured to determine, among all the text fragments, at least one text fragment with the same start display time, and to determine, from the at least one text fragment, the text fragment with the latest end display time;
  • a deletion module configured to delete, from the at least one text fragment, the text fragments other than the one with the latest end display time.
  • The present disclosure further provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to implement the above method.
  • The present disclosure further provides a device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the above method when executing the computer program.
  • An embodiment of the present disclosure provides a subtitle generation method: when a subtitle generation trigger operation for at least one audio track in a target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track; then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated.
  • Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiment of the present disclosure performs independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other and yields more accurate speech recognition results, thereby improving the accuracy of the subtitles generated from those results.
  • FIG. 1 is a flowchart of a subtitle generation method according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of a subtitle generation interface according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram of processing text fragments according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of a subtitle display interface according to an embodiment of the present disclosure;
  • FIG. 5 is a schematic diagram of another subtitle display interface according to an embodiment of the present disclosure;
  • FIG. 6 is a structural block diagram of a subtitle generation apparatus according to an embodiment of the present disclosure;
  • FIG. 7 is a structural block diagram of a subtitle generation device according to an embodiment of the present disclosure.
  • At present, the subtitles of an audio/video file are the recognition result obtained by performing overall speech recognition on the audio data on all audio tracks in the file. However, the audio data on the individual audio tracks may affect each other, so overall speech recognition of the audio/video file may be inaccurate.
  • Therefore, the embodiments of the present disclosure provide a subtitle generation method that performs independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other and improves the accuracy of the speech recognition result. Based on the more accurate speech recognition result, more accurate subtitles can be generated for the audio/video file.
  • Specifically, in the subtitle generation method, when a subtitle generation trigger operation for at least one audio track in the target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track. Then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated.
  • On this basis, an embodiment of the present disclosure provides a subtitle generation method. Referring to FIG. 1, a flowchart of a subtitle generation method provided by an embodiment of the present disclosure, the method includes:
  • S101: In response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track.
  • The target audio/video file in the embodiments of the present disclosure may be an audio file or a video file.
  • In practice, the target audio/video file usually includes multiple audio tracks. The embodiments of the present disclosure can trigger the subtitle generation operation on the target audio/video file for some or all of these audio tracks; that is, the at least one audio track may be some or all of the audio tracks included in the target audio/video file.
  • FIG. 2 is a schematic diagram of a subtitle generation interface provided by an embodiment of the present disclosure. For example, a user can select one or more audio tracks displayed in the interface and then click the "Generate Subtitles" button to trigger the subtitle generation operation for the selected audio track or tracks.
  • When the subtitle generation trigger operation for at least one audio track in the target audio/video file is received, the audio data on each audio track in the at least one audio track is determined, and speech recognition is then performed on the audio data on each audio track to obtain the text fragment corresponding to each audio track. The specific speech recognition method is not described in detail in the embodiments of the present disclosure.
  • Since the audio data on an audio track usually includes multiple audio clips, speech recognition is performed on each audio clip to obtain a text fragment corresponding to that clip, and the text fragments corresponding to the audio clips belonging to the same audio track constitute the text fragments corresponding to that track. That is to say, in the present disclosure, the text fragments corresponding to an audio track include the text fragments respectively corresponding to the multiple audio clips on that track, and processing the text fragments of an audio track means processing the text fragments respectively corresponding to those clips. After speech recognition has been completed for the audio data on every selected audio track, the text fragments corresponding to each audio track are obtained. A short sketch of this per-track flow follows.
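The per-track recognition step can be pictured with a short sketch. The following Python is illustrative only and not the implementation of the disclosure; `recognize_clip` is a hypothetical placeholder for whatever speech recognition engine is used, and the clip tuples stand in for the audio clips and their start/end times described above.

```python
from dataclasses import dataclass

@dataclass
class TextFragment:
    text: str     # recognized text of one audio clip
    start: float  # start display time = the clip's start time (seconds)
    end: float    # end display time = the clip's end time (seconds)

def transcribe_track(clips, recognize_clip):
    """Run speech recognition independently on each clip of a single track.

    `clips` is an iterable of (audio, start, end) tuples for one audio track;
    `recognize_clip` is a placeholder for the ASR engine (an assumption here).
    """
    return [TextFragment(recognize_clip(audio), start, end)
            for audio, start, end in clips]

def transcribe_selected_tracks(tracks, recognize_clip):
    # One fragment list per selected track; tracks are never mixed during
    # recognition, which is the point of the per-track scheme.
    return {name: transcribe_track(clips, recognize_clip)
            for name, clips in tracks.items()}
```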
  • S102: Generate the subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  • After the text fragments corresponding to each audio track in the target audio/video file are obtained, the text fragments are merged based on the time axis to generate the subtitles of the target audio/video file.
  • In practice, each audio clip on an audio track has a start time and an end time; correspondingly, the text fragment corresponding to an audio clip also has a start display time and an end display time. Specifically, the start time of the audio clip is used as the start display time of the corresponding text fragment, and the end time of the audio clip is used as the end display time of the corresponding text fragment.
  • Since the display times of the text fragments (i.e., the periods from their start display times to their end display times) may overlap, the embodiments of the present disclosure first preprocess the text fragments before merging them, so as to facilitate the subsequent merging.
  • In an optional implementation, to facilitate the processing of the text fragments, before merging them, the text fragments corresponding to the audio tracks are first sorted based on the start display time of each text fragment. Generally, the earlier a text fragment starts to be displayed, the higher it ranks in the sorted order.
  • After sorting, the display times of adjacent text fragments may overlap. Here, "adjacent text fragments" refers to text fragments that are adjacent in the order obtained by sorting all text fragments of the selected audio track(s) by their start display times. For two such adjacent text fragments, the embodiments of the present disclosure need to determine the relationship between the end display time of the former fragment and the start display time of the latter fragment in order to perform the preprocessing. The "former text fragment" is the one of the two with the earlier start display time, and the "latter text fragment" is the one with the later start display time; that is, the former text fragment starts to be displayed before the latter one.
  • If the end display time of the former text fragment is not later than the start display time of the latter text fragment, the display times of the two fragments do not overlap. Conversely, if the end display time of the former text fragment is later than the start display time of the latter text fragment, their display times overlap; in this case, display time compression is performed on the former text fragment so that its end display time is not later than the start display time of the latter text fragment, which avoids the overlap. A sketch of this sort-and-compress step follows.
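A minimal sketch of the sort-and-compress preprocessing just described, reusing the illustrative `TextFragment` type from the sketch above; the clamping rule follows the description, while the concrete times are made up for the example.

```python
def compress_overlaps(fragments):
    """Sort fragments by start display time, then clamp any fragment whose
    end display time runs past the start display time of the next fragment."""
    ordered = sorted(fragments, key=lambda f: f.start)
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.end > nxt.start:      # display times overlap
            prev.end = nxt.start      # display time compression of the former
    return ordered

# Fragments 1 and 2 of the FIG. 3 example (the times are invented):
frag1 = TextFragment("This sentence is so long", 0.0, 3.0)
frag2 = TextFragment("One two three four five", 2.0, 4.0)
compress_overlaps([frag1, frag2])
assert frag1.end == 2.0  # fragment 1 now ends exactly when fragment 2 starts
```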
  • FIG. 3 is a schematic diagram of processing text fragments according to an embodiment of the present disclosure. As shown in FIG. 3, the selected audio tracks in the target audio/video file include track A, track B and track C, and the rectangular blocks in the row of each track (track A, track B or track C) represent the text fragments corresponding to that track, each rectangular block being one text fragment. For example, the row of track A includes four rectangular blocks, i.e., the text fragments corresponding to track A include four text fragments (including text fragments 1 and 3 shown in FIG. 3); the row of track B includes three rectangular blocks, i.e., three text fragments (including text fragment 2 shown in FIG. 3); and the row of track C includes three rectangular blocks, i.e., three text fragments (including text fragment 4 shown in FIG. 3). The fragments are sorted based on their start display times: as shown in FIG. 3, text fragment 1 has the earliest start display time, followed by text fragment 2, then text fragment 3, then text fragment 4, and so on. Based on the start display time of each text fragment, the text fragments corresponding to track A, track B and track C are sorted together.
  • For example, after this sorting, the order of text fragments 1 to 4 may be: text fragment 1, text fragment 2, text fragment 3, text fragment 4; that is, text fragment 1 is adjacent to text fragment 2, text fragment 2 is adjacent to text fragments 1 and 3, and text fragment 3 is adjacent to text fragments 2 and 4. As shown in FIG. 3, text fragment 1 and text fragment 2 are adjacent fragments, with text fragment 1 being the former fragment and text fragment 2 the latter. Similarly, text fragment 2 and text fragment 3 are also adjacent fragments, with text fragment 2 being the former and text fragment 3 the latter, and so on.
  • For adjacent text fragments after sorting, it is determined whether the end display time of the former fragment is later than the start display time of the latter fragment. As shown in FIG. 3, text fragment 1 and text fragment 2 are adjacent after sorting, and the end display time of text fragment 1 is later than the start display time of text fragment 2, so their display times overlap. The embodiment of the present disclosure therefore performs display time compression on text fragment 1, updating its end display time to the start display time of text fragment 2, so that the display times of the two fragments no longer overlap.
  • Display time compression means displaying the same text fragment within a shorter display time. For example, text fragment 1 in FIG. 3, "This sentence is so long", has to be displayed within the compressed period, i.e., within the period determined by the start display time of text fragment 1 and the start display time of text fragment 2.
  • After the above preprocessing, the preprocessed text fragments are merged based on the time axis to generate the subtitles of the target audio/video file. As shown in FIG. 3, text fragment 1 of track A, "This sentence is so long", and text fragment 2 of track B, "One two three four five", are merged to generate the final subtitles.
  • In another optional implementation, before the text fragments corresponding to the audio clips are merged, text fragments with the same start display time are identified. If fragments with the same start display time have different end display times, the fragment with the latest end display time is determined; the subtitles of the target audio/video file are generated based on that fragment, and the other fragments with the same start display time are deleted. Generating the subtitles from the fragment with the latest end display time, i.e., the fragment with the longer display time, avoids losing subtitle content as far as possible. A sketch of this deduplication step follows.
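The deduplication rule can be sketched in the same illustrative style (again reusing the `TextFragment` type from the first sketch; this is not the disclosure's code): among fragments that share a start display time, only the one with the latest end display time survives.

```python
from itertools import groupby

def drop_shorter_duplicates(fragments):
    """Keep, for each start display time, only the fragment whose end
    display time is latest; delete the other same-start fragments."""
    ordered = sorted(fragments, key=lambda f: (f.start, -f.end))
    return [next(group)  # first fragment in each group has the latest end
            for _, group in groupby(ordered, key=lambda f: f.start)]
```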
  • In the subtitle generation method provided by the embodiments of the present disclosure, when a subtitle generation trigger operation for at least one audio track in the target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain the text fragment corresponding to each audio track; then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated.
  • Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other and yields more accurate speech recognition results, thereby improving the accuracy of the subtitles generated from those results.
  • In one application scenario, after the subtitles of the target audio/video file are generated, they may be displayed based on the time axis according to a preset subtitle display mode.
  • Referring to FIG. 4, a schematic diagram of a subtitle display interface provided by an embodiment of the present disclosure, the subtitles are displayed above the audio tracks based on the time axis.
  • In addition, three areas of the subtitle display interface (area 1, area 2 and area 3 in FIG. 4) synchronously display the subtitles (for example, "Why are others reading comics"). The text of the subtitles of the target audio/video file can be displayed in a default font, color, font size, etc., to improve the display effect of the subtitles and thereby the user experience.
  • The subtitles may also be adjusted. Specifically, after an adjustment operation for the subtitles is received, the displayed subtitles are updated. For example, the adjustment operation includes an addition operation, a deletion operation or a modification operation.
  • FIG. 5 is a schematic diagram of another subtitle display interface provided by an embodiment of the present disclosure. For example, a user can click any piece of text in the displayed subtitles to trigger operations such as modifying or deleting that text, or trigger modification of the text's characteristics (for example, font, color, font size, etc.). The user can also click a blank position in the subtitle display area to trigger the display of an input box and, after entering the added subtitle content in the input box, trigger the addition operation to add subtitle content.
  • In practice, the user can revise the generated subtitles as required to obtain more accurate subtitles.
  • In another application scenario, if variable speed processing is applied to the target audio/video file, display time compression is performed on the subtitles of the target audio/video file based on the variable speed playback multiple, and the time-compressed subtitles are then displayed along with the playback of the speed-changed target audio/video file.
  • For example, if the variable speed playback multiple of the target audio/video file is 2x, the display time of the subtitles is proportionally compressed to half of the original display time, as in the sketch below.
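The proportional rule is simple enough to show directly. In this hedged sketch (operating on the same illustrative `TextFragment` objects as above), dividing both display times by the playback multiple halves every fragment's display duration at 2x speed.

```python
def compress_for_speed(fragments, speed_multiple):
    """Proportionally compress display times by the variable speed playback
    multiple: at speed_multiple == 2, durations shrink to half."""
    for f in fragments:
        f.start /= speed_multiple
        f.end /= speed_multiple
    return fragments
```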
  • Belonging to the same inventive concept as the above method embodiments, a subtitle generation apparatus provided by an embodiment of the present disclosure includes:
  • a recognition module 601 configured to, in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
  • a generation module 602 configured to generate the subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  • In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time, and the generation module 602 includes:
  • a sorting submodule configured to sort all the text fragments based on the start display time of each of the text fragments;
  • a judgment submodule configured to determine whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
  • a compression submodule configured to, when the end display time of the former text fragment is later than the start display time of the latter text fragment, perform display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
  • a generation submodule configured to merge all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
  • In an optional implementation, the apparatus further includes:
  • a determination module configured to determine, among all the text fragments, at least one text fragment with the same start display time, and to determine, from the at least one text fragment, the text fragment with the latest end display time;
  • a deletion module configured to delete, from the at least one text fragment, the text fragments other than the one with the latest end display time.
  • In an optional implementation, the apparatus further includes:
  • an update module configured to update the subtitles in response to an adjustment operation for the subtitles, where the adjustment operation includes an addition operation, a deletion operation or a modification operation.
  • In an optional implementation, the apparatus further includes:
  • a time compression module configured to perform display time compression on the subtitles of the target audio/video file based on the variable speed playback multiple of the target audio/video file.
  • The subtitle generation apparatus provided by the embodiments of the present disclosure, upon receiving a subtitle generation trigger operation for at least one audio track in a target audio/video file, performs speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain the text fragment corresponding to each audio track, and then generates the subtitles of the target audio/video file based on the text fragment corresponding to each audio track. Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other and yields more accurate speech recognition results, thereby improving the accuracy of the subtitles generated from those results.
  • In addition, an embodiment of the present disclosure further provides a subtitle generation device which, as shown in FIG. 7, may include:
  • a processor 701, a memory 702, an input apparatus 703 and an output apparatus 704. The number of processors 701 in the subtitle generation device may be one or more, and one processor is taken as an example in FIG. 7. The processor 701, the memory 702, the input apparatus 703 and the output apparatus 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 7.
  • The memory 702 can be used to store software programs and modules, and the processor 701 executes the various functional applications and data processing of the subtitle generation device by running the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • The input apparatus 703 may be used to receive input numeric or character information and to generate signal input related to user settings and function control of the subtitle generation device.
  • Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702, thereby implementing the various functions of the above subtitle generation device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Studio Circuits (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides a subtitle generation method, apparatus, device and storage medium. The method includes: when a subtitle generation trigger operation for at least one audio track in a target audio/video file is received, performing speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track; and then generating subtitles of the target audio/video file based on the text fragment corresponding to each audio track. Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other, yields more accurate speech recognition results, and thereby improves the accuracy of the subtitles generated from those results.

Description

Subtitle generation method, apparatus, device and storage medium
This application claims priority to Chinese Patent Application No. 202010719394.1, filed on July 23, 2020, the entirety of which is incorporated herein by reference as part of this application.
TECHNICAL FIELD
The present disclosure relates to the field of data processing and, in particular, to a subtitle generation method, apparatus, device and storage medium.
BACKGROUND
Generating subtitles for an audio/video file refers to performing speech recognition on the file and using the recognition result as the file's subtitles.
At present, the subtitles of an audio/video file are the recognition result obtained by performing overall speech recognition on the audio data on all audio tracks in the file. The audio data on the individual audio tracks may affect each other; for example, several audio tracks may all carry audio data in the same time period, which, from the standpoint of auditory perception, may already be hard to hear clearly. If the audio data on all audio tracks in that period is recognized as a whole, the recognition may be inaccurate, which in turn makes the subtitles generated for the audio/video file inaccurate.
Therefore, how to improve the accuracy of subtitles generated for audio/video files is a technical problem that urgently needs to be solved.
SUMMARY
To solve, or at least partially solve, the above technical problem, the present disclosure provides a subtitle generation method, apparatus, device and storage medium that can improve the accuracy of subtitles generated for audio/video files.
In a first aspect, the present disclosure provides a subtitle generation method, the method comprising:
in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, performing speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
generating subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time, and generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track includes:
sorting all the text fragments based on the start display time of each of the text fragments;
determining whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
in response to the end display time of the former text fragment being later than the start display time of the latter text fragment, performing display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
In an optional implementation, before merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file, the method further includes:
determining, among all the text fragments, at least one text fragment with the same start display time, and determining, from the at least one text fragment, the text fragment with the latest end display time;
deleting, from the at least one text fragment, the text fragments other than the one with the latest end display time.
In an optional implementation, the method further includes:
updating the subtitles in response to an adjustment operation for the subtitles, where the adjustment operation includes an addition operation, a deletion operation or a modification operation.
In an optional implementation, after generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track, the method further includes:
performing display time compression on the subtitles of the target audio/video file based on the variable speed playback multiple of the target audio/video file.
In a second aspect, the present disclosure provides a subtitle generation apparatus, the apparatus comprising:
a recognition module configured to, in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
a generation module configured to generate subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time;
the generation module includes:
a sorting submodule configured to sort all the text fragments based on the start display time of each of the text fragments;
a judgment submodule configured to determine whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
a compression submodule configured to, when the end display time of the former text fragment is later than the start display time of the latter text fragment, perform display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
a generation submodule configured to merge all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
In an optional implementation, the apparatus further includes:
a determination module configured to determine, among all the text fragments, at least one text fragment with the same start display time, and to determine, from the at least one text fragment, the text fragment with the latest end display time;
a deletion module configured to delete, from the at least one text fragment, the text fragments other than the one with the latest end display time.
In a third aspect, the present disclosure provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to implement the above method.
In a fourth aspect, the present disclosure provides a device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the above method when executing the computer program.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages:
The embodiments of the present disclosure provide a subtitle generation method: when a subtitle generation trigger operation for at least one audio track in a target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track; then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated. Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other, yields more accurate speech recognition results, and thereby improves the accuracy of the subtitles generated from those results.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or the prior art more clearly, the following briefly introduces the drawings needed in the description of the embodiments or the prior art. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a subtitle generation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a subtitle generation interface according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of processing text fragments according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a subtitle display interface according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of another subtitle display interface according to an embodiment of the present disclosure;
FIG. 6 is a structural block diagram of a subtitle generation apparatus according to an embodiment of the present disclosure;
FIG. 7 is a structural block diagram of a subtitle generation device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To make the above objects, features and advantages of the present disclosure clearer and easier to understand, the solutions of the present disclosure are further described below. It should be noted that, where there is no conflict, the embodiments of the present disclosure and the features therein may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure may also be implemented in ways other than those described here. Obviously, the embodiments in this specification are only some, not all, of the embodiments of the present disclosure.
At present, the subtitles of an audio/video file are the recognition result obtained by performing overall speech recognition on the audio data on all audio tracks in the file. However, the audio data on the individual audio tracks may affect each other, so overall speech recognition of the audio/video file may be inaccurate.
Therefore, the embodiments of the present disclosure provide a subtitle generation method that performs independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other and improves the accuracy of the speech recognition result. Based on the more accurate speech recognition result, more accurate subtitles can be generated for the audio/video file.
Specifically, in the subtitle generation method provided by the embodiments of the present disclosure, when a subtitle generation trigger operation for at least one audio track in a target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track. Then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated.
On this basis, an embodiment of the present disclosure provides a subtitle generation method. Referring to FIG. 1, a flowchart of a subtitle generation method provided by an embodiment of the present disclosure, the method includes:
S101: In response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track.
The target audio/video file in the embodiments of the present disclosure may be an audio file or a video file.
In practice, the target audio/video file usually includes multiple audio tracks. The embodiments of the present disclosure can trigger the subtitle generation operation on the target audio/video file for some or all of these audio tracks; that is, the at least one audio track may be some or all of the audio tracks included in the target audio/video file.
In an optional implementation, as shown in FIG. 2, a schematic diagram of a subtitle generation interface provided by an embodiment of the present disclosure, a user can, for example, select one or more audio tracks displayed in the interface and then click the "Generate Subtitles" button to trigger the subtitle generation operation for the selected audio track or tracks.
In the embodiments of the present disclosure, when the subtitle generation trigger operation for at least one audio track in the target audio/video file is received, the audio data on each audio track in the at least one audio track is determined, and speech recognition is then performed on the audio data on each audio track to obtain the text fragment corresponding to each audio track. The specific speech recognition method is not described in detail in the embodiments of the present disclosure.
In an optional implementation, since the audio data on an audio track usually includes multiple audio clips, speech recognition is performed on each audio clip to obtain a text fragment corresponding to that clip, and the text fragments corresponding to the audio clips belonging to the same audio track constitute the text fragments corresponding to that track. That is to say, in the present disclosure, the text fragments corresponding to an audio track include the text fragments respectively corresponding to the multiple audio clips on that track, and processing the text fragments of an audio track means processing the text fragments respectively corresponding to those clips. After speech recognition has been completed for the audio data on every selected audio track, the text fragments corresponding to each track are obtained.
S102: Generate subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
In the embodiments of the present disclosure, after the text fragments corresponding to each audio track in the target audio/video file are obtained, the text fragments are merged based on the time axis to generate the subtitles of the target audio/video file.
In practice, each audio clip on an audio track has a start time and an end time; correspondingly, the text fragment corresponding to an audio clip also has a start display time and an end display time. Specifically, the start time of the audio clip is used as the start display time of the corresponding text fragment, and the end time of the audio clip is used as the end display time of the corresponding text fragment.
Since the display times of the text fragments (i.e., the periods from their start display times to their end display times) may overlap, the embodiments of the present disclosure first preprocess the text fragments before merging them, so as to facilitate the subsequent merging.
In an optional implementation, to facilitate the processing of the text fragments, before merging them, the text fragments corresponding to the audio tracks are first sorted based on the start display time of each text fragment. Generally, the earlier a text fragment starts to be displayed, the higher it ranks in the sorted order.
After sorting, the display times of adjacent text fragments may overlap. It should be noted that "adjacent text fragments" here refers to text fragments that are adjacent in the order obtained by sorting all text fragments of the selected audio track(s) by their start display times. For two such adjacent text fragments, the embodiments of the present disclosure need to determine the relationship between the end display time of the former fragment and the start display time of the latter fragment in order to perform the preprocessing. Here, the "former text fragment" refers to the one of the two fragments with the earlier start display time, and the "latter text fragment" refers to the one with the later start display time; that is, the former text fragment starts to be displayed before the latter one.
For example, if the end display time of the former text fragment is not later than the start display time of the latter text fragment, the display times of the two fragments do not overlap. Conversely, if the end display time of the former text fragment is later than the start display time of the latter text fragment, their display times overlap; in this case, display time compression needs to be performed on the former text fragment so that its end display time is not later than the start display time of the latter text fragment, thereby avoiding the overlap.
FIG. 3 is a schematic diagram of processing text fragments according to an embodiment of the present disclosure. As shown in FIG. 3, the selected audio tracks in the target audio/video file include track A, track B and track C, and the rectangular blocks in the row of each track (track A, track B or track C) represent the text fragments corresponding to that track, each rectangular block being one text fragment. For example, the row of track A includes four rectangular blocks, i.e., the text fragments corresponding to track A include four text fragments (including text fragments 1 and 3 shown in FIG. 3); the row of track B includes three rectangular blocks, i.e., three text fragments (including text fragment 2 shown in FIG. 3); and the row of track C includes three rectangular blocks, i.e., three text fragments (including text fragment 4 shown in FIG. 3). The fragments are sorted based on their start display times. As shown in FIG. 3, text fragment 1 has the earliest start display time, followed by text fragment 2, then text fragment 3, then text fragment 4, and so on; based on the start display time of each text fragment, the text fragments corresponding to track A, track B and track C are sorted together.
For example, after the above sorting, the order of text fragments 1 to 4 may be: text fragment 1, text fragment 2, text fragment 3, text fragment 4; that is, text fragment 1 is adjacent to text fragment 2, text fragment 2 is adjacent to text fragments 1 and 3, and text fragment 3 is adjacent to text fragments 2 and 4. For example, as shown in FIG. 3, text fragment 1 and text fragment 2 are adjacent fragments, with text fragment 1 being the former fragment and text fragment 2 the latter. Similarly, text fragment 2 and text fragment 3 are also adjacent fragments, with text fragment 2 being the former and text fragment 3 the latter, and so on.
For adjacent text fragments after sorting, it is determined whether the end display time of the former fragment is later than the start display time of the latter fragment. As shown in FIG. 3, text fragment 1 and text fragment 2 are adjacent after sorting, and the end display time of text fragment 1 is clearly later than the start display time of text fragment 2, so their display times overlap. The embodiment of the present disclosure therefore performs display time compression on text fragment 1, updating its end display time to the start display time of text fragment 2, so that the display times of the two fragments no longer overlap. Here, display time compression means displaying the same text fragment within a shorter display time. For example, text fragment 1 in FIG. 3, "这句话这么长" ("this sentence is so long"), needs to be displayed within the compressed period, i.e., within the period determined by the start display time of text fragment 1 and the start display time of text fragment 2.
In the embodiments of the present disclosure, after the above preprocessing of the text fragments, the preprocessed fragments are merged based on the time axis to generate the subtitles of the target audio/video file. As shown in FIG. 3, text fragment 1 of track A, "这句话这么长", and text fragment 2 of track B, "一二三四五" ("one two three four five"), are merged to generate the final subtitles. A sketch of this merge follows.
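To make the time-axis merge concrete, the following self-contained Python sketch (illustrative only; the `TextFragment` shape, the times, and the SRT-style output format are assumptions, not part of the disclosure) runs the preprocessing described above on fragments from two tracks and prints the merged subtitles:

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class TextFragment:
    text: str
    start: float  # start display time in seconds
    end: float    # end display time in seconds

def merge_tracks(tracks):
    """Merge per-track fragments into one subtitle list on a shared time axis."""
    fragments = sorted((f for track in tracks for f in track),
                       key=lambda f: (f.start, -f.end))
    # Among fragments sharing a start display time, keep the latest-ending one.
    ordered = [next(g) for _, g in groupby(fragments, key=lambda f: f.start)]
    # Compress overlapping display times of adjacent fragments.
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.end > nxt.start:
            prev.end = nxt.start
    return ordered

def to_srt(fragments):
    def stamp(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    return "\n".join(f"{i}\n{stamp(f.start)} --> {stamp(f.end)}\n{f.text}\n"
                     for i, f in enumerate(fragments, 1))

# Fragments 1 and 2 of the FIG. 3 example (times invented for illustration):
track_a = [TextFragment("这句话这么长", 0.0, 3.0)]
track_b = [TextFragment("一二三四五", 2.0, 4.0)]
print(to_srt(merge_tracks([track_a, track_b])))
# Fragment 1 is compressed to end at 2.0 s, exactly when fragment 2 begins.
```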
In another optional implementation, before the text fragments corresponding to the audio clips are merged, text fragments with the same start display time are identified. If fragments with the same start display time have different end display times, the fragment with the latest end display time is determined; the subtitles of the target audio/video file are generated based on that fragment, and the other fragments with the same start display time are deleted. The embodiments of the present disclosure generate subtitles based on the fragment with the latest end display time among fragments sharing a start display time, i.e., based on the fragment with the longer display time, which avoids losing subtitle content as far as possible.
In another optional implementation, after the fragments other than the one with the latest end display time are deleted from the fragments sharing a start display time, the remaining fragments continue to be processed by the step of sorting the text fragments corresponding to each audio track based on the start display time of each text fragment; after the above preprocessing of the fragments, the subtitles of the target audio/video file are generated.
It can be understood that if only one audio track in the target audio/video file is selected for generating its subtitles, the text fragments have no overlapping display times and do not need to be merged; the text fragments corresponding to that audio track can be used directly as the subtitles of the target audio/video file.
In the subtitle generation method provided by the embodiments of the present disclosure, when a subtitle generation trigger operation for at least one audio track in the target audio/video file is received, speech recognition is performed on the audio data on each audio track in the at least one audio track, respectively, to obtain the text fragment corresponding to each audio track. Then, based on the text fragment corresponding to each audio track, the subtitles of the target audio/video file are generated. Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other, yields more accurate speech recognition results, and thereby improves the accuracy of the subtitles generated from those results.
In one application scenario, after the subtitles of the target audio/video file are generated, they may be displayed based on the time axis according to a preset subtitle display mode. Referring to FIG. 4, a schematic diagram of a subtitle display interface provided by an embodiment of the present disclosure, the subtitles are displayed above the audio tracks based on the time axis. In addition, three areas of the subtitle display interface (area 1, area 2 and area 3 in FIG. 4) synchronously display the subtitles (for example, "为什么别人在看漫画" ("why are others reading comics")). The text of the subtitles of the target audio/video file can be displayed in a default font, color, font size, etc., to improve the display effect of the subtitles and thereby the user experience.
In addition, in the embodiments of the present disclosure, the subtitles may also be adjusted. Specifically, after an adjustment operation for the subtitles is received, the displayed subtitles are updated. For example, the adjustment operation includes an addition operation, a deletion operation or a modification operation.
Referring to FIG. 5, a schematic diagram of another subtitle display interface provided by an embodiment of the present disclosure: for example, a user can click any piece of text in the displayed subtitles to trigger operations such as modifying or deleting that text, or trigger modification of the text's characteristics (for example, font, color, font size, etc.). The user can also click a blank position in the subtitle display area to trigger the display of an input box and, after entering the added subtitle content in the input box, trigger the addition operation to add subtitle content.
In practice, the user can revise the generated subtitles as required to obtain more accurate subtitles.
In another application scenario, if variable speed processing is applied to the target audio/video file, display time compression is performed on the subtitles of the target audio/video file based on the variable speed playback multiple, and the time-compressed subtitles are then displayed along with the playback of the speed-changed target audio/video file.
For example, if the variable speed playback multiple of the target audio/video file is 2x, the display time of the subtitles of the target audio/video file is proportionally compressed to half of the original display time.
Belonging to the same inventive concept as the above method embodiments, the present disclosure further provides a subtitle generation apparatus. Referring to FIG. 6, a subtitle generation apparatus provided by an embodiment of the present disclosure includes:
a recognition module 601 configured to, in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
a generation module 602 configured to generate the subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
In an optional implementation, all text fragments corresponding to the at least one audio track have a start display time and an end display time;
the generation module 602 includes:
a sorting submodule configured to sort all the text fragments based on the start display time of each of the text fragments;
a judgment submodule configured to determine whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
a compression submodule configured to, when the end display time of the former text fragment is later than the start display time of the latter text fragment, perform display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
a generation submodule configured to merge all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
In an optional implementation, the apparatus further includes:
a determination module configured to determine, among all the text fragments, at least one text fragment with the same start display time, and to determine, from the at least one text fragment, the text fragment with the latest end display time;
a deletion module configured to delete, from the at least one text fragment, the text fragments other than the one with the latest end display time.
In an optional implementation, the apparatus further includes:
an update module configured to update the subtitles in response to an adjustment operation for the subtitles, where the adjustment operation includes an addition operation, a deletion operation or a modification operation.
In an optional implementation, the apparatus further includes:
a time compression module configured to perform display time compression on the subtitles of the target audio/video file based on the variable speed playback multiple of the target audio/video file.
The subtitle generation apparatus provided by the embodiments of the present disclosure, upon receiving a subtitle generation trigger operation for at least one audio track in a target audio/video file, performs speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain the text fragment corresponding to each audio track, and then generates the subtitles of the target audio/video file based on the text fragment corresponding to each audio track. Compared with performing overall speech recognition on the audio data of all audio tracks, the embodiments of the present disclosure perform independent speech recognition on the audio data of each audio track, which avoids the influence of the audio tracks on each other, yields more accurate speech recognition results, and thereby improves the accuracy of the subtitles generated from those results.
In addition, an embodiment of the present disclosure further provides a subtitle generation device which, as shown in FIG. 7, may include:
a processor 701, a memory 702, an input apparatus 703 and an output apparatus 704. The number of processors 701 in the subtitle generation device may be one or more, and one processor is taken as an example in FIG. 7. In some embodiments of the present invention, the processor 701, the memory 702, the input apparatus 703 and the output apparatus 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 7.
The memory 702 can be used to store software programs and modules, and the processor 701 executes the various functional applications and data processing of the subtitle generation device by running the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The input apparatus 703 may be used to receive input numeric or character information and to generate signal input related to user settings and function control of the subtitle generation device.
Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702, thereby implementing the various functions of the above subtitle generation device.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The above is only a specific implementation of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure will not be limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A subtitle generation method, characterized in that the method comprises:
    in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, performing speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
    generating subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  2. The method according to claim 1, characterized in that
    all text fragments corresponding to the at least one audio track have a start display time and an end display time,
    and generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track comprises:
    sorting all the text fragments based on the start display time of each of the text fragments;
    determining whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
    in response to the end display time of the former text fragment being later than the start display time of the latter text fragment, performing display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
    merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
  3. The method according to claim 2, characterized in that, before merging all the text fragments based on the time axis to generate the subtitles of the target audio/video file, the method further comprises:
    determining, among all the text fragments, at least one text fragment with the same start display time, and determining, from the at least one text fragment, the text fragment with the latest end display time;
    deleting, from the at least one text fragment, the text fragments other than the one with the latest end display time.
  4. The method according to any one of claims 1-3, characterized in that the method further comprises:
    updating the subtitles in response to an adjustment operation for the subtitles, wherein the adjustment operation comprises an addition operation, a deletion operation or a modification operation.
  5. The method according to any one of claims 1-4, characterized in that, after generating the subtitles of the target audio/video file based on the text fragment corresponding to each audio track, the method further comprises:
    performing display time compression on the subtitles of the target audio/video file based on the variable speed playback multiple of the target audio/video file.
  6. A subtitle generation apparatus, characterized in that the apparatus comprises:
    a recognition module configured to, in response to a subtitle generation trigger operation for at least one audio track in a target audio/video file, perform speech recognition on the audio data on each audio track in the at least one audio track, respectively, to obtain a text fragment corresponding to each audio track;
    a generation module configured to generate subtitles of the target audio/video file based on the text fragment corresponding to each audio track.
  7. The apparatus according to claim 6, characterized in that all text fragments corresponding to the at least one audio track have a start display time and an end display time;
    the generation module comprises:
    a sorting submodule configured to sort all the text fragments based on the start display time of each of the text fragments;
    a judgment submodule configured to determine whether, among adjacent text fragments after sorting, the end display time of the former text fragment is later than the start display time of the latter text fragment;
    a compression submodule configured to, when the end display time of the former text fragment is later than the start display time of the latter text fragment, perform display time compression on the former text fragment so that its end display time is not later than the start display time of the latter text fragment;
    a generation submodule configured to merge all the text fragments based on the time axis to generate the subtitles of the target audio/video file.
  8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a determination module configured to determine, among all the text fragments, at least one text fragment with the same start display time, and to determine, from the at least one text fragment, the text fragment with the latest end display time;
    a deletion module configured to delete, from the at least one text fragment, the text fragments other than the one with the latest end display time.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when run on a terminal device, cause the terminal device to implement the method according to any one of claims 1-5.
  10. A device, characterized in that it comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1-5 when executing the computer program.
PCT/CN2021/107845 2020-07-23 2021-07-22 Subtitle generation method, apparatus, device and storage medium WO2022017459A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21845741.4A EP4171018A4 (en) 2020-07-23 2021-07-22 SUBTITLE GENERATION METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
US18/087,631 US11837234B2 (en) 2020-07-23 2022-12-22 Subtitle generation method and apparatus, and device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010719394.1A 2020-07-23 2020-07-23 Subtitle generation method, apparatus, device and storage medium
CN202010719394.1 2020-07-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/087,631 Continuation US11837234B2 (en) 2020-07-23 2022-12-22 Subtitle generation method and apparatus, and device and storage medium

Publications (1)

Publication Number Publication Date
WO2022017459A1 true WO2022017459A1 (zh) 2022-01-27

Family

ID=73189315

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107845 WO2022017459A1 (zh) 2020-07-23 2021-07-22 Subtitle generation method, apparatus, device and storage medium

Country Status (4)

Country Link
US (1) US11837234B2 (zh)
EP (1) EP4171018A4 (zh)
CN (1) CN111901538B (zh)
WO (1) WO2022017459A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111901538B (zh) 2020-07-23 2023-02-17 北京字节跳动网络技术有限公司 Subtitle generation method, apparatus, device and storage medium
CN112738563A (zh) 2020-12-28 2021-04-30 深圳万兴软件有限公司 Method and apparatus for automatically adding subtitle fragments, and computer device
CN113259776B (zh) 2021-04-14 2022-11-22 北京达佳互联信息技术有限公司 Method and apparatus for binding subtitles to audio sources
CN114363691A (zh) 2021-04-22 2022-04-15 南京亿铭科技有限公司 Speech subtitle synthesis method and apparatus, computer device and storage medium
CN113422996B (zh) 2021-05-10 2023-01-20 北京达佳互联信息技术有限公司 Subtitle information editing method, apparatus and storage medium
CN114501159B (zh) 2022-01-24 2023-12-22 传神联合(北京)信息技术有限公司 Subtitle editing method and apparatus, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102348071A (zh) * 2011-06-02 2012-02-08 上海东方传媒集团有限公司 Method and system for producing subtitles for a program
CN105338394A (zh) * 2014-06-19 2016-02-17 阿里巴巴集团控股有限公司 Subtitle data processing method and system
US20180267772A1 (en) * 2017-03-20 2018-09-20 Chung Shan Lee Electronic device and processing method for instantly editing multiple tracks
CN108924583A (zh) * 2018-07-19 2018-11-30 腾讯科技(深圳)有限公司 Video file generation method, and device, system and storage medium thereof
CN111901538A (zh) * 2020-07-23 2020-11-06 北京字节跳动网络技术有限公司 Subtitle generation method, apparatus, device and storage medium

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139031B1 (en) * 1997-10-21 2006-11-21 Principle Solutions, Inc. Automated language filter for TV receiver
US20030065503A1 (en) * 2001-09-28 2003-04-03 Philips Electronics North America Corp. Multi-lingual transcription system
US7519274B2 (en) * 2003-12-08 2009-04-14 Divx, Inc. File format for multiple track digital data
US8131545B1 (en) * 2008-09-25 2012-03-06 Google Inc. Aligning a transcript to audio data
US20110020774A1 (en) * 2009-07-24 2011-01-27 Echostar Technologies L.L.C. Systems and methods for facilitating foreign language instruction
CN104575547B (zh) * 2013-10-17 2017-12-22 深圳市云帆世纪科技有限公司 多媒体文件制作方法、播放方法及系统
CN103761985B (zh) * 2014-01-24 2016-04-06 北京华科飞扬科技股份公司 一种多通道视音频在线式演播编辑系统
US10506295B2 (en) * 2014-10-09 2019-12-10 Disney Enterprises, Inc. Systems and methods for delivering secondary content to viewers
CN105704538A (zh) * 2016-03-17 2016-06-22 广东小天才科技有限公司 一种音视频字幕生成方法及系统
US20180018961A1 (en) * 2016-07-13 2018-01-18 Google Inc. Audio slicer and transcription generator
US20180211556A1 (en) * 2017-01-23 2018-07-26 Rovi Guides, Inc. Systems and methods for adjusting display lengths of subtitles based on a user's reading speed
US10580457B2 (en) * 2017-06-13 2020-03-03 3Play Media, Inc. Efficient audio description systems and methods
US20200126559A1 (en) * 2018-10-19 2020-04-23 Reduct, Inc. Creating multi-media from transcript-aligned media recordings
US11347379B1 (en) * 2019-04-22 2022-05-31 Audible, Inc. Captions for audio content
US11211053B2 (en) * 2019-05-23 2021-12-28 International Business Machines Corporation Systems and methods for automated generation of subtitles
US20210064327A1 (en) * 2019-08-26 2021-03-04 Abigail Ispahani Audio highlighter
US11183194B2 (en) * 2019-09-13 2021-11-23 International Business Machines Corporation Detecting and recovering out-of-vocabulary words in voice-to-text transcription systems
US11301644B2 (en) * 2019-12-03 2022-04-12 Trint Limited Generating and editing media
US11070891B1 (en) * 2019-12-10 2021-07-20 Amazon Technologies, Inc. Optimization of subtitles for video content
US11562743B2 (en) * 2020-01-29 2023-01-24 Salesforce.Com, Inc. Analysis of an automatically generated transcription
US11334622B1 (en) * 2020-04-01 2022-05-17 Raymond James Buckley Apparatus and methods for logging, organizing, transcribing, and subtitling audio and video content


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4171018A4 *

Also Published As

Publication number Publication date
US11837234B2 (en) 2023-12-05
US20230128946A1 (en) 2023-04-27
CN111901538A (zh) 2020-11-06
EP4171018A4 (en) 2024-01-03
EP4171018A1 (en) 2023-04-26
CN111901538B (zh) 2023-02-17

Similar Documents

Publication Publication Date Title
WO2022017459A1 (zh) Subtitle generation method, apparatus, device and storage medium
US8145999B1 System and method for audio creation and editing in a multimedia messaging environment
US20190318764A1 Apparatus, method, and program for creating a video work
US20240089586A1 Image processing method, apparatus, device and storage medium
US20230185438A1 Interaction method of audio-video segmentation, apparatus, device and storage medium
WO2023185432A1 (zh) Video processing method, apparatus, device and storage medium
CN112732650A (zh) File sharding method and apparatus
CN109299352B (zh) Method and apparatus for updating website data in a search engine, and search engine
CN108959527B (zh) Method for reading and displaying interlocking logs based on Windows file mapping technology
CN110874216B (zh) Complete code generation method, apparatus, device and storage medium
US9507794B2 Method and apparatus for distributed processing of file
CN110852045A (zh) Method and apparatus for deleting document content, electronic device and storage medium
US20240127860A1 Audio/video processing method and apparatus, device, and storage medium
CN117369731B (zh) Data reduction processing method, apparatus, device and medium
WO2024094086A1 (zh) Image processing method, apparatus, device, medium and product
WO2023232073A1 (zh) Subtitle generation method and apparatus, electronic device, storage medium and program
US20220147524A1 Method for automatically generating news events of a certain topic and electronic device applying the same
WO2022148163A1 (zh) Method, apparatus, device and storage medium for locating a music fragment
WO2023104079A1 (zh) Template updating method, apparatus, device and storage medium
JPH05204724A (ja) Database storage control method
CN114885188A (zh) Video processing method, apparatus, device and storage medium
JP3293544B2 (ja) Sorting method using an auxiliary storage device
CN115470386A (zh) Data storage and data retrieval method, apparatus and electronic device
CN117215775A (zh) File scanning method and apparatus, computer device and storage medium
JP2650803B2 (ja) Full-screen editor control processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21845741

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021845741

Country of ref document: EP

Effective date: 20230119

NENP Non-entry into the national phase

Ref country code: DE