WO2023232073A1 - Subtitle generation method, apparatus, electronic device, storage medium and program - Google Patents

Subtitle generation method, apparatus, electronic device, storage medium and program

Info

Publication number
WO2023232073A1
WO2023232073A1 (application PCT/CN2023/097415, CN2023097415W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
segments
audio
subtitle
data
Prior art date
Application number
PCT/CN2023/097415
Other languages
English (en)
French (fr)
Inventor
郑鑫
邓乐来
陈柯宇
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2023232073A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8547 Content authoring involving timestamps for synchronizing content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/278 Subtitling

Definitions

  • the present disclosure relates to the field of multimedia technology, and in particular, to a subtitle generation method, device, electronic device, storage medium and program.
  • Subtitles refer to text content that is generated from the dialogue, explanatory information and other information in a video and is displayed in the video frame images. Because subtitles help users understand video content, generating subtitles for videos is extremely important.
  • the way to generate subtitles for a video is usually to extract the audio from the video after the video has been produced, perform speech recognition on the extracted audio to obtain the text corresponding to the audio, then perform punctuation recovery on the text to obtain text fragments, and display each text fragment in the corresponding video frame images according to the time corresponding to that text fragment.
  • an embodiment of the present disclosure provides a subtitle generation method, including:
  • segmenting the text data, according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to each character, to obtain multiple text segments; the audio segments corresponding to the characters in a text segment belong to the same pronunciation object, and the duration of the blank segments in the audio segments corresponding to the text segment is less than a preset duration;
  • merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the characters, to obtain multiple merged segments that are semantically coherent and satisfy a preset single-subtitle sentence length requirement; and generating, based on the multiple merged segments, subtitle data corresponding to the video to be processed.
  • merging the multiple text segments based on the semantics of each text segment and the timestamp information of the audio segments corresponding to the text segments includes: merging the multiple text segments according to at least one of whether adjacent text segments satisfy the preset single-subtitle sentence length requirement after merging, whether the semantics respectively corresponding to adjacent text segments remain coherent after merging, and the pause duration between adjacent text segments.
  • merging the multiple text segments based on the semantics of each text segment and the timestamp information of the audio segments corresponding to the text segments includes: determining whether adjacent text segments satisfy a merging condition according to whether they satisfy the preset single-subtitle sentence length requirement after merging; determining whether adjacent text segments satisfy a merging condition according to whether their respective semantics remain coherent after merging; and, for each text segment, when both the preceding and following adjacent text segments satisfy the merging conditions, merging the text segment with the adjacent text segment whose corresponding audio segments have the shorter pause duration.
  • the preset sentence length requirement for a single subtitle includes at least one of: a characters-per-second (CPS) requirement or a maximum display duration requirement for a single subtitle in the video.
  • segmenting the text data into multiple text segments based on the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to each character includes: inputting the text data into a text processing module and obtaining the multiple text segments output by the text processing module.
  • the text processing module includes: a sub-module that performs segmentation based on the multiple segmentation positions, a sub-module that performs text segmentation based on the pronunciation object information of the audio segments corresponding to each of the characters, and a sub-module that performs text segmentation based on the timestamp information of the audio segments corresponding to each of the characters.
  • the text processing module includes: a first segmentation module that segments the text data based on punctuation analysis, a second segmentation module that segments the text data based on grammatical characteristics, a third segmentation module that performs segmentation based on pronunciation object information corresponding to the audio data, and a fourth segmentation module that performs segmentation based on the timestamp information of the audio segments corresponding to each character in the text data.
  • the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a serial manner; the input of the third segmentation module includes the output of the second segmentation module and the audio data, and the input of the fourth segmentation module includes the output of the third segmentation module and the timestamp information of the audio segments corresponding to each of the characters.
  • the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a parallel manner; the inputs of the first segmentation module and the second segmentation module include the text data, the input of the third segmentation module includes the text data and the audio data, and the input of the fourth segmentation module includes the timestamp information of the audio segments corresponding to each character in the text data as well as the text data.
  • the subtitle data is a text format subtitle (SubRip Text, SRT) file.
  • the method further includes: fusing the subtitle data with the video to be processed to obtain a target video with subtitles.
  • an embodiment of the present disclosure provides a subtitle generation device, including:
  • An audio processing module used to extract audio data in the video to be processed, perform speech recognition on the audio data, and obtain text data corresponding to the audio data;
  • An acquisition module configured to acquire multiple segmentation positions determined based on syntax analysis of the text data and pronunciation object information and timestamp information of the audio segments corresponding to each character included in the text data;
  • a text segmentation module configured to segment the text data according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to each character, to obtain multiple text segments; the audio segments corresponding to the characters in each text segment belong to the same pronunciation object, and the duration of the blank segments in the audio segments corresponding to the text segment is less than the preset duration;
  • a merging module configured to merge the multiple text segments according to the semantics of each of the text segments and the timestamp information of the audio segments corresponding to each of the characters, to obtain multiple merged segments that are semantically coherent and satisfy the preset single-subtitle sentence length requirement;
  • a generation module configured to generate subtitle data corresponding to the video to be processed based on the multiple merged segments.
  • an embodiment of the present disclosure also provides an electronic device, including:
  • a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the subtitle generation method according to the first aspect or any implementation of the first aspect.
  • embodiments of the present disclosure further provide a readable storage medium having a computer program stored thereon; when the program is executed by a processor, the subtitle generation method according to the first aspect or any implementation of the first aspect is implemented.
  • embodiments of the present disclosure further provide a computer program product; when an electronic device executes the computer program product, the electronic device implements the subtitle generation method according to the first aspect or any implementation of the first aspect.
  • an embodiment of the present disclosure further provides a computer program, including instructions that, when executed by a processor, cause the processor to perform the subtitle generation method according to the first aspect or any implementation of the first aspect.
  • Figure 1 is a flow chart of a subtitle generation method according to an embodiment of the present disclosure
  • Figure 2 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure
  • Figure 3 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of a subtitle generation method provided by another embodiment of the present disclosure.
  • Figure 5 is a flow chart of a subtitle generation method provided by another embodiment of the present disclosure.
  • Figure 6 is a schematic structural diagram of a subtitle generation device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • generating subtitles for videos usually includes the following process: extracting audio from the video, performing speech recognition on the audio data, obtaining text data corresponding to the audio data, and then performing punctuation recovery on the text data to obtain the segmented text fragments; then generate subtitle data based on the time of the video fragments corresponding to these text fragments, and fuse the subtitle data with the video to obtain a video with subtitles.
  • because this method relies on the result of punctuation recovery when fragmenting the text data, the sentence length of a single subtitle cannot be well controlled, which affects the layout of the subtitles and their display duration in the video, degrades the subjective experience of the subtitles, and prevents them from providing a good aid to comprehension.
  • for example, when a single subtitle has a long sentence length, that is, when a single subtitle contains a large number of characters, the limited display screen size of the electronic device means the subtitle must be wrapped, that is, displayed on multiple lines; as the number of lines occupied by the subtitle grows, the area blocked by the subtitle expands, which may cover more of the video frame and affect the user's viewing of the video content.
  • in addition, the longer the sentence length of a single subtitle, the longer its display duration in the video, which also affects the user's viewing of the video content.
  • as another example, some short sentences are spoken quickly, so a single subtitle has a short sentence length, that is, it contains a small number of characters, but the pronunciation duration of each character is short; the subtitle is therefore displayed in the video only briefly, the user may not have time to read its content in detail, and the purpose of aiding comprehension cannot be achieved.
  • the same text with different pause lengths may express different semantics.
  • the subtitles obtained through punctuation recovery may not accurately express the semantics of the same text at different audio positions.
  • the present disclosure provides a subtitle generation method: audio data in the video to be processed is extracted and speech recognition is performed on it to obtain corresponding text data; multiple segmentation positions determined for the text data based on syntactic analysis are obtained, together with the pronunciation object information and timestamp information of the audio segments corresponding to each character included in the text data; based on the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to each character, the text data is segmented into multiple text fragments that meet the requirements; the multiple text fragments are then merged based on the semantics of each text fragment and the timestamp information of the audio segments corresponding to each character, to obtain multiple merged fragments that are semantically coherent and satisfy the preset single-subtitle sentence length requirement; and subtitle data corresponding to the video to be processed is generated based on the multiple merged fragments.
  • the disclosed method can better control the sentence length of a single subtitle and the display duration of a single subtitle in the video, greatly improving the auxiliary understanding effect of the subtitles.
  • in addition, during merging and segmentation the blank duration between the audio segments corresponding to the characters is fully considered, so that identical speech content expressing different meanings is segmented and merged in different ways; this method can therefore also effectively reduce the occurrence of ambiguity.
  • the subtitle generation method provided in this embodiment can be executed by an electronic device.
  • Electronic devices can be tablets, mobile phones (such as folding-screen phones, large-screen phones, etc.), wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, laptops, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), smart TVs, smart screens, HD TVs, 4K TVs, smart speakers, smart projectors and other Internet of Things (IoT) devices.
  • this disclosure does not place any restrictions on the specific type of electronic device, nor does it limit the type of operating system of the electronic device, for example, the Android system, Linux system, Windows system, iOS system, etc.
  • the present disclosure takes an electronic device as an example, and elaborates on the subtitle generation method provided by the present disclosure in conjunction with the accompanying drawings and application scenarios.
  • Figure 1 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure. Referring to Figure 1, the method in this embodiment includes the following steps.
  • the video to be processed is a video to which subtitles are to be added.
  • the electronic device can obtain the video to be processed.
  • the video to be processed can be recorded by the user through the electronic device, downloaded from the Internet, or produced by the user with video-processing software.
  • This disclosure does not limit the implementation method of obtaining the video to be processed.
  • this disclosure does not limit the video content, duration, storage format, definition and other parameters in the video to be processed.
  • the electronic device is capable of extracting audio data from the video to be processed and converting the audio data into text data.
  • the electronic device can convert audio data into text data through a speech recognition model.
  • the speech recognition model can be, for example, a deep neural network model or a convolutional neural network model.
  • the electronic device can also use other existing speech recognition tools or methods to convert audio data into text data. This disclosure does not limit the implementation of speech recognition by electronic devices.
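  • Purely as an illustration of this step (not part of the disclosure), audio extraction and recognition might be sketched as follows; the ffmpeg command-line tool is assumed to be installed, and the `asr_model.transcribe` interface is a hypothetical placeholder for whichever speech recognition model or tool is actually used.

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> str:
    """Extract a mono WAV track from the video using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", str(sample_rate), wav_path],
        check=True,
    )
    return wav_path

def transcribe(wav_path: str, asr_model) -> str:
    """asr_model is any recognizer exposing a transcribe() method that
    returns plain text (ideally with per-character timing); hypothetical."""
    return asr_model.transcribe(wav_path)
```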
  • the text data may include a continuous sequence of characters.
  • the text data includes "Today I am very happy that my parents and I went to the amusement park", which does not contain punctuation marks.
  • the audio data can correspond to one or more language types
  • the generated text data can also include characters corresponding to one or more language types.
  • for example, the speech recognition result for an audio clip may be "哈喽" (Chinese) or it may be "hello" (English); since Chinese characters account for a relatively high proportion of the whole text data, the former can be chosen if the goal is to improve the consistency of language types in the subtitles, and the latter can be chosen if the goal is to make the subtitles more interesting.
  • the electronic device can analyze the text data through the syntax analysis model and obtain multiple segmentation positions.
  • grammatical analysis can include: punctuation position analysis, grammatical feature analysis, etc. Through grammatical analysis, multiple clause positions can be obtained, and the clause position is the segmentation position.
  • the electronic device can identify the audio segments corresponding to different pronunciation objects by performing pronunciation object recognition on the audio data, and then, combining the correspondence between the audio segments of the different pronunciation objects and the text data, obtain the pronunciation object information of the audio segment corresponding to each character.
  • the electronic device can segment the audio data to obtain the timestamp information of the audio segment corresponding to each character.
  • the timestamp information can include the start time and the end time.
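  • For concreteness only, the per-character information described above could be represented as in the following sketch; the structure and field names are illustrative assumptions, not terminology from this disclosure.

```python
from dataclasses import dataclass

@dataclass
class CharToken:
    char: str      # one character of the recognized text
    start: float   # start time of the corresponding audio segment, in seconds
    end: float     # end time of the corresponding audio segment, in seconds
    speaker: str   # pronunciation object (speaker) identifier

# Example: the first characters of "今天我很高兴" spoken by speaker "S1"
tokens = [
    CharToken("今", 0.00, 0.21, "S1"),
    CharToken("天", 0.21, 0.40, "S1"),
    CharToken("我", 0.55, 0.72, "S1"),
]
```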
  • in each segmented text segment, the audio segments corresponding to the characters belong to the same pronunciation object, and the duration of the blank segments in the audio segments corresponding to the text segment is less than the preset duration.
  • Segmenting text data into multiple text segments can be achieved through a text processing module.
  • the text processing module can include multiple sub-modules, each sub-module is used to segment the input text data according to the characteristics of one or more dimensions. After the text data is processed by multiple sub-modules, the text data can be divided into multiple first text segments.
  • the semantics of a text fragment can be obtained by semantic analysis of the fragment; based on the semantics, it can be judged whether the content expressed by adjacent text fragments is continuous and coherent, which can then serve as a basis for guiding the merging of text fragments and for avoiding merging semantically incoherent text fragments together, which would give users a poor experience.
  • the pause duration between text fragments can be obtained from the timestamp information of the audio segments corresponding to each character. Specifically, the pause duration between adjacent text fragments can be determined from the end time of the audio segment corresponding to the last character of the preceding text fragment and the start time of the audio segment corresponding to the first character of the following text fragment. During merging, there can be a preference for merging the two adjacent text fragments with the shorter pause between them: a shorter pause can indicate that the content expressed in the audio data is more continuous, so merging the fragments expresses the content of the audio data more completely, which is more conducive to user understanding.
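  • A minimal sketch of this computation, reusing the hypothetical `CharToken` structure from the earlier sketch:

```python
from typing import List

def pause_between(prev_frag: List["CharToken"], next_frag: List["CharToken"]) -> float:
    """Pause between two adjacent text fragments: start time of the first
    character of the following fragment minus end time of the last character
    of the preceding fragment."""
    return next_frag[0].start - prev_frag[-1].end
```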
  • for example, text fragment 1, text fragment 2 and text fragment 3 are three consecutive first text fragments. It is determined based on semantics that text fragment 1 and text fragment 2 can be merged and that text fragment 2 and text fragment 3 can be merged; the pause duration between text fragment 1 and text fragment 2 is t1, the pause duration between text fragment 2 and text fragment 3 is t2, and t1 is less than t2, so merging text fragment 1 and text fragment 2 is more reasonable. In addition, text fragment 1 and text fragment 2 satisfy the preset single-subtitle sentence length requirement after merging, so the merging condition is met and text fragment 1 can be merged with text fragment 2.
  • the merged fragment obtained by merging text fragment 1 and text fragment 2 may already be the merged fragment corresponding to a final single subtitle, or it may need to be merged with the adjacent text fragment 3 before the merged fragment corresponding to a single subtitle is obtained.
  • Each merged segment corresponds to one subtitle, and the multiple merged segments are converted, in order, into a subtitle file of a preset format, thereby obtaining the subtitle data corresponding to the video to be processed.
  • the subtitle data can be, but is not limited to, an SRT file.
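  • As a rough illustration of this conversion (a sketch, not the exact logic used by the disclosure), an SRT file could be assembled from merged fragments as follows; the `(text, start, end)` fragment structure is an assumption.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(merged_fragments) -> str:
    """merged_fragments: list of (text, start_seconds, end_seconds) tuples."""
    lines = []
    for i, (text, start, end) in enumerate(merged_fragments, start=1):
        lines += [str(i), f"{srt_timestamp(start)} --> {srt_timestamp(end)}", text, ""]
    return "\n".join(lines)

print(to_srt([("今天我很高兴", 0.0, 1.8), ("我和爸爸妈妈去了游乐场", 2.1, 4.3)]))
```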
  • by combining features of the text dimension and the audio dimension to segment the text data and to merge the resulting text segments, the method provided in this embodiment can better control the sentence length of a single subtitle and its display duration in the video without affecting semantic understanding, greatly improving the subtitles' ability to aid comprehension; in addition, this method can also effectively reduce the occurrence of ambiguity.
  • the text processing module includes: a first segmentation module that segments the text data based on punctuation analysis, a second segmentation module that segments the text data based on grammatical characteristics, a third segmentation module that performs segmentation based on the pronunciation object information corresponding to the audio data, and a fourth segmentation module that performs segmentation based on the timestamp information of the audio segments corresponding to each character in the text data.
  • FIG. 2 is a schematic structural diagram of a text processing module provided by an embodiment of the present disclosure. Please refer to Figure 2.
  • the output of the first segmentation module is connected to the input of the second segmentation module, the output of the second segmentation module is connected to the input of the third segmentation module, and the output of the third segmentation module is connected to the input of the fourth segmentation module. Given this structure of the text processing module in the embodiment shown in Figure 2, the segmentation modules included in the text processing module can be understood as being connected in a serial manner.
  • the first segmentation module receives the text data as input and performs punctuation analysis on it (which can also be understood as punctuation recovery) to obtain the clause positions of multiple punctuation marks; based on these clause positions, the text data can be divided into text fragments. The text fragments output by the first segmentation module are input to the second segmentation module, which analyzes their grammatical characteristics to determine multiple segmentation positions and, based on those positions, re-segments or adjusts the text fragments from the first segmentation module to obtain multiple text fragments. The text fragments output by the second segmentation module and the audio data are input to the third segmentation module, which performs pronunciation object recognition on the audio data to determine the start and end positions of the audio segments corresponding to different pronunciation objects, determines segmentation positions in the text data based on those audio segments, and segments the text fragments again at those positions so that each resulting text fragment corresponds to a single pronunciation object.
  • the fourth segmentation module determines the pause duration between adjacent characters from the start time and end time of the audio segment corresponding to each character and, based on the comparison between the pause duration of adjacent characters and the preset duration (i.e., a duration threshold), places adjacent characters whose pause duration is less than the preset duration into one text fragment and splits adjacent characters whose pause duration is greater than or equal to the preset duration into two different text fragments.
  • the multiple text fragments output by the last sub-module (i.e., the fourth segmentation module) included in the text processing module are the final segmentation results corresponding to the text data.
  • the preset duration can be, for example, 0.4 seconds, 0.5 seconds, 0.6 seconds, etc.
  • the preset duration can be obtained through statistical analysis of the pause durations between the audio segments corresponding to the characters in a large amount of audio data.
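  • A simplified sketch of the pause-based splitting rule described for the fourth segmentation module, again assuming the hypothetical `CharToken` structure from the earlier sketch; 0.5 s is used only as an example threshold.

```python
def split_on_pauses(tokens, preset_duration: float = 0.5):
    """Start a new text fragment whenever the pause between adjacent
    characters reaches the preset duration threshold."""
    fragments, current = [], [tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        if cur.start - prev.end >= preset_duration:
            fragments.append(current)   # close the current fragment
            current = [cur]
        else:
            current.append(cur)
    fragments.append(current)
    return fragments
```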
  • each segmentation module included in the text processing module can be implemented using a corresponding machine learning model.
  • the first segmentation module can be implemented based on a pre-trained punctuation recovery model;
  • the second segmentation module can be implemented based on a pre-trained grammatical feature analysis model;
  • the third segmentation module, that is, the pronunciation object segmentation module, can be implemented based on a pre-trained audio processing model;
  • the fourth segmentation module, that is, the pause duration segmentation module, can be implemented based on a pre-trained character processing model.
  • This disclosure does not limit the type of machine learning model used by each segmentation module, model parameters, etc.
  • FIG. 3 is a schematic structural diagram of a text processing module provided by an embodiment of the present disclosure. Please refer to Figure 3.
  • Each segmentation module included in the text processing module is connected in parallel.
  • the first segmentation module and the second segmentation module receive original text data as input respectively;
  • the third segmentation module receives the audio data and the original text data as input;
  • the fourth segmentation module receives original text data as input, and each character included in the text data carries timestamp information.
  • Each segmentation module included in the text processing module determines the segmentation position based on its own input to segment the text data, and then fuses the segmentation results of the text data respectively output by each segmentation module to obtain multiple text segments.
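  • One simple way to fuse the positions proposed by the parallel modules, assumed here purely for illustration (the disclosure does not prescribe a specific fusion rule), is to take the union of the proposed cut indices:

```python
def fuse_boundaries(text: str, boundary_sets) -> list:
    """boundary_sets: iterable of sets of character indices at which each
    segmentation module proposes to cut the text. Returns text fragments."""
    cuts = sorted(set().union(*boundary_sets) | {0, len(text)})
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

# Example: one module proposes a cut at index 6, another at indices 6 and 12
print(fuse_boundaries("今天我很高兴我和爸爸妈妈去了游乐场", [{6}, {6, 12}]))
# -> ['今天我很高兴', '我和爸爸妈妈', '去了游乐场']
```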
  • the connection manner of the segmentation modules included in the text processing module is not limited to the above examples in Figure 2 and Figure 3 and can also be implemented in other ways.
  • for example, the serial connection manner and the parallel connection manner can be combined:
  • the first segmentation module and the second segmentation module are connected in a serial manner;
  • the third segmentation module and the fourth segmentation module are connected in a serial manner;
  • the first segmentation module and the second segmentation module, as one whole, are connected in parallel with the third segmentation module and the fourth segmentation module as another whole.
  • the connection order of the segmentation modules included in the text processing module can be flexibly adjusted according to the scenario. For example, in a scene with many pronunciation objects, segmentation can be performed first based on the pronunciation objects, and then based on punctuation analysis, grammatical feature analysis and the timestamp information of the audio segments corresponding to each character.
  • Figure 4 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure.
  • the embodiment shown in FIG. 4 is mainly used to exemplarily introduce how an electronic device merges text fragments.
  • when merging text fragments, the electronic device can do so by invoking the merging module.
  • the merging module includes: indicator module, semantic analysis module, pause duration comparison module and text splicing module.
  • the indicator module can judge whether two input text fragments, once merged, satisfy the preset subtitle sentence length requirement.
  • the preset subtitle sentence length requirement is mainly a requirement on how long a single subtitle remains in the video; to make it easy to determine whether a generated single subtitle satisfies the requirement, the preset subtitle sentence length requirement can be a preset maximum number of characters per second (CPS) or a preset maximum display duration of a single subtitle in the video.
  • both of the above indicators reflect well how long a single subtitle remains in the video.
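  • Purely as an illustrative sketch of such an indicator check, the two criteria could be combined into a single predicate; the limit values (16 characters per second, 5 seconds on screen) are assumptions for the example, not values given by this disclosure.

```python
def meets_length_requirement(text: str, start: float, end: float,
                             max_cps: float = 16.0, max_display_s: float = 5.0) -> bool:
    """True if the candidate merged fragment respects the assumed
    characters-per-second limit and maximum display duration."""
    duration = end - start
    if duration <= 0:
        return False
    return (len(text) / duration) <= max_cps and duration <= max_display_s
```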
  • the semantic analysis module can determine, based on the semantics corresponding to the two input text fragments, whether they can be merged, and output to the text splicing module identification information indicating whether the text fragments can be merged; for example, the semantic analysis module outputs the flag 1 to indicate that the text fragments can be merged and the flag 0 to indicate that they cannot.
  • the pause duration comparison module is used to determine the pause duration comparison results between multiple adjacent text segments based on the timestamp information of the audio segments corresponding to each character included in the text segment.
  • the text splicing module determines the merging plan by combining the results or indication information output by the aforementioned indicator module, semantic analysis module and pause duration comparison module, and splices together text fragments that satisfy the preset subtitle sentence length requirement, are semantically coherent and have short pauses between them, thereby obtaining multiple merged fragments.
  • the indicator module and the semantic analysis module can exchange data.
  • the indicator module can output its judgment results to the semantic analysis module, and the semantic analysis module can perform its judgment only for combinations of text fragments that satisfy the preset subtitle sentence length requirement, without judging whether the semantics are continuous and coherent for combinations that do not satisfy the requirement, thereby reducing the workload of the semantic analysis module and improving the efficiency of subtitle generation.
  • assume that, after the text data is segmented, N text fragments are obtained, namely text fragment 1, text fragment 2, ..., text fragment N.
  • the electronic device may determine in turn whether merging text segment 1 with text segment 2, and text segment 2 with text segment 3, satisfies the preset subtitle sentence length requirement. If it is determined based on semantic features that text segment 1 and text segment 2 can be merged and that text segment 2 and text segment 3 can also be merged, but the pause between text segment 1 and text segment 2 is shorter, then text segment 1 and text segment 2 are merged to obtain merged segment 1. After that, the electronic device can determine, according to the preset subtitle sentence length requirement and the semantics of the text segments, whether merged segment 1 and text segment 3 can be merged; if they can, merged segment 1 and text segment 3 are merged to obtain a new merged segment 1.
  • the electronic device can also determine whether text segment 3 and text segment 4 can be merged based on the preset subtitle sentence length requirement and the semantics of the text segment. If they can be merged, text segment 3 and text segment 4 are merged to obtain merged segment 2.
  • the electronic device can compare the subtitle effect of the merged segment obtained by merging the new merged segment 1 with text segment 3 against the subtitle effect of the merged segment obtained by merging text segment 3 with text segment 4, and determine the final merging plan for text segment 3.
  • determining whether the merging of two text fragments satisfies the preset subtitle sentence length requirement, determining whether the two text fragments can be merged based on their semantics, and comparing the pause durations between the audio fragments corresponding to adjacent text fragments can be executed in parallel, after which the judgment results output by the three are combined for merging.
  • the above merging can go through multiple rounds of processing. For example, if the sentence lengths of the merged fragments obtained in the first round of merging are all rather short, the merged fragments obtained in the first round can be used as input and another round of merging can be performed, so that the sentence length of a single subtitle approaches the preset subtitle sentence length requirement as closely as possible.
  • since multiple rounds of merging may be needed, merging can be performed in rounds 1 to m1 based on the preset subtitle sentence length requirement, the semantics of the text fragments and the pause durations between the audio fragments corresponding to the text fragments, and in the subsequent rounds m1+1 to M based on the preset subtitle sentence length requirement and the semantic features of the text fragments.
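  • A highly simplified, single-strategy sketch of such iterative merging is given below; the semantic judgment is represented by a placeholder callable, the length check is passed in as a function (for example the `meets_length_requirement` sketch above), and a real implementation would additionally weigh the pause durations between candidate pairs as described.

```python
def merge_fragments(fragments, can_follow, length_ok, max_rounds: int = 3):
    """fragments: list of (text, start, end) tuples.
    can_follow(text_a, text_b) -> bool: placeholder semantic-coherence check.
    length_ok(text, start, end) -> bool: single-subtitle length check."""
    for _ in range(max_rounds):
        merged, i, changed = [], 0, False
        while i < len(fragments):
            if i + 1 < len(fragments):
                (t1, s1, e1), (t2, s2, e2) = fragments[i], fragments[i + 1]
                if can_follow(t1, t2) and length_ok(t1 + t2, s1, e2):
                    merged.append((t1 + t2, s1, e2))  # greedily merge the pair
                    i += 2
                    changed = True
                    continue
            merged.append(fragments[i])
            i += 1
        fragments = merged
        if not changed:  # stop early once no further merge is possible
            break
    return fragments
```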
  • the electronic device can also combine the semantics of the text fragments and the pause duration characteristics between the audio fragments corresponding to the text fragments with the above preset subtitle sentence length requirement in different ways to obtain different merging results, that is, multiple versions of subtitle data can be obtained, and the subtitle data with the better subtitle effect can then be selected according to the subtitle effects presented by the respective versions.
  • multiple versions of subtitle data can be presented to the user, so that the user can preview the subtitle effects presented by various subtitle data respectively, and select subtitle data that meets the user's expectations as the final version of the subtitle data based on user operations.
  • multiple text fragments are merged to obtain a single subtitle with appropriate sentence length, ensuring that the single subtitle has an appropriate display duration in the video, and improving the subtitle understanding effect.
  • a single sentence with a large number of characters can be divided into multiple sentences presented as multiple single subtitles, which avoids the problems of a long single subtitle: multi-line display, confusing layout and an overly long display time.
  • Figure 5 is a flow chart of a subtitle generation method provided by another embodiment of the present disclosure. Referring to Figure 5, the method of this embodiment is based on the embodiment shown in Figure 1. After step S104, it also includes:
  • the video data of the video to be processed is the continuous video frame images in the video to be processed.
  • each single subtitle is superimposed, according to the preset subtitle display style, on the video frame images within the corresponding display time period to obtain the target video with subtitles.
  • the display time period corresponding to a single subtitle can be determined from the start time of the audio segment corresponding to its first character and the end time of the audio segment corresponding to its last character; then, based on the start time and end time corresponding to the single subtitle, the video frame images within the corresponding display time period are determined, and the single subtitle is superimposed, according to the preset display style, on all video frame images within that period; by performing the above processing for each subtitle in the subtitle data, the target video with subtitles is obtained.
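  • A small sketch of mapping a single subtitle's display period onto video frame indices; the frame rate and the use of half-open frame ranges are assumptions made only for this illustration.

```python
def frames_for_subtitle(start: float, end: float, fps: float = 25.0) -> range:
    """Indices of the video frames whose timestamps fall inside the
    subtitle's display period [start, end)."""
    first = int(start * fps)
    last = int(end * fps)
    return range(first, last)

# Example: a subtitle spanning 2.1 s to 4.3 s at 25 fps covers frames 52 through 106
frames = frames_for_subtitle(2.1, 4.3)
print(list(frames)[:3], "...", frames[-1])
```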
  • the subtitle sentence length in the target video obtained by the method provided in this embodiment is better suited to users' reading and can greatly improve the user experience.
  • embodiments of the present disclosure also provide a subtitle generation device.
  • FIG. 6 is a schematic structural diagram of a subtitle generation device according to an embodiment of the present disclosure. Please refer to Figure 6.
  • the device 600 provided in this embodiment includes:
  • the audio processing module 601 is used to extract audio data in the video to be processed, perform speech recognition on the audio data, and obtain text data corresponding to the audio data.
  • the acquisition module 602 is configured to acquire multiple segmentation positions determined based on grammatical analysis of the text data, as well as pronunciation object information and timestamp information of the audio segments corresponding to each character included in the text data.
  • the text segmentation module 603 is configured to segment the text data according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to each character, to obtain multiple text segments; the audio segments corresponding to the characters in a text segment belong to the same pronunciation object, and the duration of the blank segments in the audio segments corresponding to the text segment is less than the preset duration.
  • the merging module 604 is configured to merge the multiple text segments according to the semantics of each of the text segments and the timestamp information of the audio segments corresponding to each of the characters, to obtain multiple merged segments that are semantically coherent and satisfy the preset single-subtitle sentence length requirement.
  • the generation module 605 is configured to generate, based on the multiple merged segments, subtitle data corresponding to the video to be processed.
  • the merging module 604 is specifically configured to merge the multiple text segments based on at least one of: whether adjacent text segments satisfy the preset single-subtitle sentence length requirement after merging, whether the semantics corresponding to the adjacent text segments remain coherent after merging, and the pause duration between adjacent text segments.
  • the merging module 604 is specifically configured to determine whether adjacent text segments satisfy the merging condition according to whether they satisfy the preset single-subtitle sentence length requirement after merging; determine whether adjacent text segments satisfy the merging condition according to whether their corresponding semantics remain coherent after merging; and, for each text segment, when both the preceding and following adjacent text segments satisfy the merging conditions, merge the text segment with the adjacent text segment whose corresponding audio segments have the shorter pause duration.
  • the preset sentence length requirement for a single subtitle includes at least one of a characters-per-second (CPS) requirement or a maximum display duration requirement for a single subtitle in the video.
  • the text segmentation module 603 is specifically configured to input the text data into a text processing module and obtain the multiple text segments output by the text processing module; the text processing module includes: a sub-module that performs segmentation based on the multiple segmentation positions, a sub-module that performs text segmentation based on the pronunciation object information of the audio segments corresponding to each of the characters, and a sub-module that performs text segmentation based on the timestamp information of the audio segments corresponding to each of the characters.
  • the text processing module includes: a first segmentation module that segments the text data based on punctuation analysis, a second segmentation module that segments the text data based on grammatical characteristics, a third segmentation module that performs segmentation based on pronunciation object information corresponding to the audio data, and a fourth segmentation module that performs segmentation based on the timestamp information of the audio segments corresponding to each character in the text data.
  • the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a serial manner; the input of the third segmentation module includes the output of the second segmentation module and the audio data, and the input of the fourth segmentation module includes the output of the third segmentation module and the timestamp information of the audio segments corresponding to each of the characters.
  • the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a parallel manner; the inputs of the first segmentation module and the second segmentation module include the text data, the input of the third segmentation module includes the text data and the audio data, and the input of the fourth segmentation module includes the timestamp information of the audio segments corresponding to each character in the text data as well as the text data.
  • the subtitle data is a text format subtitle SRT file.
  • the device 600 further includes: a fusion module 606, configured to fuse the subtitle data with the video to be processed to obtain a target video with subtitles.
  • the subtitle generation device provided in this embodiment can be used to execute the technical solution of any of the foregoing method embodiments. Its implementation principles and technical effects are similar. Please refer to the detailed description of the foregoing method embodiments. For the sake of simplicity, they will not be described again here.
  • the present disclosure also provides an electronic device.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • the electronic device 700 provided in this embodiment includes: a memory 701 and a processor 702 .
  • the memory 701 may be an independent physical unit, and may be connected to the processor 702 through a bus 703 .
  • the memory 701 and the processor 702 can also be integrated together and implemented through hardware.
  • the memory 701 is used to store program instructions, and the processor 702 calls the program instructions to execute the subtitle generation method provided by any of the above method embodiments.
  • the above electronic device 700 may also include only the processor 702.
  • the memory 701 for storing programs is located outside the electronic device 700, and the processor 702 is connected to the memory through circuits/wires for reading and executing the programs stored in the memory.
  • the processor 702 may be a central processing unit (CPU), a network processor (NP), or a combination of CPU and NP.
  • the processor 702 may further include hardware chips.
  • the above-mentioned hardware chip can be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL) or any combination thereof.
  • the memory 701 may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.
  • the present disclosure also provides a readable storage medium, including: computer program instructions.
  • when the computer program instructions are executed by at least one processor of an electronic device, the electronic device implements the subtitle generation method provided by any of the above method embodiments.
  • the present disclosure also provides a computer program product.
  • when the computer program product is run on a computer, it causes the computer to implement the subtitle generation method provided by any of the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Security & Cryptography (AREA)
  • Studio Circuits (AREA)

Abstract

The present disclosure relates to a subtitle generation method, apparatus, electronic device, storage medium and program. In the method, audio is extracted from a video to be processed and speech recognition is performed on it to obtain text data corresponding to the audio data; multiple segmentation positions determined for the text data based on syntactic analysis are obtained, together with pronunciation object information and timestamp information of the audio segments corresponding to the characters included in the text data; the text data is segmented into multiple text segments based on the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters; the text segments are merged according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the characters, to obtain multiple merged segments that are semantically coherent and satisfy a preset single-subtitle sentence length requirement; and subtitle data is generated based on the merged segments.

Description

Subtitle generation method, apparatus, electronic device, storage medium and program
Cross-Reference to Related Application
This application is based on and claims priority to Chinese application No. 202210615156.5 filed on May 31, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the field of multimedia technology, and in particular to a subtitle generation method, apparatus, electronic device, storage medium and program.
Background
Subtitles are text content that is generated from the dialogue, explanatory information and other information in a video and is displayed in the video frame images. Because subtitles help users understand video content, generating subtitles for videos is extremely important.
In the related art, subtitles are usually generated for a video by, after the video has been produced, extracting the audio from the video, performing speech recognition on the extracted audio to obtain the text corresponding to the audio, then performing punctuation recovery on the text to obtain text fragments, and displaying each text fragment in the corresponding video frame images according to the time corresponding to that text fragment.
Summary
In a first aspect, an embodiment of the present disclosure provides a subtitle generation method, including:
extracting audio data from a video to be processed, and performing speech recognition on the audio data to obtain text data corresponding to the audio data;
acquiring multiple segmentation positions determined for the text data based on syntactic analysis, as well as pronunciation object information and timestamp information of the audio segments corresponding to the characters included in the text data;
segmenting the text data into multiple text segments according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters, where the audio segments respectively corresponding to the characters in a text segment belong to the same pronunciation object, and the duration of blank segments within the audio segments corresponding to the text segment is less than a preset duration;
merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the characters, to obtain multiple merged segments that are semantically coherent and satisfy a preset single-subtitle sentence length requirement;
generating, according to the multiple merged segments, subtitle data corresponding to the video to be processed.
In some embodiments, merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the text segments includes:
merging the multiple text segments according to at least one of: whether adjacent text segments satisfy the preset single-subtitle sentence length requirement after merging, whether the semantics respectively corresponding to adjacent text segments remain coherent after merging, and the pause duration between adjacent text segments.
In some embodiments, merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the text segments includes:
determining whether adjacent text segments satisfy a merging condition according to whether the adjacent text segments satisfy the preset single-subtitle sentence length requirement after merging;
determining whether adjacent text segments satisfy a merging condition according to whether the semantics respectively corresponding to the adjacent text segments remain coherent after merging;
for each text segment, when both the preceding and following adjacent text segments satisfy the merging conditions, merging the text segment with the adjacent text segment whose corresponding audio segments have the shorter pause duration.
In some embodiments, the preset single-subtitle sentence length requirement includes at least one of: a characters-per-second (CPS) requirement or a maximum display duration requirement for a single subtitle in the video.
In some embodiments, segmenting the text data into multiple text segments according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters includes:
inputting the text data into a text processing module and obtaining the multiple text segments output by the text processing module;
where the text processing module includes: a sub-module that performs segmentation based on the multiple segmentation positions, a sub-module that performs text segmentation based on the pronunciation object information of the audio segments corresponding to the characters, and a sub-module that performs text segmentation based on the timestamp information of the audio segments corresponding to the characters.
In some embodiments, the text processing module includes: a first segmentation module that segments the text data based on punctuation analysis, a second segmentation module that segments the text data based on grammatical characteristics, a third segmentation module that performs segmentation based on the pronunciation object information corresponding to the audio data, and a fourth segmentation module that performs segmentation based on the timestamp information of the audio segments respectively corresponding to the characters in the text data.
In some embodiments, the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a serial manner; the input of the third segmentation module includes the output of the second segmentation module and the audio data, and the input of the fourth segmentation module includes the output of the third segmentation module and the timestamp information of the audio segments corresponding to the characters.
In some embodiments, the first segmentation module, the second segmentation module, the third segmentation module and the fourth segmentation module are connected in a parallel manner; the inputs of the first segmentation module and the second segmentation module include the text data, the input of the third segmentation module includes the text data and the audio data, and the input of the fourth segmentation module includes the timestamp information of the audio segments corresponding to the characters in the text data as well as the text data.
In some embodiments, the subtitle data is a SubRip Text (SRT) subtitle file.
In some embodiments, the method further includes: fusing the subtitle data with the video to be processed to obtain a target video with subtitles.
In a second aspect, an embodiment of the present disclosure provides a subtitle generation apparatus, including:
an audio processing module configured to extract audio data from a video to be processed and perform speech recognition on the audio data to obtain text data corresponding to the audio data;
an acquisition module configured to acquire multiple segmentation positions determined for the text data based on syntactic analysis, as well as pronunciation object information and timestamp information of the audio segments corresponding to the characters included in the text data;
a text segmentation module configured to segment the text data into multiple text segments according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters, where the audio segments respectively corresponding to the characters in a text segment belong to the same pronunciation object, and the duration of blank segments within the audio segments corresponding to the text segment is less than a preset duration;
a merging module configured to merge the multiple text segments according to the semantics of each text segment and the timestamp information of the audio segments corresponding to the characters, to obtain multiple merged segments that are semantically coherent and satisfy a preset single-subtitle sentence length requirement;
a generation module configured to generate, according to the multiple merged segments, subtitle data corresponding to the video to be processed.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the subtitle generation method according to the first aspect or any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the subtitle generation method according to the first aspect or any implementation of the first aspect.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program product; when an electronic device executes the computer program product, the electronic device implements the subtitle generation method according to the first aspect or any implementation of the first aspect.
In a sixth aspect, an embodiment of the present disclosure further provides a computer program, including instructions that, when executed by a processor, cause the processor to perform the subtitle generation method according to the first aspect or any implementation of the first aspect.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
In order to explain the technical solutions in the embodiments of the present disclosure or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, a person of ordinary skill in the art can also obtain other drawings from these drawings without creative effort.
Figure 1 is a flow chart of a subtitle generation method according to an embodiment of the present disclosure;
Figure 2 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure;
Figure 3 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure;
Figure 4 is a flow chart of a subtitle generation method provided by another embodiment of the present disclosure;
Figure 5 is a flow chart of a subtitle generation method provided by another embodiment of the present disclosure;
Figure 6 is a schematic structural diagram of a subtitle generation apparatus provided by an embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be understood more clearly, the solutions of the present disclosure are further described below. It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in ways other than those described here; obviously, the embodiments in the specification are only some, rather than all, of the embodiments of the present disclosure.
At present, generating subtitles for a video usually includes the following process: extracting audio from the video, performing speech recognition on the audio data to obtain text data corresponding to the audio data, and then performing punctuation recovery on the text data to obtain segmented text fragments; subtitle data is then generated according to the times of the video segments corresponding to these text fragments, and the subtitle data is fused with the video to obtain a video with subtitles. With this approach, the fragmentation of the text data relies on the result of punctuation recovery, so the sentence length of a single subtitle cannot be well controlled; this affects the layout of the subtitles and their display duration in the video, degrades the subjective experience of the subtitles, and prevents them from providing a good aid to comprehension.
For example, when a single subtitle has a long sentence length, that is, when a single subtitle contains a large number of characters, the display screen size of the electronic device is limited and the subtitle has to be wrapped, that is, displayed on multiple lines; when the subtitle occupies more lines, the area blocked by the subtitle expands, which may cover more of the video frame and affect the user's viewing of the video content. In addition, the longer the sentence length of a single subtitle, the longer its display duration in the video, which also affects the user's viewing of the video content.
As another example, some short sentences are spoken quickly, so a single subtitle has a short sentence length, that is, it contains a small number of characters, but the pronunciation duration of each character is short; the subtitle is therefore displayed in the video only briefly, the user may not have time to read its content in detail, and the purpose of aiding comprehension cannot be achieved.
As yet another example, the same text with different pause durations may express different semantics, and subtitles obtained through punctuation recovery may fail to accurately express the semantics of the same text at different audio positions.
On this basis, the present disclosure provides a subtitle generation method: audio data in the video to be processed is extracted and speech recognition is performed on it to obtain corresponding text data; multiple segmentation positions determined for the text data based on syntactic analysis are obtained, together with the pronunciation object information and timestamp information of the audio segments corresponding to the characters included in the text data; based on the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters, the text data is segmented into multiple text fragments that meet the requirements; the multiple text fragments are then merged according to the semantics of each text fragment and the timestamp information of the audio segments corresponding to the characters, to obtain multiple merged fragments that are semantically coherent and satisfy the preset single-subtitle sentence length requirement; and subtitle data corresponding to the video to be processed is generated according to the multiple merged fragments. By combining features of the text dimension and the audio dimension for segmentation and merging, the method of the present disclosure can better control the sentence length of a single subtitle and its display duration in the video, greatly improving the subtitles' ability to aid comprehension. In addition, the blank duration between the audio segments corresponding to the characters is fully considered during merging and segmentation, so that identical speech content expressing different meanings is segmented and merged in different ways; the method can therefore also effectively reduce the occurrence of ambiguity.
Illustratively, the subtitle generation method provided in this embodiment can be executed by an electronic device. The electronic device can be a tablet, a mobile phone (such as a folding-screen phone, a large-screen phone, etc.), a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a smart TV, a smart screen, an HD TV, a 4K TV, a smart speaker, a smart projector or another Internet of Things (IoT) device; the present disclosure places no restriction on the specific type of electronic device. The present disclosure likewise does not limit the type of operating system of the electronic device, for example, the Android system, Linux system, Windows system, iOS system, etc.
Based on the foregoing description, the present disclosure takes an electronic device as an example and, with reference to the accompanying drawings and application scenarios, elaborates on the subtitle generation method provided by the present disclosure.
Figure 1 is a flow chart of a subtitle generation method provided by an embodiment of the present disclosure. Referring to Figure 1, the method of this embodiment includes the following steps.
S101: extract audio data from the video to be processed, perform speech recognition on the audio data, and obtain text data corresponding to the audio data.
The video to be processed is a video to which subtitles are to be added. The electronic device can obtain the video to be processed; the video may have been recorded by the user with the electronic device, downloaded from the Internet, or produced by the user with video-processing software. The present disclosure does not limit how the video to be processed is obtained, nor does it limit parameters such as the video content, duration, storage format or definition of the video to be processed.
The electronic device can extract the audio data from the video to be processed and convert the audio data into text data. Illustratively, the electronic device can convert the audio data into text data through a speech recognition model; the present disclosure does not limit the type or parameters of the speech recognition model, which can be, for example, a deep neural network model, a convolutional neural network model, etc. Alternatively, the electronic device can also use other existing speech recognition tools or methods to convert the audio data into text data. The present disclosure does not limit how the electronic device implements speech recognition.
The text data may include a continuous sequence of characters; for example, the text data includes "今天我很高兴我和爸爸妈妈去了游乐场" ("Today I am very happy; my parents and I went to the amusement park"), which contains no punctuation marks. It should be noted that, since the audio data can correspond to one or more language types, the generated text data can also include characters corresponding to one or more language types.
Of course, during speech recognition the audio can also be converted into a single language as far as possible, to facilitate subsequent fragmentation. For example, the speech recognition result for an audio clip may be "哈喽" or it may be "hello"; since Chinese characters account for a relatively high proportion of the whole text data, the former can be chosen if the goal is to improve the consistency of language types in the subtitles, and the latter can be chosen if the goal is to make the subtitles more interesting.
S102: obtain multiple segmentation positions determined for the text data based on syntactic analysis, as well as the pronunciation object information and timestamp information of the audio segments corresponding to the characters included in the text data.
The electronic device can analyze the text data through a syntactic analysis model to obtain multiple segmentation positions. The syntactic analysis can include punctuation position analysis, grammatical feature analysis, and so on; through syntactic analysis, multiple clause positions can be obtained, and a clause position is a segmentation position.
By performing pronunciation object recognition on the audio data, the electronic device can identify the audio segments corresponding to different pronunciation objects and then, combining the correspondence between the audio segments of the different pronunciation objects and the text data, obtain the pronunciation object information of the audio segment corresponding to each character.
By segmenting the audio data, the electronic device can obtain the timestamp information of the audio segment corresponding to each character; the timestamp information can include a start time and an end time.
S103: segment the text data into multiple text segments according to the multiple segmentation positions and the pronunciation object information and timestamp information of the audio segments corresponding to the characters.
In each text segment obtained by segmentation, the audio segments respectively corresponding to the characters belong to the same pronunciation object, and the duration of the blank segments within the audio segments corresponding to the text segment is less than the preset duration.
Segmenting the text data into multiple text segments can be achieved through a text processing module. The text processing module can include multiple sub-modules, each of which segments the input text data according to features of one or more of the aforementioned dimensions; after the text data has been processed by the multiple sub-modules, it can be divided into multiple first text fragments.
Segmentation of the text data by the text processing module is illustrated later with reference to the embodiments shown in Figure 2 and Figure 3.
S104、根据各文本片段的语义以及各字符对应的音频片段的时间戳信息,对多个文本片段进行合并,得到多个语义通顺且满足预设单条字幕句长要求的合并片段。
在一些实施例中,根据相邻的文本片段合并后是否满足预设单条字幕句长要求,确定相邻的文本片段是否满足合并条件。文本片段的语义可以通过对文本片段进行语义分析获得,基于语义能够判断相邻的文本片段所要表达的内容是否连续通顺,进而可以作为合并依据指导文本片段的合并,避免将语义不通顺的文本片段合并在一起,给用户带来不好的体验。
在一些实施例中,根据通过各字符对应的音频片段的时间戳信息能够得到文本片段之间的停顿时长。具体地,可以根据前一个文本片段的最后一个字符对应的音频片段的结束时间以及后一个文本片段的第一个字符对应的音频片段的起始时间确定相 邻文本片段之间的停顿时长。在合并过程中,可以趋向于将文本片段之间的停顿时长更短的两个相邻文本片段进行合并。停顿时长更短可以表明音频数据中想要表达的内容的连续性更强,合并在一起更加完整地表达音频数据中的内容,从而更有利于用户理解。
In addition, during merging it is necessary to judge whether the result of merging multiple text segments satisfies the preset single-subtitle sentence-length requirement, thereby controlling the subtitle sentence length and, with it, the time for which the subtitle is displayed on screen.
By combining the above three aspects when merging text segments, merged segments that are semantically fluent and meet the preset subtitle sentence-length requirement can be obtained.
For example, suppose text segment 1, text segment 2, and text segment 3 are three consecutive text segments. Based on semantics it is determined that segment 1 can be merged with segment 2 and that segment 2 can be merged with segment 3; the pause between segments 1 and 2 is t1 and the pause between segments 2 and 3 is t2, with t1 less than t2, so merging segments 1 and 2 is the more reasonable choice. Moreover, the result of merging segments 1 and 2 satisfies the preset single-subtitle sentence-length requirement, so the merging condition is met and segment 1 is merged with segment 2.
It should be noted that the merged segment obtained by merging text segments 1 and 2 may itself be the merged segment corresponding to a final single subtitle, or it may still need to be merged with the adjacent text segment 3 before the merged segment corresponding to a single subtitle is obtained.
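The sketch below illustrates one possible greedy merging pass that combines the three signals described above (length limit, semantic fluency, and pause comparison). The semantic check is passed in as a callable because the disclosure leaves the concrete semantic model open; the threshold and example data are assumptions.

```python
def merge_segments(segments, pauses, max_chars, coherent):
    """One greedy left-to-right merging pass.

    segments  : text segments produced by the splitting step
    pauses    : pauses[i] is the silence (s) between segments[i] and segments[i + 1]
    max_chars : preset single-subtitle sentence-length limit
    coherent  : callable(a, b) -> bool, the semantic fluency check
    """
    merged = [segments[0]]
    for i in range(1, len(segments)):
        fits = len(merged[-1]) + len(segments[i]) <= max_chars
        # Tie-break on pauses: absorb segments[i] into the previous piece only if
        # the pause before it is no longer than the pause after it.
        prefer_prev = i == len(pauses) or pauses[i - 1] <= pauses[i]
        if fits and coherent(merged[-1], segments[i]) and prefer_prev:
            merged[-1] += segments[i]
        else:
            merged.append(segments[i])
    return merged

# Worked example from the description: t1 < t2, so segments 1 and 2 are merged first.
segs = ["今天我很高兴", "我和爸爸妈妈", "去了游乐场"]
print(merge_segments(segs, pauses=[0.2, 0.8], max_chars=15, coherent=lambda a, b: True))
# ['今天我很高兴我和爸爸妈妈', '去了游乐场']
```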
S105: Generate the subtitle data corresponding to the video to be processed according to the multiple merged segments.
Each merged segment corresponds to one subtitle; the multiple merged segments are converted, in order, into a subtitle file of a preset format, yielding the subtitle data corresponding to the video to be processed.
The subtitle data may be, but is not limited to, an SRT file.
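SRT is a plain-text format in which each subtitle is written as an index, a time range, and the subtitle text, separated by blank lines. A minimal sketch of converting merged segments into such a file follows; the tuple layout of the input is an assumption made for the sketch.

```python
def format_timestamp(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(merged, path="subtitles.srt"):
    # merged: list of (text, start_seconds, end_seconds) tuples in playback order.
    lines = []
    for idx, (text, start, end) in enumerate(merged, start=1):
        lines += [str(idx), f"{format_timestamp(start)} --> {format_timestamp(end)}", text, ""]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

write_srt([("今天我很高兴我和爸爸妈妈", 0.0, 2.8), ("去了游乐场", 2.8, 4.1)])
```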
With the method provided by this embodiment, splitting the text data and merging the resulting text segments by combining features of the text dimension and the audio dimension makes it possible to better control the sentence length of a single subtitle and the time for which it is displayed in the video without impairing semantic understanding, greatly improving the comprehension-aiding effect of the subtitles; in addition, the method can effectively reduce the occurrence of ambiguity.
As described in connection with the embodiment shown in Fig. 1, when the electronic device segments the text data through a text processing module (which may also be understood as a text processing model), the order in which the sub-modules of the text processing module are connected can be set flexibly. Fig. 2 and Fig. 3 each illustrate one of two different arrangements.
Assume that, in the embodiments shown in Fig. 2 and Fig. 3, the text processing module includes: a first segmentation module that splits the text data based on punctuation analysis, a second segmentation module that splits the text data based on grammatical features, a third segmentation module that splits based on the speaker information corresponding to the audio data, and a fourth segmentation module that splits based on the timestamp information of the audio clip corresponding to each character in the text data.
Fig. 2 is a schematic structural diagram of a text processing module provided by an embodiment of the present disclosure. Referring to Fig. 2, the output of the first segmentation module is connected to the input of the second segmentation module, the output of the second segmentation module is connected to the input of the third segmentation module, and the output of the third segmentation module is connected to the input of the fourth segmentation module. With the structure shown in Fig. 2, the segmentation modules of the text processing module can be understood as being connected in series.
The first segmentation module takes the text data as input, performs punctuation analysis (which may also be understood as punctuation restoration) on the text data to obtain the clause-boundary positions of multiple punctuation marks, and splits the text data into text segments based on these positions. The text segments output by the first segmentation module are fed into the second segmentation module, which analyses the grammatical features of the text segments to determine multiple split positions; based on these positions, the text segments from the first segmentation module can be split again or adjusted, yielding multiple text segments. The text segments output by the second segmentation module, together with the audio data, are fed into the third segmentation module; the third segmentation module performs speaker identification on the audio data, determines the start and end positions of the audio clips corresponding to different speakers, then determines split positions in the text data based on the audio clips of the different speakers, and splits the text segments again at these positions so that each resulting text segment corresponds to a single speaker. Next, the fourth segmentation module determines the pause duration between adjacent characters from the start and end times of the audio clip corresponding to each character and, based on a comparison of the pause duration with a preset duration (i.e., a duration threshold), keeps adjacent characters whose pause is shorter than the preset duration in the same text segment and splits adjacent characters whose pause is greater than or equal to the preset duration into two different text segments. On this basis, the multiple text segments output by the last sub-module of the text processing module (i.e., the fourth segmentation module) constitute the final segmentation result of the text data.
The present disclosure does not limit the value of the preset duration, which may be, for example, 0.4 s, 0.5 s, or 0.6 s; the preset duration may be obtained through statistical analysis of the pauses between the audio clips corresponding to the characters in a large amount of audio data.
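A minimal sketch of the fourth segmentation module under these assumptions (character-level timestamps and an illustrative 0.5 s threshold), splitting wherever the silence between adjacent characters reaches the preset duration:

```python
def split_by_pause(chars, threshold=0.5):
    """chars: (character, start_time, end_time) triples in time order.
    Adjacent characters separated by a pause >= threshold go into different
    segments; shorter pauses keep them in the same segment."""
    segments, current = [], [chars[0]]
    for prev, cur in zip(chars, chars[1:]):
        if cur[1] - prev[2] >= threshold:   # silence between the two characters
            segments.append(current)
            current = [cur]
        else:
            current.append(cur)
    segments.append(current)
    return segments
```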
As a possible implementation, each segmentation module of the text processing module may be implemented with a corresponding machine learning model; for example, the first segmentation module may be implemented with a pre-trained punctuation restoration model, the second segmentation module with a pre-trained grammatical-feature analysis model, the third segmentation module (i.e., the speaker segmentation module) with a pre-trained audio processing model, and the fourth segmentation module (i.e., the pause-duration segmentation module) with a pre-trained character processing model. The present disclosure does not limit the types or parameters of the machine learning models used by the segmentation modules.
Fig. 3 is a schematic structural diagram of a text processing module provided by an embodiment of the present disclosure. Referring to Fig. 3, the segmentation modules of the text processing module are connected in parallel: the first and second segmentation modules each take the original text data as input; the third segmentation module takes the audio data and the original text data as input; and the fourth segmentation module takes the original text data as input, with each character in the text data carrying timestamp information. Each segmentation module determines split positions from its own input and splits the text data accordingly; the split results output by the individual segmentation modules are then fused to obtain the multiple text segments.
For the processing performed by each segmentation module of the text processing module, reference may be made to the description of the embodiment shown in Fig. 2; for brevity, it is not repeated here.
It should be noted that the way the segmentation modules of the text processing module are connected is not limited to the examples of Fig. 2 and Fig. 3 above and may be implemented in other ways. Exemplarily, serial and parallel connections may be combined: for instance, the first and second segmentation modules may be connected in series, the third and fourth segmentation modules may be connected in series, and the first-plus-second pair as a whole may be connected in parallel with the third-plus-fourth pair as a whole.
It should further be noted that the order in which the segmentation modules of the text processing module are connected may be adjusted flexibly for different scenarios; for example, in a scenario with many speakers, splitting may first be performed based on the speakers and afterwards based on punctuation analysis, grammatical-feature analysis, and the timestamp information of the audio clips corresponding to the characters.
Fig. 4 is a flowchart of a subtitle generation method provided by an embodiment of the present disclosure; the embodiment of Fig. 4 is mainly used to illustrate how the electronic device merges text segments. Referring to Fig. 4, the electronic device may merge text segments by invoking a merging module. The merging module includes: an indicator module, a semantic analysis module, a pause-duration comparison module, and a text splicing module.
The indicator module can judge whether merging two input text segments satisfies the preset subtitle sentence-length requirement. The preset subtitle sentence-length requirement is mainly a requirement on how long a single subtitle remains in the video; to make it easy to determine whether a generated single subtitle meets the requirement, the preset subtitle sentence-length requirement may be a preset maximum number of characters per second (CPS) or a preset maximum display duration of a single subtitle in the video, both of which reflect well how long a single subtitle remains in the video.
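As an illustration of the indicator module, the check below evaluates a candidate single subtitle against the two indicators just mentioned; the threshold values are placeholders, since the disclosure does not prescribe them.

```python
def satisfies_length_requirement(text: str, start: float, end: float,
                                 max_cps: float = 15.0, max_duration: float = 6.0) -> bool:
    # Two indicators of how long a single subtitle remains on screen:
    # characters per second (CPS) and the maximum display duration.
    duration = end - start
    if duration <= 0:
        return False
    return len(text) / duration <= max_cps and duration <= max_duration
```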
In addition, the semantic analysis module can determine, based on the respective semantics of the two input text segments, whether they can be merged, and output to the text splicing module identification information indicating whether merging is allowed; for example, the semantic analysis module outputs the flag 1 to indicate that merging is allowed and the flag 0 to indicate that it is not.
The pause-duration comparison module is configured to determine, from the timestamp information of the audio clips corresponding to the characters in the text segments, the result of comparing the pause durations between multiple adjacent text segments.
The text splicing module combines the results or indications output by the indicator module, the semantic analysis module, and the pause-duration comparison module to determine a merging scheme, and splices together text segments that satisfy the preset subtitle sentence-length requirement, are semantically fluent, and are separated by shorter pauses, thereby obtaining the multiple merged segments.
In implementation, the indicator module and the semantic analysis module may exchange data; for example, the indicator module may output its judgement to the semantic analysis module, and the semantic analysis module may then evaluate only those text-segment combinations that satisfy the preset subtitle sentence-length requirement and skip judging semantic fluency for combinations that do not, reducing the workload of the semantic analysis module and improving subtitle generation efficiency.
Suppose that after segmentation the text data yields N text segments, namely text segment 1, text segment 2, ..., text segment N.
Exemplarily, the electronic device may determine in turn whether merging text segments 1 and 2, and merging text segments 2 and 3, satisfies the preset subtitle sentence-length requirement. If it is determined based on semantic features that segments 1 and 2 can be merged and segments 2 and 3 can also be merged, but the pause between segments 1 and 2 is shorter, then segments 1 and 2 are merged to obtain merged segment 1. The electronic device may then determine, according to the preset subtitle sentence-length requirement and the semantics of the text segments, whether merged segment 1 can be merged with text segment 3; if so, they are merged to obtain a new merged segment 1. Alternatively, the electronic device may determine, according to the preset subtitle sentence-length requirement and the semantics of the text segments, whether text segments 3 and 4 can be merged; if so, they are merged to obtain merged segment 2. The electronic device may compare the subtitle effect of the segment obtained by merging merged segment 1 with text segment 3 against the subtitle effect of the segment obtained by merging text segments 3 and 4, and thereby determine the final merging scheme for text segment 3.
By analogy, a merging scheme can be obtained for every text segment.
It should be noted that determining whether merging two text segments satisfies the preset subtitle sentence-length requirement, determining whether the two text segments can be merged based on their semantics, and comparing the pauses between the audio clips of the preceding and following adjacent text segments can be performed in parallel, after which the merging is carried out by combining the three judgement results.
It should also be noted that the merging may go through multiple rounds of processing; for example, if the merged segments obtained in the first round are all rather short, the merged segments from the first round can be used as input for another round of merging, so that the sentence length of a single subtitle approaches the preset subtitle sentence-length requirement as closely as possible.
In another possible implementation, since text segments 1 to N contain few characters and multiple rounds of merging may be needed, merging in rounds 1 to m1 may be performed based on the preset subtitle sentence-length requirement, the semantics of the text segments, and the pauses between the audio clips corresponding to the text segments, while merging in the subsequent rounds m1+1 to M may be performed based on the preset subtitle sentence-length requirement and the semantic features of the text segments.
In some cases, the electronic device may also obtain different merging results, i.e., multiple versions of subtitle data, by combining the preset subtitle sentence-length requirement with the semantics of the text segments and the pause characteristics between the audio clips corresponding to the text segments, and then select the subtitle data with the better subtitle effect according to the subtitle effects presented by the multiple versions. For example, the multiple versions of subtitle data may all be presented to the user so that the user can preview the subtitle effect of each version, and the subtitle data meeting the user's expectation may be selected as the final version based on the user's operation.
Merging multiple text segments with the method provided by the present disclosure yields single subtitles of suitable sentence length, ensures that each single subtitle is displayed in the video for an appropriate duration, and improves the comprehension-aiding effect of the subtitles. For example, with the solution provided by the present disclosure, a single sentence containing many characters can be divided into multiple sentences presented in multiple single subtitles, avoiding the problems of an over-long single subtitle that needs multiple display lines, has a cluttered layout, and stays on screen too long; for short sentences spoken quickly, the characters of the short sentence can be combined with the characters of an adjacent sentence to lengthen the time the corresponding subtitle remains in the video, ensuring the user has enough time to read the subtitle clearly. Moreover, by using the pauses between the audio clips corresponding to the text segments to decide which neighbouring text segment a text segment should be merged with, i.e., the one with stronger content continuity, the method can effectively reduce ambiguity and ensure that the subtitle data accurately expresses the content of the audio data.
Fig. 5 is a flowchart of a subtitle generation method provided by another embodiment of the present disclosure. Referring to Fig. 5, on the basis of the embodiment shown in Fig. 1, the method of this embodiment further includes, after step S105:
S106: Fuse the subtitle data with the video to be processed to obtain the subtitled target video.
The video data of the video to be processed consists of its consecutive video frame images. For each single subtitle included in the subtitle data, the subtitle is overlaid, in a preset subtitle display style, on the video frame images of its corresponding display period, thereby obtaining the target video with subtitles.
The display period corresponding to a single subtitle can be determined from the start time of the audio clip corresponding to the first character of that subtitle and the end time of the audio clip corresponding to its last character. Based on the start and end times of the single piece of subtitle data, the video frame images within the corresponding display period are determined, and the single subtitle is overlaid, in the preset display style, on all video frame images within that period; performing this processing for every subtitle in the subtitle data yields the subtitled target video.
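A small sketch of this mapping, assuming character-level (start, end) timestamps and a constant frame rate; the 25 fps value is an assumption made for illustration.

```python
import math

def display_interval(chars):
    # Display window of one subtitle: start of its first character's audio clip
    # to end of its last character's audio clip.
    return chars[0][1], chars[-1][2]

def frames_to_overlay(start: float, end: float, fps: float = 25.0) -> range:
    # Indices of the video frames that fall inside the display window.
    return range(math.floor(start * fps), math.ceil(end * fps))
```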
The sentence lengths of the subtitles in the target video obtained with the method of this embodiment are better suited to reading, which can greatly improve the user experience.
Exemplarily, an embodiment of the present disclosure further provides a subtitle generation apparatus.
Fig. 6 is a schematic structural diagram of a subtitle generation apparatus provided by an embodiment of the present disclosure. Referring to Fig. 6, the apparatus 600 provided by this embodiment includes:
an audio processing module 601, configured to extract audio data from a video to be processed, perform speech recognition on the audio data, and obtain the text data corresponding to the audio data;
an obtaining module 602, configured to obtain multiple segmentation positions determined for the text data based on syntactic analysis, as well as the speaker information and timestamp information of the audio clip corresponding to each character included in the text data;
a text segmentation module 603, configured to segment the text data into multiple text segments according to the multiple segmentation positions and the speaker information and timestamp information of the audio clip corresponding to each character, wherein in each text segment the audio clips corresponding to the characters belong to the same speaker, and the duration of any blank clip within the audio corresponding to the text segment is less than a preset duration;
a merging module 604, configured to merge the multiple text segments according to the semantics of each text segment and the timestamp information of the audio clip corresponding to each character, to obtain multiple merged segments that are semantically fluent and satisfy the preset single-subtitle sentence-length requirement; and
a generation module 605, configured to generate the subtitle data corresponding to the video to be processed according to the multiple merged segments.
As a possible implementation, the merging module 604 is specifically configured to merge the multiple text segments according to at least one of: whether adjacent text segments, once merged, satisfy the preset single-subtitle sentence-length requirement; whether the respective semantics of adjacent text segments remain fluent once merged; and the pause duration between adjacent text segments.
As a possible implementation, the merging module 604 is specifically configured to: determine whether adjacent text segments satisfy the merging condition according to whether they satisfy the preset single-subtitle sentence-length requirement once merged; determine whether adjacent text segments satisfy the merging condition according to whether their respective semantics remain fluent once merged; and, for each text segment, where the text segment satisfies the merging condition with both its preceding and following adjacent text segments, merge the text segment with the adjacent text segment separated from it by the shorter pause between the corresponding audio clips.
As a possible implementation, the preset single-subtitle sentence-length requirement includes at least one of: a characters-per-second (CPS) requirement, or a maximum display duration requirement for a single subtitle in the video.
As a possible implementation, the text segmentation module 603 is specifically configured to input the text data into a text processing module and obtain the multiple text segments output by the text processing module, wherein the text processing module includes: a sub-module that performs splitting based on the multiple segmentation positions, a sub-module that performs text splitting based on the speaker information of the audio clip corresponding to each character, and a sub-module that performs text splitting based on the timestamp information of the audio clip corresponding to each character.
As a possible implementation, the text processing module includes: a first segmentation module that splits the text data based on punctuation analysis, a second segmentation module that splits the text data based on grammatical features, a third segmentation module that performs splitting based on the speaker information corresponding to the audio data, and a fourth segmentation module that performs splitting based on the timestamp information of the audio clip corresponding to each character in the text data.
As a possible implementation, the first segmentation module, the second segmentation module, the third segmentation module, and the fourth segmentation module are connected in series, the input of the third segmentation module includes the output of the second segmentation module and the audio data, and the input of the fourth segmentation module includes the output of the third segmentation module and the timestamp information of the audio clip corresponding to each character.
As a possible implementation, the first segmentation module, the second segmentation module, the third segmentation module, and the fourth segmentation module are connected in parallel, the inputs of the first segmentation module and the second segmentation module include the text data, the input of the third segmentation module includes the text data and the audio data, and the input of the fourth segmentation module includes the text data and the timestamp information of the audio clip corresponding to each character in the text data.
As a possible implementation, the subtitle data is a text-format subtitle (SRT) file.
As a possible implementation, the apparatus 600 further includes: a fusion module 606, configured to fuse the subtitle data with the video to be processed to obtain the subtitled target video.
The subtitle generation apparatus provided by this embodiment can be used to perform the technical solution of any of the foregoing method embodiments; its implementation principle and technical effects are similar, and reference may be made to the detailed description of the foregoing method embodiments, which is not repeated here for brevity.
Exemplarily, the present disclosure further provides an electronic device.
Fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. Referring to Fig. 7, the electronic device 700 provided by this embodiment includes: a memory 701 and a processor 702.
The memory 701 may be an independent physical unit connected to the processor 702 through a bus 703; the memory 701 and the processor 702 may also be integrated together and implemented in hardware, etc.
The memory 701 is used to store program instructions, and the processor 702 invokes these program instructions to perform the subtitle generation method provided by any of the above method embodiments.
Optionally, when part or all of the method of the above embodiments is implemented in software, the electronic device 700 may also include only the processor 702; the memory 701 storing the program is located outside the electronic device 700, and the processor 702 is connected to the memory through circuits/wires to read and execute the program stored in the memory.
The processor 702 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 702 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 701 may include volatile memory such as random-access memory (RAM); the memory may also include non-volatile memory such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above kinds of memory.
The present disclosure further provides a readable storage medium comprising computer program instructions which, when executed by at least one processor of an electronic device, cause the electronic device to implement the subtitle generation method provided by any of the above method embodiments.
The present disclosure further provides a computer program product which, when run on a computer, causes the computer to implement the subtitle generation method provided by any of the above method embodiments.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are only specific implementations of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not to be limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

  1. A subtitle generation method, comprising:
    extracting audio data from a video to be processed, and performing speech recognition on the audio data to obtain text data corresponding to the audio data;
    obtaining multiple segmentation positions determined for the text data based on syntactic analysis, and speaker information and timestamp information of an audio clip corresponding to each character included in the text data;
    segmenting the text data into multiple text segments according to the multiple segmentation positions and the speaker information and timestamp information of the audio clip corresponding to each character, wherein the audio clips corresponding to the characters in a text segment belong to the same speaker, and a duration of a blank clip within the audio corresponding to the text segment is less than a preset duration;
    merging the multiple text segments according to semantics of each text segment and the timestamp information of the audio clip corresponding to each character, to obtain multiple merged segments that are semantically fluent and satisfy a preset single-subtitle sentence-length requirement; and
    generating subtitle data corresponding to the video to be processed according to the multiple merged segments.
  2. The subtitle generation method according to claim 1, wherein the merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio clips corresponding to the text segments comprises:
    merging the multiple text segments according to at least one of: whether adjacent text segments, once merged, satisfy the preset single-subtitle sentence-length requirement; whether the respective semantics of adjacent text segments remain fluent once merged; and a pause duration between adjacent text segments.
  3. The subtitle generation method according to claim 2, wherein the merging the multiple text segments according to the semantics of each text segment and the timestamp information of the audio clips corresponding to the text segments comprises:
    determining whether adjacent text segments satisfy a merging condition according to whether the adjacent text segments, once merged, satisfy the preset single-subtitle sentence-length requirement;
    determining whether adjacent text segments satisfy the merging condition according to whether the respective semantics of the adjacent text segments remain fluent once merged; and
    for each text segment, where the text segment satisfies the merging condition with both its preceding and following adjacent text segments, merging the text segment with the adjacent text segment separated from it by the shorter pause between the corresponding audio clips.
  4. The subtitle generation method according to any one of claims 1 to 3, wherein the preset single-subtitle sentence-length requirement comprises at least one of: a characters-per-second (CPS) requirement, or a maximum display duration requirement for a single subtitle in the video.
  5. The subtitle generation method according to any one of claims 1 to 4, wherein the segmenting the text data into multiple text segments according to the multiple segmentation positions and the speaker information and timestamp information of the audio clip corresponding to each character comprises:
    inputting the text data into a text processing module, and obtaining the multiple text segments output by the text processing module;
    wherein the text processing module comprises: a sub-module that performs splitting based on the multiple segmentation positions, a sub-module that performs text splitting based on the speaker information of the audio clip corresponding to each character, and a sub-module that performs text splitting based on the timestamp information of the audio clip corresponding to each character.
  6. The subtitle generation method according to claim 5, wherein the text processing module comprises: a first segmentation module that splits the text data based on punctuation analysis, a second segmentation module that splits the text data based on grammatical features, a third segmentation module that performs splitting based on the speaker information corresponding to the audio data, and a fourth segmentation module that performs splitting based on the timestamp information of the audio clip corresponding to each character in the text data.
  7. The subtitle generation method according to claim 5 or 6, wherein the first segmentation module, the second segmentation module, the third segmentation module, and the fourth segmentation module are connected in series, an input of the third segmentation module comprises an output of the second segmentation module and the audio data, and an input of the fourth segmentation module comprises an output of the third segmentation module and the timestamp information of the audio clip corresponding to each character.
  8. The subtitle generation method according to any one of claims 5 to 7, wherein the first segmentation module, the second segmentation module, the third segmentation module, and the fourth segmentation module are connected in parallel, inputs of the first segmentation module and the second segmentation module comprise the text data, an input of the third segmentation module comprises the text data and the audio data, and an input of the fourth segmentation module comprises the text data and the timestamp information of the audio clip corresponding to each character in the text data.
  9. The subtitle generation method according to any one of claims 1 to 8, wherein the subtitle data is a text-format subtitle (SRT) file.
  10. The subtitle generation method according to any one of claims 1 to 9, further comprising:
    fusing the subtitle data with the video to be processed to obtain a subtitled target video.
  11. A subtitle generation apparatus, comprising:
    an audio processing module, configured to extract audio data from a video to be processed, and perform speech recognition on the audio data to obtain text data corresponding to the audio data;
    an obtaining module, configured to obtain multiple segmentation positions determined for the text data based on syntactic analysis, and speaker information and timestamp information of an audio clip corresponding to each character included in the text data;
    a text segmentation module, configured to segment the text data into multiple text segments according to the multiple segmentation positions and the speaker information and timestamp information of the audio clip corresponding to each character, wherein the audio clips corresponding to the characters in a text segment belong to the same speaker, and a duration of a blank clip within the audio corresponding to the text segment is less than a preset duration;
    a merging module, configured to merge the multiple text segments according to semantics of each text segment and the timestamp information of the audio clip corresponding to each character, to obtain multiple merged segments that are semantically fluent and satisfy a preset single-subtitle sentence-length requirement; and
    a generation module, configured to generate subtitle data corresponding to the video to be processed according to the multiple merged segments.
  12. An electronic device, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the subtitle generation method according to any one of claims 1 to 10.
  13. A readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the subtitle generation method according to any one of claims 1 to 10.
  14. A computer program product, wherein, when an electronic device executes the computer program product, the electronic device is caused to implement the subtitle generation method according to any one of claims 1 to 10.
  15. A computer program, comprising:
    instructions which, when executed by a processor, cause the processor to perform the subtitle generation method according to any one of claims 1 to 10.
PCT/CN2023/097415 2022-05-31 2023-05-31 字幕生成方法、装置、电子设备、存储介质及程序 WO2023232073A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210615156.5A CN117201876A (zh) 2022-05-31 2022-05-31 字幕生成方法、装置、电子设备、存储介质及程序
CN202210615156.5 2022-05-31

Publications (1)

Publication Number Publication Date
WO2023232073A1 true WO2023232073A1 (zh) 2023-12-07

Family

ID=88998479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097415 WO2023232073A1 (zh) 2022-05-31 2023-05-31 字幕生成方法、装置、电子设备、存储介质及程序

Country Status (2)

Country Link
CN (1) CN117201876A (zh)
WO (1) WO2023232073A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019164535A1 (en) * 2018-02-26 2019-08-29 Google Llc Automated voice translation dubbing for prerecorded videos
EP3817395A1 (en) * 2019-10-30 2021-05-05 Beijing Xiaomi Mobile Software Co., Ltd. Video recording method and apparatus, device, and readable storage medium
CN112995736A (zh) * 2021-04-22 2021-06-18 南京亿铭科技有限公司 语音字幕合成方法、装置、计算机设备及存储介质
CN112995754A (zh) * 2021-02-26 2021-06-18 北京奇艺世纪科技有限公司 字幕质量检测方法、装置、计算机设备和存储介质
CN113225612A (zh) * 2021-04-14 2021-08-06 新东方教育科技集团有限公司 字幕生成方法、装置、计算机可读存储介质及电子设备
CN113889113A (zh) * 2021-11-10 2022-01-04 北京有竹居网络技术有限公司 分句方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
CN117201876A (zh) 2023-12-08

Similar Documents

Publication Publication Date Title
US11003349B2 (en) Actionable content displayed on a touch screen
US9645985B2 (en) Systems and methods for customizing text in media content
CN111813998B (zh) 一种视频数据处理方法、装置、设备及存储介质
WO2022227218A1 (zh) 药名识别方法、装置、计算机设备和存储介质
US20180157657A1 (en) Method, apparatus, client terminal, and server for associating videos with e-books
CN111798543B (zh) 模型训练方法、数据处理方法、装置、设备及存储介质
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN104994404A (zh) 一种为视频获取关键词的方法及装置
JP2022160662A (ja) 文字認識方法、装置、機器、記憶媒体、スマート辞書ペン及びコンピュータプログラム
CN112995749A (zh) 视频字幕的处理方法、装置、设备和存储介质
CN114268829B (zh) 视频处理方法、装置、电子设备及计算机可读存储介质
CN110889266A (zh) 一种会议记录整合方法和装置
CN114694070A (zh) 一种自动视频剪辑方法、系统、终端及存储介质
KR20200063316A (ko) 각본 기반의 영상 검색 장치 및 방법
CN107066438A (zh) 一种文本编辑方法及装置,电子设备
CN113987264A (zh) 视频摘要生成方法、装置、设备、系统及介质
CN113923479A (zh) 音视频剪辑方法和装置
WO2023232073A1 (zh) 字幕生成方法、装置、电子设备、存储介质及程序
CN112233661B (zh) 基于语音识别的影视内容字幕生成方法、系统及设备
JP2024517902A (ja) 音声認識トレーニングセットの生成のための方法および装置
CN112784527A (zh) 一种文档合并方法、装置及电子设备
US20240046048A1 (en) Synchronizing translation with source multimedia
WO2023083252A1 (zh) 音色选择方法、装置、电子设备、可读存储介质及程序产品
CN115942005A (zh) 用于生成解说视频的方法、装置、设备和存储介质
CN115329104A (zh) 会议纪要文件生成方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23815250

Country of ref document: EP

Kind code of ref document: A1