CN114420104A - Method for automatically generating subtitles and related product - Google Patents


Info

Publication number
CN114420104A
Authority
CN
China
Prior art keywords
text
sentence
punctuation
model
audio
Prior art date
Legal status
Pending
Application number
CN202210102736.4A
Other languages
Chinese (zh)
Inventor
高圣洲
穆禹彤
王艳
孙艳庆
段亦涛
周枫
Current Assignee
Netease Youdao Information Technology Beijing Co Ltd
Original Assignee
Netease Youdao Information Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Youdao Information Technology Beijing Co Ltd
Priority to CN202210102736.4A
Publication of CN114420104A

Classifications

    All classifications fall under G10L (Physics; Musical instruments, acoustics; Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/05 Word boundary detection
    • G10L 15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a method for automatically generating subtitles, and a related product. The method comprises: acquiring audio information that is time-aligned with the speech in a multimedia file; acquiring a speech recognition text of the audio information; performing text readability enhancement processing on the speech recognition text; and generating, based on the processed speech recognition text, a subtitle text that reads against the speech in the multimedia file. With this scheme, subtitle text matching the speech in a multimedia file can be generated automatically, which not only saves labor and time costs but also ensures a close match between the subtitle text and the speech. An apparatus and a computer-readable storage medium are further provided.

Description

Method for automatically generating subtitles and related product
Technical Field
Embodiments of the present invention relate to the field of information processing technology, and more particularly, to a method for automatically generating subtitles, an apparatus for performing the method, and a computer-readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Thus, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
During listening practice, a user's listening material may come from a variety of sources. As a result, the material may have no accompanying reference text at all, or the available reference text may not match it well (for example, the text is too long or its paragraphs are poorly organized), which hinders the user's side-by-side study. To cope with this, text is usually transcribed manually, sentence by sentence, from the audio so that a suitable reference text can be produced on demand; this manual way of producing reference text consumes a great deal of time and labor.
Disclosure of Invention
Producing a reference text for known listening material is a time-consuming, laborious and frustrating process.
Therefore, there is a strong need for an improved scheme for automatically generating subtitles, and a related product, that can automatically generate subtitle text reading against the speech in a multimedia file without manual intervention, thereby saving labor and time costs.
In this context, embodiments of the present invention provide a scheme for automatically generating subtitles and related products.
In a first aspect of embodiments of the present invention, a method for automatically generating subtitles is provided, comprising: acquiring audio information that is time-aligned with the speech in a multimedia file; acquiring a speech recognition text of the audio information; performing text readability enhancement processing on the speech recognition text; and generating, based on the processed speech recognition text, a subtitle text that reads against the speech in the multimedia file.
In one embodiment of the present invention, acquiring the audio information related to the speech timing in the multimedia file comprises: performing sentence-breaking processing on the speech in the multimedia file with a pre-sentence-breaking model to obtain the audio information.
In another embodiment of the present invention, performing sentence-breaking processing on the speech in the multimedia file with the pre-sentence-breaking model comprises: predicting, frame by frame in time order with the pre-sentence-breaking model, a sentence-break label for each audio frame, the label being one of four types: beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio; and determining the first audio frame whose label type is end-of-sentence silence as a sentence break point.
In yet another embodiment of the present invention, the pre-sentence-breaking model comprises a gated recurrent neural network model and a conditional random field decoding module, and is pre-trained by: generating a training corpus from training audio with text annotations, wherein each audio frame in the training corpus is time-aligned with the corresponding text and annotated with a sentence-break label type; predicting, frame by frame in time order with the gated recurrent neural network model and the conditional random field decoding module, a sentence-break label for each audio frame; and training the pre-sentence-breaking model based on the annotated sentence-break labels and the predicted sentence-break labels.
In still another embodiment of the present invention, the method further comprises: in response to the audio information exceeding a predetermined duration, performing sentence-breaking processing on the audio information again.
In one embodiment of the invention, performing text readability enhancement processing on the speech recognition text comprises one or more of the following: performing format adjustment, including at least adding punctuation, on the speech recognition text; and performing sentence-breaking processing with a post-sentence-breaking model in combination with the speech recognition text and the audio information.
In another embodiment of the present invention, the post-sentence-breaking model comprises a temporal convolutional neural network model with an attention mechanism and a conditional random field decoding module, and performing sentence-breaking processing with the post-sentence-breaking model in combination with the speech recognition text and the audio information comprises: encoding the audio information with the temporal convolutional neural network model; fusing the encoded output of the temporal convolutional neural network model with the speech recognition text corresponding to the audio information by means of the attention mechanism to obtain fused information; decoding the fused information with the conditional random field decoding module to obtain a sentence-break label for each audio frame in the audio information, the label being one of four types: beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio; and determining, within each predetermined duration, the first audio frame whose label type is end-of-sentence silence as a sentence break point.
In yet another embodiment of the present invention, performing format adjustment, including at least adding punctuation, on the speech recognition text comprises: performing written-form adjustment on the speech recognition text; and adjusting the punctuation and/or capitalization format of the written-form-adjusted speech recognition text with a format adjustment model.
In yet another embodiment of the present invention, the format adjustment model comprises an encoder, a punctuation decoder and a case decoder, and adjusting the punctuation and capitalization format of the written-form-adjusted speech recognition text with the format adjustment model comprises: encoding the speech recognition text with the encoder; decoding the encoded output of the encoder with the punctuation decoder to obtain a punctuation output; and decoding the encoded output together with the punctuation output with the case decoder to obtain a case output.
In another embodiment of the present invention, the format adjustment model is pre-trained, and its training samples are obtained by: text-aligning a calibration text with a recognition text obtained via speech recognition; and calibrating the punctuation of the recognition text with the calibration text, the result serving as a training sample.
In one embodiment of the invention, performing written-form adjustment on the speech recognition text comprises: constructing a text conversion model for spoken-to-written text conversion; converting the spoken text in the speech recognition text with the text conversion model; and merging the timestamps of the converted text to achieve text timing alignment.
In another embodiment of the present invention, the text conversion model comprises a weighted finite-state transducer, and converting the spoken text in the speech recognition text with the text conversion model comprises: converting the spoken text in the speech recognition text into written form with the weighted finite-state transducer, wherein the spoken text includes at least one or more of numeric text, monetary text, time text, date text, unit text, and numbered text.
In yet another embodiment of the present invention, the processed speech recognition text comprises one or more sentence segments, and generating the subtitle text that reads against the speech in the multimedia file based on the processed speech recognition text comprises: determining the subtitle text based on the number of punctuation marks of a specified type in each sentence segment and the number of words the segment contains.
In another embodiment of the present invention, determining the subtitle text based on the number of punctuation marks of the specified type in each sentence segment and the number of words it contains comprises: in response to the number of punctuation marks being greater than a first threshold and the number of words being less than a second threshold, determining the subtitle text based on the sentence segment; or, in response to the number of punctuation marks being greater than the first threshold and the number of words being greater than a third threshold, re-segmenting the sentence segment and determining the subtitle text based on the re-segmentation result.
In a second aspect of embodiments of the present invention, an apparatus is provided, comprising: a processor; and a memory storing computer instructions for automatically generating subtitles which, when executed by the processor, cause the apparatus to perform the method according to the foregoing embodiments.
In a third aspect of embodiments of the present invention, a computer-readable storage medium is provided, containing program instructions for automatically generating subtitles which, when executed by a processor, cause the method according to the foregoing embodiments to be performed.
With the scheme for automatically generating subtitles and the related products described above, subtitle text reading against the speech in a multimedia file can be generated automatically, without excessive manual intervention. Specifically, the scheme acquires the speech recognition text of audio information that is time-aligned with the speech, and obtains the corresponding subtitle text by applying text readability enhancement processing to that speech recognition text; this not only saves labor and time costs but also ensures a close match between the subtitle text and the speech. In some embodiments of the invention, a pre-sentence-breaking model predicts several types of sentence-break labels for the speech in the multimedia file; by combining these label types, mid-sentence silence can be distinguished accurately from end-of-sentence silence, which effectively reduces the probability of misjudgment during sentence-breaking. In other embodiments, a pre-trained format adjustment model adjusts the punctuation of the speech recognition text; its training samples are obtained by text-aligning the recognition text produced by speech recognition with a calibration text and calibrating the punctuation, so that the noise of speech misrecognition is already present during training, which effectively improves the robustness of the model when the speech is misrecognized.
In addition, in still other embodiments of the present invention, a post-sentence-breaking model is introduced to perform sentence-breaking in combination with the speech recognition text and the audio information, achieving multimodal fusion of text and speech through the combination of a temporal convolutional neural network model and an attention mechanism, which further improves sentence-breaking accuracy. Moreover, in some embodiments, a conditional random field decoding module is used when the pre-sentence-breaking model and the post-sentence-breaking model predict the sentence-break label types, which greatly improves the smoothness and robustness of the prediction.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention;
FIG. 2 schematically illustrates a flow diagram of a method of automatically generating subtitles in accordance with one embodiment of the present invention;
FIG. 3 schematically shows a flow chart of a method of automatically generating subtitles according to another embodiment of the present invention;
FIG. 4 schematically illustrates a training framework of a pre-sentence-breaking model in accordance with an embodiment of the present invention;
FIG. 5 is a flow diagram that schematically illustrates a process for segmenting speech in a multimedia file into sentences based on a pre-sentence-breaking model, in accordance with an embodiment of the present invention;
FIG. 6 schematically illustrates a text conversion process based on a weighted finite-state transducer according to an embodiment of the present invention;
FIG. 7 schematically illustrates a training framework of a format adjustment model according to an embodiment of the invention;
FIG. 8 schematically illustrates a training framework of a post-sentence-breaking model in accordance with an embodiment of the present invention;
FIG. 9 is a flow diagram that schematically illustrates a process for segmenting speech in a multimedia file into sentences based on a post-sentence-breaking model, in accordance with an embodiment of the present invention; and
FIG. 10 schematically shows a block diagram of an apparatus according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention. As shown in fig. 1, computing system 100 may include: a Central Processing Unit (CPU)101, a Random Access Memory (RAM)102, a Read Only Memory (ROM)103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial external device 112, a parallel external device 113, and a display 114. Among these devices, coupled to the system bus 104 are a CPU 101, a RAM 102, a ROM 103, a hard disk controller 105, a keyboard controller 106, a serial controller 107, a parallel controller 108, and a display controller 109. The hard disk 110 is coupled to the hard disk controller 105, the keyboard 111 is coupled to the keyboard controller 106, the serial external device 112 is coupled to the serial interface controller 107, the parallel external device 113 is coupled to the parallel interface controller 108, and the display 114 is coupled to the display controller 109. It should be understood that the block diagram of the architecture depicted in FIG. 1 is for purposes of illustration only and is not intended to limit the scope of the present invention. In some cases, certain devices may be added or subtracted as the case may be.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects. The term "computer readable medium" as used herein refers to any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied in the media.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention will be described below with reference to flowchart illustrations of methods and block diagrams of apparatuses (or systems) of embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
According to embodiments of the invention, a method for automatically generating subtitles and a related product are provided. Moreover, the number of any element in the drawings is given by way of example and not by way of limitation, and any naming is used solely for differentiation and not for limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventors have found that the existing way of producing reference text for listening material is time-consuming and labor-intensive. For example, the prior art is limited to manual entry of the text and manual sentence segmentation. Producing reference text manually requires a great deal of time and labor, and cannot keep up with actual demand.
On this basis, the inventors found that subtitle text can be generated from the speech recognition text of audio information that is time-aligned with the speech. In this way, subtitle text that follows the speech timing and is reasonably segmented into sentences can be generated without manual intervention.
In addition, the inventors also found that the voice activity detection (VAD, also called voice endpoint detection) commonly used in speech recognition can only distinguish silence from non-silence; it cannot tell mid-sentence silence from end-of-sentence silence, which causes misjudgments to a certain degree. For example, a sentence may be cut at a mid-sentence pause, or a short end-of-sentence pause may fail to trigger a cut. The inventors therefore found that accurate sentence-breaking can be achieved by dividing the audio sentence-break labels more finely, which effectively reduces the probability of such misjudgments during sentence-breaking.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
A method of automatically generating subtitles according to an exemplary embodiment of the present invention is described below with reference to fig. 2. It should be noted that the embodiments of the present invention can be applied to any applicable scenarios.
Fig. 2 schematically shows a flow diagram of a method 200 of automatically generating subtitles according to one embodiment of the invention. As shown in fig. 2, at step S201, audio information related to the timing of the speech in a multimedia file may be acquired. It should be noted that the multimedia file may be a song, audio accompanying a book, story audio, movie audio, or any other audio/video file for which subtitle output is required. The audio information may include audio to be recognized that is related in time and order (e.g., time-aligned and order-aligned) to the speech in the multimedia file.
Next, at step S202, the speech recognition text of the aforementioned audio information may be acquired. In some embodiments, the audio information may be converted into speech recognition text using automatic speech recognition (ASR) technology, which automatically converts the user's speech content into the corresponding text by machine. It should be noted that ASR is used here only as an example, and the scheme of the present invention is not limited to it; other techniques that convert speech into text automatically may also be employed.
Next, at step S203, text readability enhancement processing may be performed on the speech recognition text. Readability enhancement here can be understood as any processing that increases the readability of the speech recognition text, such as adding punctuation, adjusting font format (e.g., case, italics, bolding), adjusting the written form of the text, or breaking it into sentences. This description of readability enhancement is merely exemplary, and the scheme of the invention is not limited to it.
Next, at step S204, a subtitle text that reads against the speech in the multimedia file may be generated based on the processed speech recognition text. The whole subtitle generation process thus requires no manual intervention: by acquiring the speech recognition text of audio information that is time-aligned with the speech and applying text readability enhancement processing to it, the corresponding subtitle text is obtained, which saves labor and time costs while ensuring a close match between the subtitle text and the speech.
Fig. 3 schematically shows a flow chart of a method 300 for automatically generating subtitles according to another embodiment of the invention. It should be noted that the method 300 can be understood as further defining and supplementing the steps of fig. 2. Therefore, the description above in connection with fig. 2 also applies below.
As shown in fig. 3, at step S301, pre-sentence-breaking processing may be performed on the speech in the multimedia file. As noted above, the multimedia file may be any type of audio or video file for which subtitle output is required. Specifically, the speech in the multimedia file may be segmented using a pre-sentence-breaking model and/or a forced sentence-breaking model. For example, in some embodiments, a preliminary sentence-breaking pass may first be performed on the speech with the pre-sentence-breaking model; any audio information that is still longer than a predetermined duration after this preliminary pass is then segmented again so that it no longer exceeds the predetermined duration. The predetermined duration can be adjusted to the actual design requirement (for example, 20 s). In other embodiments, if the audio information obtained from the preliminary pass of the pre-sentence-breaking model does not exceed the predetermined duration, no forced sentence-breaking is needed. In still other embodiments, forced sentence-breaking may be performed first, with further sentence-breaking then performed in combination with the pre-sentence-breaking model.
In some embodiments, the pre-sentence-breaking model in the context of the present invention may be pre-trained. Fig. 4 schematically illustrates one possible training architecture 400 for the pre-sentence-breaking model. As shown in fig. 4, the pre-sentence-breaking model may include a gated recurrent neural network model (GRU) and a conditional random field (CRF) decoding module. Training the pre-sentence-breaking model may include generating a training corpus from training audio with text annotations; each audio frame is then predicted frame by frame in time order with the GRU and the CRF decoding module to obtain a predicted sentence-break label, and the pre-sentence-breaking model is trained based on the annotated and predicted sentence-break labels.
Specifically, the training audio Dtrain may first be annotated with text and then forced-aligned using ASR to obtain a training corpus Xt in which the audio and the corresponding text are aligned along the time dimension. In addition, a sentence-break label Yt may be annotated for each audio frame in the corpus Xt. The sentence-break label types may include beginning-of-sentence silence (0), mid-sentence silence (2), end-of-sentence silence (3), and audio (1); this division of label types and the digits representing them are only exemplary, and the scheme of the invention is not limited to them. The gated recurrent neural network model GRU then predicts each audio frame of the training corpus Xt frame by frame to obtain an intermediate variable ht, which is processed by the conditional random field decoding module CRF to obtain the predicted sentence-break label Ŷt. A loss function is computed from the annotated sentence-break label Yt and the predicted sentence-break label Ŷt, and the pre-sentence-breaking model is trained with this loss.
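For illustration only, a minimal PyTorch-style sketch of such a GRU + CRF frame classifier might look as follows; it assumes the third-party pytorch-crf package for the CRF layer, and the feature dimension, hidden size and label numbering are illustrative assumptions rather than values fixed by this disclosure.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed available)

NUM_TAGS = 4  # 0: beginning-of-sentence silence, 1: audio, 2: mid-sentence silence, 3: end-of-sentence silence

class PreSentenceBreakModel(nn.Module):
    """GRU encoder plus CRF decoder over per-frame acoustic features (illustrative sizes)."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden, NUM_TAGS)
        self.crf = CRF(NUM_TAGS, batch_first=True)

    def loss(self, frames, labels):              # frames: (B, T, feat_dim), labels: (B, T)
        h, _ = self.gru(frames)                  # intermediate variable h_t
        emissions = self.emission(h)
        return -self.crf(emissions, labels)      # negative CRF log-likelihood as training loss

    def predict(self, frames):
        h, _ = self.gru(frames)
        return self.crf.decode(self.emission(h))  # per-utterance list of frame labels
```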
Further, in some embodiments, the pre-sentence-breaking model trained as in fig. 4 may be used to segment the speech in the multimedia file. Fig. 5 schematically illustrates one possible sentence-breaking method 500 using the pre-sentence-breaking model. As shown in fig. 5, at step S501, the speech in the multimedia file may be predicted frame by frame in time order with the pre-sentence-breaking model to obtain a sentence-break label for each audio frame; each label is one of beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio. Next, at step S502, the first frame whose label type is end-of-sentence silence may be determined as a sentence break point. For example, suppose the pre-sentence-breaking model predicts the label sequence "0011211133" for a segment of audio frames; the 9th frame is the first frame labeled end-of-sentence silence (3), so the 9th frame of that segment can be taken as a sentence break point.
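A small helper like the following, written here purely for illustration, turns such a per-frame label sequence into break-point indices; the value 3 for end-of-sentence silence follows the exemplary numbering above.

```python
END_OF_SENTENCE_SILENCE = 3

def find_sentence_breaks(frame_labels):
    """Return the index of the first frame of each run of end-of-sentence silence.

    For the example sequence 0011211133, only index 8 (the 9th frame) is returned,
    because it opens the trailing run of end-of-sentence silence.
    """
    breaks = []
    prev = None
    for i, label in enumerate(frame_labels):
        if label == END_OF_SENTENCE_SILENCE and prev != END_OF_SENTENCE_SILENCE:
            breaks.append(i)
        prev = label
    return breaks

print(find_sentence_breaks([0, 0, 1, 1, 2, 1, 1, 1, 3, 3]))  # -> [8]
```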
After the pre-sentence-breaking processing is completed, the flow returns to fig. 3. Next, at step S302, speech recognition may be performed on the audio information obtained from the pre-sentence-breaking processing to obtain a speech recognition text. As mentioned above, the speech recognition here may use ASR or any other technique that converts speech into text.
Next, at step S303, text readability enhancement processing may be performed. For example, spoken text in the speech recognition text may be converted into written text at step S303-1. Next, punctuation and/or capitalization adjustment may be applied to the written-form-adjusted speech recognition text at step S303-2. Sentence-breaking processing may also be performed at step S303-3 with a post-sentence-breaking model in combination with the aforementioned speech recognition text and audio information. The execution order of steps S303-1, S303-2 and S303-3 is not limited here and can be adjusted to the actual design requirement; for example, S303-3 may run in parallel with S303-1 and S303-2.
In some embodiments, the conversion of spoken text into written text may be accomplished with a text conversion model. Fig. 6 schematically illustrates one possible conversion process 600 using such a model. The text conversion model may employ a weighted finite-state transducer (WFST), constructed according to inverse text normalization rules for spoken text, which then converts the input text. As shown in fig. 6, each dashed box represents a basic finite-state machine unit. When a piece of text is input, the conversion follows the minimum-weight path automatically. Thus, when the input is "one hundred and thirteen", it is converted to "113" (total weight 1.1) rather than "100 and 13" (total weight 12.2). The timestamps of the converted text may then be merged to achieve text timing alignment; for example, the timestamps of the four words "one hundred and thirteen" may be merged into one to align with the text "113". The spoken text here includes, but is not limited to, one or more of numeric text, monetary text, time text, date text, unit text, and numbered text.
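As a toy illustration of the minimum-weight selection in fig. 6 (not a real WFST implementation; the candidate conversions and weights below are assumptions taken from the figure's example), one could write:

```python
# Toy stand-in for WFST minimum-weight path selection. A real implementation would
# compose weighted transducers built from inverse text normalization rules with an
# FST toolkit; here each spoken phrase simply maps to weighted candidate conversions.
def convert_spoken_number(text):
    candidates = {
        "one hundred and thirteen": [
            ("113", 1.1),          # merge the whole phrase into one cardinal
            ("100 and 13", 12.2),  # convert each number separately (higher total weight)
        ],
    }
    options = candidates.get(text, [(text, 0.0)])
    return min(options, key=lambda pair: pair[1])[0]  # lowest total weight wins

assert convert_spoken_number("one hundred and thirteen") == "113"
```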
The text conversion model can also be realized as an end-to-end model with a Seq2Seq architecture, in which the encoder/decoder may be a recurrent neural network model or a Transformer-type model. When training data is sufficient, inverse text normalization with an end-to-end model generally works better than a WFST, especially for inputs containing misrecognition noise.
In some embodiments, a format adjustment model may be used to adjust the punctuation and/or capitalization of the written-form-adjusted speech recognition text. Fig. 7 schematically illustrates one possible training framework 700 for the format adjustment model. As shown in fig. 7, the format adjustment model may include an encoder, a punctuation decoder, and a case decoder. The encoder may be a recurrent neural network model or an attention-based model such as a Transformer or BERT; the punctuation and case decoders may be models such as multi-layer perceptrons or conditional random fields. When training the format adjustment model, a training sample X from the training data set Dtrain is fed to the encoder to obtain an encoded vector H. The encoded vector H is input to the punctuation decoder, and the output of the punctuation decoder, concatenated with the encoded vector H, is input to the case decoder; the punctuation decoder and the case decoder then output, for each word of the sentence, the predicted punctuation sequence ŷpunct and the predicted case sequence ŷcase, respectively. Loss values Lpunct and Lcase are computed from these predicted sequences and the ground-truth sequences Ypunct and Ycase in the training data, the parameters are adjusted by back-propagation, and the iteration is repeated until the loss converges, so as to train the format adjustment model.
In some embodiments, the training samples are obtained by aligning a calibration text with the recognition text obtained via speech recognition. The recognition text may be produced by ASR, and a text alignment method such as an edit-distance algorithm may be used to align the calibration text with it. The calibration text is then used to restore the correct punctuation onto the recognition text, and the result serves as the aforementioned training sample X. By restoring the correct punctuation onto the recognition text, the input data used for model training stays as consistent as possible with the actual usage scenario; training the format adjustment model on such samples greatly improves its robustness when the ASR output contains recognition errors.
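For illustration, the following sketch uses Python's difflib as a simple stand-in for the edit-distance alignment and projects the reference punctuation onto the ASR words; the token format and punctuation set are assumptions for the example, not part of the disclosure.

```python
import difflib

def make_training_sample(asr_words, calibrated_tokens):
    """Project punctuation from a calibrated reference onto an ASR hypothesis.

    asr_words: ASR output without punctuation, e.g. ["ok", "i'll", "be", "there", "soon"]
    calibrated_tokens: reference words, each possibly carrying trailing punctuation.
    difflib stands in here for the edit-distance alignment mentioned in the text.
    """
    ref_words = [t.rstrip(".,!?") for t in calibrated_tokens]
    matcher = difflib.SequenceMatcher(a=ref_words, b=asr_words)
    sample = list(asr_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ref_tok = calibrated_tokens[block.a + k]
            if ref_tok[-1:] in ".,!?":
                sample[block.b + k] += ref_tok[-1]  # restore punctuation on the matched ASR word
    return sample
```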
In some embodiments, the speech recognition text may be encoded with the encoder (e.g., a recurrent neural network model, a Transformer, or BERT) of the trained format adjustment model described above. The encoded output of the encoder is then decoded with the punctuation decoder (e.g., a multi-layer perceptron or conditional random field model) to obtain a punctuation output, and the encoded output together with the punctuation output is decoded with the case decoder (e.g., likewise a multi-layer perceptron or conditional random field model) to obtain a case output. In this way, the punctuation and capitalization of the speech recognition text are adjusted.
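A compact PyTorch-style sketch of this encoder-plus-two-decoder layout is shown below; the Transformer encoder, the linear heads standing in for the decoders, and all sizes are illustrative assumptions rather than the disclosed model.

```python
import torch
import torch.nn as nn

class FormatAdjustModel(nn.Module):
    """Shared encoder with a punctuation head and a case head (illustrative sizes)."""
    def __init__(self, vocab=30000, dim=256, n_punct=5, n_case=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.punct_head = nn.Linear(dim, n_punct)          # stands in for the punctuation decoder
        self.case_head = nn.Linear(dim + n_punct, n_case)  # case decoder sees H concatenated with the punctuation output

    def forward(self, token_ids):                # token_ids: (B, T)
        h = self.encoder(self.embed(token_ids))  # encoded vector H
        punct = self.punct_head(h)
        case = self.case_head(torch.cat([h, punct], dim=-1))
        return punct, case
```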
Further, in some embodiments, a post-sentence-breaking model may be used to perform sentence-breaking in combination with the aforementioned speech recognition text and audio information. Fig. 8 schematically illustrates one possible training framework 800 for the post-sentence-breaking model. As shown in fig. 8, the post-sentence-breaking model may comprise a temporal convolutional neural network model (TCN) with an attention mechanism (hereinafter ATT) and a conditional random field decoding module (CRF). Training the post-sentence-breaking model may include annotating the training audio Dtrain with text (Text), and then forced-aligning the training audio with ASR to obtain a training corpus X in which the audio and the corresponding text are aligned along the time dimension. In addition, a sentence-break label Y may be annotated for each audio frame in the corpus X; the label types may include beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio.
The temporal convolutional neural network model TCN then predicts each audio frame of the corpus X frame by frame to obtain an intermediate variable htcn. The intermediate variable htcn and the annotated Text are fused with ATT, which outputs a hidden variable hatt; after processing by the conditional random field decoding module CRF, the predicted sentence-break label Ŷ is obtained. A loss function is computed from the annotated sentence-break label Y and the predicted sentence-break label Ŷ, and the post-sentence-breaking model is trained with this loss.
Further, in some embodiments, sentence-breaking may be performed with the post-sentence-breaking model trained as in fig. 8. Fig. 9 schematically illustrates one possible sentence-breaking method 900 using this model. As shown in fig. 9, at step S901, the audio information may be encoded with the aforementioned temporal convolutional neural network model (e.g., TCN). Next, at step S902, the encoded output of the TCN and the speech recognition text corresponding to the audio information may be fused with the attention mechanism (e.g., ATT) to obtain fused information. Next, at step S903, the fused information may be decoded with the conditional random field decoding module (e.g., CRF) to obtain a sentence-break label for each audio frame in the audio information; as before, the label is one of beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio. Then, at step S904, within each predetermined duration, the first audio frame whose label type is end-of-sentence silence may be determined as a sentence break point. Sentence-breaking with the post-sentence-breaking model, which combines the speech recognition text and the audio information, thus achieves multimodal fusion of text and speech through the combination of the temporal convolutional neural network model and the attention mechanism, further improving sentence-breaking accuracy.
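A minimal sketch of this audio-text fusion is given below, using plain nn.Conv1d layers as a stand-in for a real TCN (which would use dilated causal convolutions with residual connections) and nn.MultiheadAttention for ATT; the CRF decoding stage is omitted and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PostSentenceBreakFusion(nn.Module):
    """Audio encoding (TCN-style) fused with text embeddings via attention (sketch)."""
    def __init__(self, feat_dim=80, text_dim=256, hidden=256):
        super().__init__()
        # Stand-in for the temporal convolutional network.
        self.tcn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4),
            nn.ReLU())
        self.text_proj = nn.Linear(text_dim, hidden)
        self.att = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.emission = nn.Linear(hidden, 4)  # per-frame scores for the 4 sentence-break label types

    def forward(self, frames, text_emb):      # frames: (B, T, feat_dim), text_emb: (B, L, text_dim)
        h_tcn = self.tcn(frames.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        keys = self.text_proj(text_emb)
        h_att, _ = self.att(h_tcn, keys, keys)  # audio frames attend to the recognized text
        return self.emission(h_att)             # would normally feed a CRF decoder
```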
After the readability enhancement processing of the speech recognition text is completed, the flow returns to fig. 3. Next, at step S304, the subtitle text may be determined according to the number of punctuation marks of a specified type in each sentence segment and the number of words it contains. A punctuation mark of the specified type may be a period, or a mark at the same level as a period (e.g., an exclamation mark or a question mark, which can likewise separate complete meanings). Specifically, if the number of such punctuation marks in a segment is detected to be greater than a first threshold, the number of words under the segment can be further checked. If the number of words is smaller than a second threshold, the segment contains only a few words, and the subtitle text can be determined directly from it. For example, the segment "OK! I'll be there soon." contains two complete sentences, but each has only a few words, so no further sentence-breaking is needed and the segment can be kept as the subtitle text. If the number of words is larger than a third threshold, the segment contains many words and can be segmented again, the subtitle text then being determined from the re-segmentation result. This further improves the readability of the subtitle text and the user's reading experience. The first, second and third thresholds can all be adjusted to the actual design requirement.
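By way of illustration, the decision rule might be sketched as follows; the particular threshold values are assumptions chosen for the example, since the disclosure leaves them to the designer.

```python
SENTENCE_END_MARKS = {".", "!", "?"}

def needs_resegmentation(segment_text, first_threshold=1, second_threshold=15, third_threshold=25):
    """Decide whether a sentence segment should be broken again before use as subtitle text.

    Threshold values here are illustrative assumptions, not values fixed by the disclosure.
    """
    words = segment_text.split()
    n_marks = sum(1 for ch in segment_text if ch in SENTENCE_END_MARKS)
    if n_marks > first_threshold and len(words) < second_threshold:
        return False  # several short sentences: keep the segment as one subtitle
    if n_marks > first_threshold and len(words) > third_threshold:
        return True   # several long sentences: re-segment before use
    return False

print(needs_resegmentation("OK! I'll be there soon."))  # False -> kept as-is
```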
On this basis, the scheme can help users solve the input problem in a variety of usage scenarios (e.g., listening education and learning). In particular, in scenarios where the user supplies only audio, the scheme of the invention can produce the subtitle text efficiently, reducing the time and labor cost of producing subtitle text manually. Compared with traditional manual production, the scheme for automatically generating subtitles also better serves large-scale needs such as viewing a single line of text, reading along, and looping a single sentence during listening practice. For example, regardless of the type of audio file (e.g., song, book-companion audio, story audio, movie audio), the multimedia file can be processed to output a corresponding subtitle file. Moreover, the generated subtitle file is segmented into sentences reasonably and accurately, which makes it convenient for the user to practice listening comprehension; for instance, the user may need intensive practice against a particular piece of text (reading along, listening to the current sentence repeatedly, and so on). The conversion between spoken and written language during speech recognition, such as displaying the spoken number "one hundred and thirteen" as "113", is also handled intelligently so that the text is easier to read and compare.
Exemplary device
Having described the method of the exemplary embodiment of the present invention, next, a related product of automatically generating subtitles according to the exemplary embodiment of the present invention will be described with reference to fig. 10.
Fig. 10 schematically shows a schematic block diagram of an apparatus 1000 according to an embodiment of the present invention. As shown in fig. 10, the device 1000 may include a processor 1001 and a memory 1002. Wherein the memory 1002 stores computer instructions for automatically generating subtitles, which when executed by the processor 1001, cause the device 1000 to perform the method according to the preceding description in connection with fig. 2 and 3. For example, in some embodiments, device 1000 may perform acquisition of audio information, acquisition of speech recognition text, generation of subtitle text, and/or the like. Based on this, the subtitle text contrasting with the speech in the multimedia file can be automatically generated by the device 1000 without manual intervention.
In some implementation scenarios, device 1000 may include a device with multimedia file input and voice information processing functions and subtitle display functions (e.g., various intelligent electronic products such as cell phones, PCs, etc.). In practical applications, the apparatus 1000 may be composed of an apparatus having a plurality of functions as described above, or may be composed of a combination of a plurality of apparatuses having partial functions. The solution of the invention does not limit the structural design that the device 1000 may have.
It should be noted that although several devices or sub-devices for automatically generating subtitles are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the devices described above may be embodied in a single device; conversely, the features and functions of one device described above may be further divided among multiple devices.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Use of the verbs "comprise", "include" and their conjugations in this application does not exclude the presence of elements or steps other than those stated in this application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects, which is made for convenience of presentation only, does not mean that features in those aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, whose scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (16)

1. A method for automatically generating subtitles, comprising:
acquiring audio information related to the timing of the speech in a multimedia file;
acquiring a speech recognition text of the audio information;
performing text readability enhancement processing on the speech recognition text; and
generating, based on the processed speech recognition text, a subtitle text that reads against the speech in the multimedia file.
2. The method of claim 1, wherein acquiring the audio information related to the timing of the speech in the multimedia file comprises:
performing sentence-breaking processing on the speech in the multimedia file with a pre-sentence-breaking model to obtain the audio information.
3. The method of claim 2, wherein performing sentence-breaking processing on the speech in the multimedia file with the pre-sentence-breaking model comprises:
predicting, frame by frame in time order with the pre-sentence-breaking model, a sentence-break label for each audio frame of the speech, the sentence-break label being one of four types: beginning-of-sentence silence, mid-sentence silence, end-of-sentence silence, and audio; and
determining the first audio frame whose label type is end-of-sentence silence as a sentence break point.
4. The method of claim 3, wherein the pre-sentence-breaking model comprises a gated recurrent neural network model and a conditional random field decoding module, and the pre-sentence-breaking model is pre-trained by:
generating a training corpus from training audio with text annotations, wherein each audio frame in the training corpus is time-aligned with the corresponding text and annotated with a sentence-break label type;
predicting, frame by frame in time order with the gated recurrent neural network model and the conditional random field decoding module, a sentence-break label for each audio frame; and
training the pre-sentence-breaking model based on the annotated sentence-break labels and the predicted sentence-break labels.
5. The method of claim 2, further comprising:
in response to the audio information exceeding a predetermined duration, performing sentence-breaking processing on the audio information again.
6. The method of claim 1, wherein performing text readability enhancement processing on the speech recognition text comprises one or more of:
performing format adjustment at least comprising adding punctuation on the voice recognition text; and
and performing sentence-breaking processing by combining the voice recognition text and the audio information by using a post sentence-breaking model.
7. The method of claim 6, wherein the post-sentence pattern comprises a time-convolutional neural network model with attention mechanism and a conditional random field decoding module, and wherein using the post-sentence pattern in conjunction with the speech recognized text and the audio information to perform sentence break processing comprises:
encoding the audio information using the temporal convolutional neural network model;
fusing the coded output of the time convolution neural network model and the voice recognition text corresponding to the audio information by using the attention mechanism to obtain fused information;
decoding the fusion information by using the conditional random field decoding module to obtain a sentence break label of each audio frame in the audio information, wherein the sentence break label comprises four types: silence at the beginning of a sentence, silence in a sentence, silence at the end of a sentence and audio; and
and determining the first frame which is in the audio frames within each preset time length and is of the type of silence at the tail of the sentence as a period break point.
8. The method of claim 6, wherein formatting the speech recognition text including at least adding punctuation comprises:
performing written form adjustment on the voice recognition text; and
and utilizing the format adjustment model to adjust punctuation and/or case format of the speech recognition text which is subjected to the book formatting adjustment.
9. The method of claim 8, wherein the format adjustment model comprises an encoder, a punctuation decoder, and a case decoder, and wherein adjusting the punctuation and/or case format of the written-form-adjusted speech recognition text by using the format adjustment model comprises:
encoding the speech recognition text with the encoder;
decoding the encoded output of the encoder with the punctuation decoder to obtain a punctuation output; and
decoding the encoded output and the punctuation output by using the case decoder to obtain a case output.
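A compact sketch of the two-headed format adjustment model of claim 9: a shared encoder, a punctuation head, and a case head additionally conditioned on the punctuation output. The small Transformer encoder and the label inventories are assumptions made purely for illustration.

# Format adjustment model sketch: shared encoder, punctuation head, and a case
# head conditioned on both the encoder output and the punctuation output.
import torch
import torch.nn as nn

class FormatModel(nn.Module):
    def __init__(self, vocab=10000, dim=256, n_punct=5, n_case=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.punct_head = nn.Linear(dim, n_punct)            # e.g. none , . ? !
        self.case_head = nn.Linear(dim + n_punct, n_case)    # e.g. lower / capitalised / upper

    def forward(self, token_ids):
        enc = self.encoder(self.embed(token_ids))                 # (B, T, dim)
        punct = self.punct_head(enc)                               # punctuation output
        case = self.case_head(torch.cat([enc, punct], dim=-1))    # case conditioned on punctuation
        return punct, case

punct, case = FormatModel()(torch.randint(0, 10000, (1, 12)))
print(punct.shape, case.shape)   # torch.Size([1, 12, 5]) torch.Size([1, 12, 3])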
10. The method of claim 9, wherein the format adjustment model is pre-trained, and wherein training samples for the pre-training are obtained by:
performing text alignment between a reference text and the recognized text obtained through speech recognition; and
calibrating the punctuation of the recognized text by using the reference text, the calibrated recognized text serving as the training samples.
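One plausible way to build the training samples of claim 10 is to align the recognized text with the punctuated reference text and project the reference punctuation onto the matched words. The sketch below uses difflib as a stand-in for the alignment step; the punctuation set and helper names are hypothetical.

# Training-sample sketch: project reference punctuation onto matching ASR words.
import difflib

PUNCT = ",.?!"

def project_punctuation(asr_words, ref_words):
    """Return (asr_word, following_punctuation) pairs for matched words."""
    norm = lambda w: w.strip(PUNCT).lower()
    matcher = difflib.SequenceMatcher(a=[norm(w) for w in asr_words],
                                      b=[norm(w) for w in ref_words])
    labelled = [(w, "") for w in asr_words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ref = ref_words[block.b + k]
            mark = ref[-1] if ref and ref[-1] in PUNCT else ""
            labelled[block.a + k] = (asr_words[block.a + k], mark)
    return labelled

print(project_punctuation(["hello", "world", "how", "are", "you"],
                          ["Hello", "world.", "How", "are", "you?"]))
# -> [('hello', ''), ('world', '.'), ('how', ''), ('are', ''), ('you', '?')]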
11. The method of claim 8, wherein performing the written-form adjustment on the speech recognition text comprises:
constructing a text conversion model for spoken-to-written text conversion;
converting the spoken text in the speech recognition text by using the text conversion model; and
merging the timestamps of the converted text to achieve time alignment of the text.
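The timestamp merge of claim 11 can be illustrated as follows: when several spoken tokens collapse into one written token (for example "twenty" "five" into "25"), the written token keeps the start time of the first source token and the end time of the last one. The data layout used here is an assumption.

# Timestamp merge sketch: a converted written token spans from the start of its
# first source token to the end of its last source token.
def merge_timestamps(source_tokens):
    """source_tokens: list of (word, start_sec, end_sec) mapping to one written token."""
    return (source_tokens[0][1], source_tokens[-1][2])

print(merge_timestamps([("twenty", 3.10, 3.45), ("five", 3.45, 3.80)]))   # -> (3.1, 3.8)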
12. The method of claim 11, wherein the text conversion model comprises a weighted finite state machine, and wherein converting spoken text in the speech recognition text using the text conversion model comprises:
converting the spoken text in the speech recognition text into a written form by using the weighted finite state machine, wherein the spoken text comprises one or more of numeric text, monetary text, time text, date text, unit text, and numbered text.
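A toy stand-in for the conversion of claim 12: a production system would compile such rules into a weighted finite-state transducer, whereas the sketch below uses two regular-expression rules chosen purely as examples.

# Spoken-to-written conversion sketch using two illustrative regular-expression rules.
import re

NUM = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
       "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10"}

def spoken_to_written(text):
    text = re.sub(r"\b(" + "|".join(NUM) + r")\b", lambda m: NUM[m.group(1)], text)  # number words -> digits
    text = re.sub(r"\b(\d+) dollars\b", r"$\1", text)                                # "<n> dollars" -> "$<n>"
    return text

print(spoken_to_written("it costs five dollars and takes ten minutes"))
# -> "it costs $5 and takes 10 minutes"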
13. The method of any one of claims 1-12, wherein the processed speech recognition text comprises one or more sentence fragments, and wherein generating subtitle text for the speech in the multimedia file based on the processed speech recognition text comprises:
determining the subtitle text based on the number of punctuation marks of a specified type in each sentence fragment and the number of words contained in the sentence fragment.
14. The method of claim 13, wherein determining the subtitle text based on the number of punctuation marks of the specified type in each sentence fragment and the number of words contained therein comprises:
in response to the number of punctuation marks being greater than a first threshold and the number of words being less than a second threshold, determining the subtitle text based on the sentence fragment; or
in response to the number of punctuation marks being greater than the first threshold and the number of words being greater than a third threshold, re-breaking the sentence fragment and determining the subtitle text based on the re-broken result.
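The threshold logic of claims 13-14 might look roughly like the sketch below; the concrete threshold values, the designated punctuation marks, and the comma-based re-breaking are assumptions for illustration only.

# Subtitle decision sketch for claims 13-14 (threshold values are placeholders).
FIRST_THRESHOLD = 0     # minimum count of designated punctuation marks
SECOND_THRESHOLD = 12   # keep the fragment as one subtitle below this word count
THIRD_THRESHOLD = 12    # re-break the fragment above this word count

def to_subtitles(fragment, designated=",;"):
    words = fragment.split()
    marks = sum(fragment.count(p) for p in designated)
    if marks > FIRST_THRESHOLD and len(words) < SECOND_THRESHOLD:
        return [fragment]                                   # short enough: one subtitle
    if marks > FIRST_THRESHOLD and len(words) > THIRD_THRESHOLD:
        parts = [p.strip() for p in fragment.replace(";", ",").split(",")]
        return [p for p in parts if p]                      # naive re-break on the marks
    return [fragment]

print(to_subtitles("we collected the audio, aligned it with the transcripts, "
                   "trained the model, and then evaluated it on held-out lectures"))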
15. An apparatus, comprising:
a processor; and
a memory storing computer instructions for automatically generating subtitles which, when executed by the processor, cause the apparatus to perform the method according to any one of claims 1-14.
16. A computer-readable storage medium containing program instructions for automatically generating subtitles which, when executed by a processor, cause the method according to any one of claims 1 to 14 to be carried out.
CN202210102736.4A 2022-01-27 2022-01-27 Method for automatically generating subtitles and related product Pending CN114420104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210102736.4A CN114420104A (en) 2022-01-27 2022-01-27 Method for automatically generating subtitles and related product


Publications (1)

Publication Number Publication Date
CN114420104A true CN114420104A (en) 2022-04-29

Family

ID=81278421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210102736.4A Pending CN114420104A (en) 2022-01-27 2022-01-27 Method for automatically generating subtitles and related product

Country Status (1)

Country Link
CN (1) CN114420104A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105245917A (en) * 2015-09-28 2016-01-13 徐信 System and method for generating multimedia voice caption
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN110415706A (en) * 2019-08-08 2019-11-05 常州市小先信息技术有限公司 A kind of technology and its application of superimposed subtitle real-time in video calling
CN112002328A (en) * 2020-08-10 2020-11-27 中央广播电视总台 Subtitle generating method and device, computer storage medium and electronic equipment
CN111968649A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Subtitle correction method, subtitle display method, device, equipment and medium
CN113053390A (en) * 2021-03-22 2021-06-29 北京儒博科技有限公司 Text processing method and device based on voice recognition, electronic equipment and medium
CN113343720A (en) * 2021-06-30 2021-09-03 北京搜狗科技发展有限公司 Subtitle translation method and device for subtitle translation

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024053825A1 (en) * 2022-09-07 2024-03-14 Samsung Electronics Co., Ltd. Electronic device for training voice recognition model, and control method therefor
CN117219067A (en) * 2023-09-27 2023-12-12 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding
CN117219067B (en) * 2023-09-27 2024-04-09 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
US20200402500A1 (en) Method and device for generating speech recognition model and storage medium
CN111667816A (en) Model training method, speech synthesis method, apparatus, device and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111312231B (en) Audio detection method and device, electronic equipment and readable storage medium
CN114420104A (en) Method for automatically generating subtitles and related product
CN111341293B (en) Text voice front-end conversion method, device, equipment and storage medium
CN112365878B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN113628610B (en) Voice synthesis method and device and electronic equipment
Fang et al. Using phoneme representations to build predictive models robust to ASR errors
CN118043885A (en) Contrast twin network for semi-supervised speech recognition
Hsiao et al. Online automatic speech recognition with listen, attend and spell model
CN113225612A (en) Subtitle generating method and device, computer readable storage medium and electronic equipment
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
Nozaki et al. End-to-end speech-to-punctuated-text recognition
Cheng et al. AlloST: low-resource speech translation without source transcription
Yang et al. Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech
CN114239554A (en) Text sentence-breaking method, text sentence-breaking training device, electronic equipment and storage medium
CN113268996A (en) Method for expanding corpus, training method for translation model and product
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
Chen et al. End-to-end recognition of streaming Japanese speech using CTC and local attention
CN113436616B (en) Multi-field self-adaptive end-to-end voice recognition method, system and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination