CN110798733A - Subtitle generating method and device, computer storage medium and electronic equipment - Google Patents

Subtitle generating method and device, computer storage medium and electronic equipment

Info

Publication number
CN110798733A
CN110798733A (application number CN201911047803.1A)
Authority
CN
China
Prior art keywords
text data
characters
determining
standard manuscript
operation steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911047803.1A
Other languages
Chinese (zh)
Inventor
崔建伟
蔡贺
黄建新
张歆
黄伟峰
朱米春
杜伟
王一韩
闫磊
钱岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Platform
China Central TV Station
Original Assignee
Central Platform
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central Platform filed Critical Central Platform
Priority to CN201911047803.1A priority Critical patent/CN110798733A/en
Publication of CN110798733A publication Critical patent/CN110798733A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/8126 Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455 Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream

Abstract

A subtitle generating method, a subtitle generating device, a computer storage medium and an electronic device. The method comprises the following steps: determining an audio file of a program; transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file; matching the text data with the standard manuscript of the program; and attaching the time code information of the text data to the standard manuscript according to the matching result to obtain a subtitle file with time code information. With the scheme in the present application, performing speech recognition on the audio automatically matches the subtitles of a television program with the spoken content, so that the audio and the text time codes are synchronized and the subtitle text carries time code information.

Description

Subtitle generating method and device, computer storage medium and electronic equipment
Technical Field
The present application relates to a program production technology, and in particular, to a method and an apparatus for generating subtitles, a computer storage medium, and an electronic device.
Background
In the audio and video subtitle post-production stage of the media industry, the subtitles and the speech content of a television program cannot be matched automatically, yet the audio and the subtitles must be synchronized when the program is broadcast. At present, when subtitles are produced, a subtitler has to tap out the timeline manually in subtitle software, and the subtitle production process is as follows:
Step 1: import the transcribed subtitle file (produced by listening and typing) into the subtitle software;
Step 2: import the corresponding audio/video file;
Step 3: open the timeline editor;
Step 4: tap to mark the start time of the first sentence's time code;
Step 5: tap to mark the end time of the first sentence's time code;
Step 6: check whether the start and end times of the first sentence's time code are correct;
Step 7: repeat steps 4 to 6 to mark the time codes of the second and subsequent sentences;
...
Step N: export the srt file.
As the above process shows, marking the time code axis is the most painstaking and tedious step. The subtitler must listen to the audio, watch the speaker's mouth movements in the video, and tap the timeline all at the same time, and after marking the timeline of each sentence must replay the current audio/video clip to check it. Moreover, as soon as the time code of one sentence is marked incorrectly, or needs to be changed for some reason, the timelines of the associated sentences that follow are directly affected.
Problems existing in the prior art:
At present, the media industry, and the broadcasting and television industry in particular, must broadcast a massive number of audio and video programs, and the subtitles for each program file are matched entirely by hand. Taking video subtitles as an example, a subtitler doing manual matching must attend to the video, the audio, and the subtitles simultaneously, and must listen, transcribe, and correct repeatedly, so the process is laborious and inefficient. Moreover, if the subtitler finds that one sentence needs to be revised, every later time code affected by that change must be revised as well.
Disclosure of Invention
The embodiment of the application provides a subtitle generating method and device, a computer storage medium and electronic equipment, so as to solve the technical problems.
According to a first aspect of the embodiments of the present application, there is provided a subtitle generating method, including the steps of:
determining an audio file of a program;
the audio file is transcribed into characters to obtain text data with time code information corresponding to the audio file;
matching the text data with the standard manuscript of the program;
and adding the time code information of the text data to the standard manuscript according to the matched result to obtain a subtitle file with the time code information.
According to a second aspect of the embodiments of the present application, there is provided a subtitle generating apparatus including:
the audio determining module is used for determining an audio file of the program;
the text generation module is used for transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file;
the matching module is used for matching the text data with the standard manuscript of the program;
and the time code attaching module is used for attaching the time code information of the text data to the standard manuscript according to the matched result to obtain the subtitle file with the time code information.
According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the subtitle generating method as described above.
According to a fourth aspect of embodiments of the present application, there is provided an electronic device comprising a memory for storing one or more programs, and one or more processors; the one or more programs, when executed by the one or more processors, implement the subtitle generating method as described above.
By adopting the subtitle generating method and device, the computer storage medium and the electronic equipment provided by the embodiment of the application, after the audio file of a program is determined, the text data with time code information is obtained by performing voice recognition on the audio file, then the text data is matched with the standard manuscript of the program, the time code information of the text data is attached to the standard manuscript according to the matched result, the subtitle file with the time code information is obtained, and the audio and the subtitle file time code are synchronized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 illustrates a schematic flowchart of an implementation of a subtitle generating method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram illustrating a subtitle generating apparatus according to a second embodiment of the present application;
fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
Detailed Description
In view of the technical problems in the prior art, embodiments of the present application provide a subtitle generating method and apparatus, a computer storage medium, and an electronic device, which implement automatic matching between subtitles and voice content of a television program by performing voice recognition on audio, so that the audio is synchronized with a text time code (or abbreviated as "time code"), and a subtitle text has time code information.
The scheme in the embodiments of the present application can be implemented in various computer languages, such as the object-oriented programming language Java or the interpreted scripting language JavaScript.
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example one
Fig. 1 shows a flowchart of an implementation of a subtitle generating method according to a first embodiment of the present application.
As shown in the figure, the subtitle generating method includes:
step 101, determining an audio file of a program;
step 102, transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file;
step 103, matching the text data with the standard manuscript of the program;
and step 104, adding the time code information of the text data to the standard manuscript according to the matched result to obtain a subtitle file with the time code information.
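The four steps above can be sketched end to end. In this hypothetical Python sketch, the transcription output of step 102 is represented as time-coded characters, and the standard library's `difflib` stands in for the edit-distance matcher described later; all names and the data format are illustrative assumptions, not from the patent:

```python
import difflib
from typing import List, Tuple

def generate_subtitles(timed_chars: List[Tuple[str, float]],
                       manuscript: str) -> List[Tuple[str, float]]:
    """Steps 101-104 in miniature: `timed_chars` plays the role of the
    transcription output (step 102), difflib stands in for the edit-distance
    matching (step 103), and the loop attaches time codes (step 104)."""
    transcript = "".join(c for c, _ in timed_chars)
    sm = difflib.SequenceMatcher(None, transcript, manuscript)
    result: List[Tuple[str, float]] = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("equal", "replace"):
            # Carry each transcript character's time code over to the
            # manuscript character at the aligned position.
            for offset, j in enumerate(range(j1, j2)):
                i = min(i1 + offset, i2 - 1)
                result.append((manuscript[j], timed_chars[i][1]))
        elif tag == "insert":
            # Manuscript characters with no spoken counterpart inherit the
            # previous time code (or 0.0 at the very start).
            t = result[-1][1] if result else 0.0
            result.extend((manuscript[j], t) for j in range(j1, j2))
        # tag == "delete": transcript characters absent from the
        # manuscript are simply dropped.
    return result
```

A mis-transcribed character keeps its time code but takes the manuscript's spelling, which is exactly the behaviour step 104 describes.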
When a program is recorded on site, the audio file or video file of the speakers, such as the host and guests, can be recorded at the same time, and the audio file and/or video file can be stored on a computer for subsequent program production.
After the audio file of the program is determined, the audio file can further be transcribed into characters to obtain text data with time code information corresponding to the audio file. Specifically, existing speech transcription or speech recognition technology can be used to transcribe the audio file into characters; the specific transcription process is not described here again.
Typically, before a program is recorded, a standard manuscript of the program exists, which may include the program name, the performance format, the performers, and the specific program content organized in chronological order. In a specific implementation of the embodiment of the present application, information such as the program name, performance format, and performers may never be spoken aloud, so the audio file described in the embodiment of the present application may correspond only to the specific program content organized in chronological order.
After the text data corresponding to the audio file is obtained, the embodiment of the present application may further match the text data with the standard manuscript of the program. Because the speakers may not follow the standard manuscript word for word while the program is recorded, or because of transcription errors made while transcribing the audio file, the text data of the audio file may not be fully consistent with the standard manuscript of the program. For example, the transcribed text data reads "I love the background exhibition hall" while the standard manuscript reads "I love the Beijing exhibition hall" ("background" and "Beijing" are near-homophones in Chinese); the embodiment of the present application matches "I love the" in the text data with "I love the" in the standard manuscript, and "exhibition hall" in the text data with "exhibition hall" in the standard manuscript.
Finally, according to the matching result, the time code information of the text data is attached to the standard manuscript to obtain a subtitle file with time code information. For example, the time code information of the text data transcribed from the audio file is:
i love background exhibition hall
023 031 036 058
After attachment to the standard manuscript, the subtitle file obtained is:
i love Beijing exhibition hall
023 031 036 058
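The attachment in this example reduces to a positional zip once the matcher has paired the words; a minimal sketch, assuming a one-to-one word alignment and the (illustrative) time codes above:

```python
# Time-coded ASR output; "background" is a homophone mis-transcription.
asr_words = [("I", 23), ("love", 31), ("background", 36), ("exhibition hall", 58)]
manuscript = ["I", "love", "Beijing", "exhibition hall"]

# Step 104: keep the manuscript's (correct) words, carry over the ASR time codes.
subtitle = [(word, tc) for (_, tc), word in zip(asr_words, manuscript)]
# → [("I", 23), ("love", 31), ("Beijing", 36), ("exhibition hall", 58)]
```

The subtitle text comes from the manuscript and the timing from the audio, so the transcription error never reaches the viewer.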
By adopting the subtitle generating method provided by the embodiment of the application, after the audio file of a program is determined, the audio file is subjected to voice recognition to obtain text data with time code information, then the text data is matched with the standard manuscript of the program, the time code information of the text data is attached to the standard manuscript according to the matched result to obtain the subtitle file with the time code information, and the audio and the subtitle file time code are synchronous.
In one embodiment, the matching the text data with the standard manuscript of the program includes:
determining the minimum operation times for matching the characters in the text data with the characters in the standard manuscript of the program and the operation steps;
and matching the characters in the text data with the characters in the standard manuscript of the program according to the operation step of the minimum operation times.
In specific implementation, the matching of the text data obtained by transcribing the audio file and the standard manuscript of the program in the embodiment of the present application may specifically refer to matching characters in the text data obtained by transcribing the audio file and characters in the standard manuscript of the program.
Because the text data transcribed from the audio file and the standard manuscript of the program may contain inconsistent characters, some operations are required so that every matchable character in the text data is matched with its corresponding character in the standard manuscript.
These operations can be performed in various ways, for example, replacing the entire text data with the standard manuscript, or replacing certain characters in the text data with the characters at the corresponding positions in the standard manuscript.
In the embodiment of the application, firstly, the minimum operation times and the operation steps for matching the characters in the text data with the characters in the standard manuscript of the program are determined, and then the characters in the text data are matched with the characters in the standard manuscript of the program according to the operation steps of the minimum operation times to obtain the matched result.
In one embodiment, the determining a minimum number of operations to match characters in the text data with characters in a standard manuscript of the program and the operating step include:
determining, for the ith character in the text data, the number of operations min(d[i, j]) and the corresponding operation steps required to make the character string s[1..i] of the text data equal to the character string t[1..j] of the standard manuscript, where 1 ≤ i ≤ N, the total number of characters in the text data, and 1 ≤ j ≤ M, the total number of characters in the standard manuscript;
incrementing i by 1 and repeating the previous step until all characters in the text data have been traversed;
and determining the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program as min(d[N, M]), together with the operation steps corresponding to min(d[N, M]).
In a specific implementation, assuming the text data contains N characters, the matching process may be as follows:
For the 1st character in the text data, if the character string s[1] of the text data equals the character string t[1] of the standard manuscript, the number of operations is 0 and no operation is needed.
If s[1] ≠ t[1], there are several cases:
A1) replace s[1] with t[1]: the number of operations is 1, and the operation step is replacing the 1st character in the text data with the 1st character of the standard manuscript;
B1) check whether s[2] in the text data equals t[1] of the standard manuscript;
if s[2] = t[1], replace s[1] with a null character: the number of operations is 1, and the operation step is replacing the 1st character in the text data with a null character (i.e. deleting it);
if s[2] ≠ t[1], the following sub-cases arise:
B11) check whether s[3] in the text data equals t[1] of the standard manuscript;
B12) check whether s[2] in the text data equals t[2] of the standard manuscript;
...
C1) check whether s[1] in the text data equals t[2] of the standard manuscript;
if s[1] = t[2], the number of operations is 1, and the operation step is inserting a character before the 1st character of the text data;
if s[1] ≠ t[2], the following sub-cases arise in turn:
C12) replace s[1..2] with t[1..2]: the number of operations is 2, and the operation steps are replacing the 1st and 2nd characters in the text data with the 1st and 2nd characters of the standard manuscript;
C22) check whether s[3] in the text data equals t[1] or t[2] of the standard manuscript;
...
Through this dynamic adjustment process, the corresponding characters of the text data and the standard manuscript are matched step by step.
For the 1st character in the text data, determine the number of operations min(d[1, j]) and the corresponding operation steps required for s[1] to equal t[1..j] of the standard manuscript;
for the 2nd character in the text data, determine the number of operations min(d[2, j]) and the corresponding operation steps required for s[1..2] to equal t[1..j] of the standard manuscript;
... (j may differ in the standard manuscript when the operations for each character are determined)
Finally, after traversing the N characters in the text data and/or the M characters in the standard manuscript of the program, the minimum number of operations min(d[N, M]) for matching the characters in the text data with the characters in the standard manuscript, and the operation steps corresponding to min(d[N, M]), are obtained.
In one embodiment, determining, for the ith character in the text data, the number of operations min(d[i, j]) and the corresponding operation steps when s[1..i] of the text data equals t[1..j] of the standard manuscript includes:
if s[1..i] can be converted into t[1..j-1] in k operation steps, determining that the number of operations for s[1..i] to equal t[1..j] is k + 1, the operation steps comprising the k operation steps plus a step of appending t[j] (an insertion);
if s[1..i-1] can be converted into t[1..j] in k operation steps, determining that the number of operations for s[1..i] to equal t[1..j] is k + 1, the operation steps comprising the k operation steps plus a step of removing s[i] (a deletion);
if s[1..i-1] can be converted into t[1..j-1] in k operation steps and s[i] ≠ t[j], determining that the number of operations for s[1..i] to equal t[1..j] is k + 1, the operation steps comprising the k operation steps plus a step of replacing s[i] with t[j] (a substitution);
if s[1..i-1] can be converted into t[1..j-1] in k operation steps and s[i] = t[j], determining that the number of operations for s[1..i] to equal t[1..j] is k, the operation steps comprising the k operation steps;
and determining, over these cases, the minimum number of operations min(d[i, j]) for s[1..i] of the text data to equal t[1..j] of the standard manuscript, and the corresponding operation steps.
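The recurrence above is the classic edit-distance dynamic program. A minimal sketch that computes min(d[i, j]) bottom-up and backtraces one optimal sequence of operation steps (the function name and the operation-tuple format are illustrative assumptions):

```python
def min_ops(s: str, t: str):
    """d[i][j] = minimum operations to turn s[1..i] into t[1..j]."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete all of s[1..i]
    for j in range(m + 1):
        d[0][j] = j          # insert all of t[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,        # append t[j] (insertion)
                          d[i - 1][j] + 1,        # remove s[i] (deletion)
                          d[i - 1][j - 1] + sub)  # replace or keep s[i]
    # Backtrace to recover one optimal sequence of operation steps.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and s[i - 1] == t[j - 1]:
            i, j = i - 1, j - 1                   # characters already match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("replace", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", s[i - 1]))
            i -= 1
        else:
            ops.append(("insert", t[j - 1]))
            j -= 1
    return d[n][m], list(reversed(ops))
```

The returned operation list is exactly the "operation steps of the minimum number of operations" used to align the text data with the standard manuscript.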
The matching process of each character string may involve multiple cases (i.e., multiple operation modes with different operation counts and operation steps); through dynamic programming, or the dynamic adjustment process described above, the embodiment of the present application finally determines the minimum number of operations, and the operation steps, needed to match all the characters.
In one embodiment, after determining the minimum number of operations, the method further comprises:
examining, against the fuzzy syllables, the unmatched recorded words adjacent to the mis-positioned word, and determining from the pronunciation whether the word at the erroneous position is wrong because of a fuzzy syllable;
and when the word at the erroneous position is determined to be wrong because of a fuzzy syllable, correcting it according to the fuzzy syllable.
In a specific implementation, suppose a passage in the standard manuscript means "you are well this year" and the text data obtained by audio transcription differs in one character. The syllables before and after the erroneous position ("jin nian de ni" and "hao") are consistent and correspond in position within the whole sentence; the embodiment of the present application then determines, from the fuzzy-syllable words (the unmatched recorded words), that the middle syllable "he" is an error caused by a fuzzy syllable, and corrects "he" to "hen".
In a specific implementation, if no such word exists in the matching record, the search and judgment need to be performed position by position in sequence.
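The fuzzy-syllable correction can be illustrated with a toy pronunciation table; a real system would consult a full pinyin lexicon and a learned confusion table, both of which are stand-in assumptions here:

```python
# Toy pronunciation table (character -> pinyin); a real system would use
# a full pinyin lexicon.
PINYIN = {"今": "jin", "年": "nian", "的": "de", "你": "ni",
          "和": "he", "很": "hen", "好": "hao"}

# Syllable pairs that ASR commonly confuses ("fuzzy syllables").
FUZZY = {("he", "hen"), ("hen", "he"), ("in", "ing"), ("ing", "in")}

def correct_fuzzy(trans: str, manuscript: str) -> str:
    """Where the surrounding characters already match, replace a mismatched
    character whose syllable is a known fuzzy variant of the manuscript's."""
    assert len(trans) == len(manuscript)  # sketch assumes aligned texts
    out = []
    for a, b in zip(trans, manuscript):
        if a != b and (PINYIN.get(a), PINYIN.get(b)) in FUZZY:
            out.append(b)   # fuzzy-syllable error: take the manuscript character
        else:
            out.append(a)   # matched, or mismatch not explained by fuzziness
    return "".join(out)
```

A mismatch whose syllables are not a known fuzzy pair is deliberately left alone, so only pronunciation-plausible errors are corrected.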
In one embodiment, the operation steps include adding, deleting, and/or replacing characters at positions in the text data that do not match characters in the standard manuscript of the program.
In a specific implementation, adding a character may mean adding a null character at a position in the text data that has no matching character in the standard manuscript of the program; for example, when a character of the standard manuscript is missing from the text data, a null character can be added at the corresponding position as a placeholder.
Deleting a character applies when the text data contains a character that does not exist in the standard manuscript; that character can then be deleted from the text data.
Replacing a character applies when a character in the text data has the same pinyin as the corresponding character in the standard manuscript but is a different character (a homophone); the character in the text data can then be replaced with the character from the standard manuscript.
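Applying these three operation types to the text data is mechanical; a sketch, where the `ops` list is assumed to come from the edit-distance matcher and each position indexes the current state of the text:

```python
def apply_ops(text: list, ops):
    """Apply (kind, position, char) operations to a list of characters.
    An empty string plays the role of the null character mentioned above;
    positions refer to the list as it stands when each op is applied."""
    for kind, pos, ch in ops:
        if kind == "add":
            text.insert(pos, ch)      # add a character (or null placeholder)
        elif kind == "delete":
            del text[pos]             # drop a character absent from the manuscript
        elif kind == "replace":
            text[pos] = ch            # swap in the manuscript's character
    return text
```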
In an embodiment, the transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file includes:
recognizing each frame of voice of the audio file into a state sequence;
obtaining a plurality of phonemes according to the state sequence of each frame of voice in the audio file;
generating one or more words from the plurality of phonemes;
matching the one or more words with each frame of voice content to obtain the relative time position of the voice fragment corresponding to each word on a time axis;
and determining the time stamp of each word according to the relative time position of the voice clip corresponding to each word on the time axis.
In specific implementation, each frame of speech may be recognized as a state, the states corresponding to each frame of speech are combined into phonemes, and then, a plurality of phonemes are combined into words.
Since speech is a continuous audio stream, it usually consists of a mixture of mostly stable states and partially dynamically changing states. Recognizing each frame of speech of the audio file as a state, and decoding the audio file with existing techniques such as Viterbi decoding, yields a state sequence, and the state sequence may correspond to a plurality of phonemes.
Human language generally comprises three elements, namely speech sounds, vocabulary, and grammar, and the basic vocabulary and grammatical structure determine the basic character of each language. Speech sounds can be understood as the acoustic form in which language is expressed, i.e. the sounds a person utters when speaking. Sound has three basic properties, loudness, pitch, and timbre, and the phonemes described in the embodiments of the present application can be understood as the smallest phonetic units divided from the timbre point of view.
Phonemes can further be divided into vowel phonemes and consonant phonemes according to whether the airflow is obstructed during pronunciation, for example: vowels such as a, o, and e; consonants such as b, p, and f.
Generally, in Chinese, 2 to 4 phonemes form a syllable (e.g., mei), and a syllable corresponds to a Chinese character; that is, 2 to 4 phonemes form a character/word (e.g., the three phonemes m, e, and i form the character "mei").
The audio file is usually played according to a time axis, after the one or more words are obtained, the one or more words can be matched with each frame of voice content, the relative time position of the voice fragment corresponding to each word on the time axis of the audio file is obtained, and the time stamp of each word is determined according to the relative time position of the voice fragment corresponding to each word on the time axis.
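Converting the frame-level alignment into per-word timestamps then follows directly from the frame indices; this sketch assumes the common 10 ms frame shift, which is a decoder-dependent parameter, not something the patent specifies:

```python
FRAME_SHIFT = 0.01  # seconds per frame; 10 ms is a common ASR frame shift

def word_timestamps(word_frames):
    """word_frames: list of (word, first_frame, last_frame) pairs from the
    decoder's frame-level alignment. Returns (word, start_s, end_s), i.e.
    the relative time position of each word's speech clip on the time axis."""
    return [(w, round(f0 * FRAME_SHIFT, 2), round((f1 + 1) * FRAME_SHIFT, 2))
            for w, f0, f1 in word_frames]
```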
Example two
Based on the same inventive concept, the embodiment of the present application provides a subtitle generating apparatus, and the principle of the apparatus for solving the technical problem is similar to that of a subtitle generating method, and repeated parts are not described again.
Fig. 2 is a schematic structural diagram of a subtitle generating apparatus according to a second embodiment of the present application.
As shown in the figure, the subtitle generating apparatus includes:
an audio determining module 201, configured to determine an audio file of a program;
the text generation module 202 is configured to transcribe the audio file into characters to obtain text data with time code information corresponding to the audio file;
the matching module 203 is used for matching the text data with the standard manuscript of the program;
and the time code attaching module 204 is configured to attach the time code information of the text data to the standard manuscript according to the matched result, so as to obtain a subtitle file with time code information.
By adopting the subtitle generating device provided by the embodiment of the application, after the audio file of a program is determined, the text data with time code information is obtained by performing voice recognition on the audio file, then the text data is matched with the standard manuscript of the program, the time code information of the text data is attached to the standard manuscript according to the matched result, the subtitle file with the time code information is obtained, and the audio and the subtitle file time code are synchronous.
In one embodiment, the matching module includes:
an operation determining unit, configured to determine the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program, together with the corresponding operation steps;
and a matching unit, configured to match the characters in the text data with the characters in the standard manuscript of the program according to the operation steps of the minimum number of operations.
In one embodiment, the operation determination unit includes:
a character operation subunit, configured to determine, for the ith character in the text data, the number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, and the corresponding operation steps, wherein 1 ≤ i ≤ N, the total number of characters of the text data, and 1 ≤ j ≤ M, the total number of characters in the standard manuscript; and to add 1 to i and repeat the previous step until all characters in the text data are traversed;
and an operation determining subunit, configured to determine that the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program is min(d[N, M]), with the operation steps being those corresponding to min(d[N, M]).
In one embodiment, the character operation subunit is specifically configured to:
if s[1…i] can be converted into t[1…j-1] in k operation steps, determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of appending t[j] to s[1…i];
if s[1…i-1] can be converted into t[1…j] in k operation steps, determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of removing s[i];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] ≠ t[j], determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of replacing s[i] with t[j];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] = t[j], determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k, the operation steps comprising the k operation steps;
and determine, as whichever of these cases yields the fewest operations, the minimum number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, together with the corresponding operation steps.
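The recurrence carried out by the character operation subunit is the classic minimum-edit-distance dynamic program. A minimal sketch (an illustration of the technique, not the patent's actual code) that returns both min(d[N, M]) and the operation steps via a backtrace:

```python
def min_ops(s, t):
    """Minimum number of add/delete/replace operations converting s into t,
    plus the operation steps. d[i][j] = min ops making s[1..i] equal t[1..j]."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # base case: delete all i characters
    for j in range(1, m + 1):
        d[0][j] = j                      # base case: add all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:     # case s[i] = t[j]: no extra operation
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i][j - 1],      # add t[j]
                                  d[i - 1][j],      # remove s[i]
                                  d[i - 1][j - 1])  # replace s[i] with t[j]
    # backtrace to recover the operation steps (0-based indices)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and d[i][j] == d[i - 1][j - 1]:
            ops.append(("keep", i - 1, j - 1)); i, j = i - 1, j - 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1)); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, None)); i = i - 1
        else:
            ops.append(("add", None, j - 1)); j = j - 1
    return d[n][m], list(reversed(ops))

print(min_ops("kitten", "sitting")[0])  # classic example: 3 operations
```

The "keep" steps mark characters that survive unmodified; as described later in the specification, those are the positions whose time codes can be carried over to the standard manuscript.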
In one embodiment, the apparatus further comprises:
a fuzzy syllable correction module, configured to: after the minimum number of operations is determined, examine, according to fuzzy syllables, the associated words before and after a word at an error position that do not match the record; determine, through pronunciation, whether the word at the error position is wrong due to fuzzy syllables; and, when it is determined that the word at the error position is wrong due to fuzzy syllables, correct the word at the error position according to the fuzzy syllables.
In one embodiment, the operation steps include adding characters, deleting characters, and/or replacing characters at the positions in the text data that do not match the characters in the standard manuscript of the program.
In one embodiment, the text generation module includes:
a first processing unit for recognizing each frame of speech of the audio file into a state sequence;
the second processing unit is used for obtaining a plurality of phonemes according to the state sequence of each frame of voice in the audio file;
a third processing unit for generating one or more words from the plurality of phonemes;
the fourth processing unit is used for matching the one or more words with each frame of voice content to obtain the relative time position of the voice clip corresponding to each word on a time axis;
and the fifth processing unit is used for determining the time stamp of each word according to the relative time position of the voice clip corresponding to each word on the time axis.
Example three
Based on the same inventive concept, embodiments of the present application further provide a computer storage medium, which is described below.
The computer storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the steps of the subtitle generating method according to embodiment one.
By adopting the computer storage medium provided by this embodiment of the application, after the audio file of a program is determined, speech recognition is performed on the audio file to obtain text data with time code information. The text data is then matched with the standard manuscript of the program, and the time code information of the text data is attached to the standard manuscript according to the matching result, yielding a subtitle file with time code information in which the audio and the subtitle time codes are synchronized.
Example four
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which is described below.
Fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
As shown, the electronic device includes memory 301 for storing one or more programs, and one or more processors 302; the one or more programs, when executed by the one or more processors, implement the subtitle generating method according to embodiment one.
By adopting the electronic device provided by this embodiment of the application, after the audio file of a program is determined, speech recognition is performed on the audio file to obtain text data with time code information. The text data is then matched with the standard manuscript of the program, and the time code information of the text data is attached to the standard manuscript according to the matching result, yielding a subtitle file with time code information in which the audio and the subtitle time codes are synchronized.
Example five
In order to facilitate the implementation of the present application, the embodiments of the present application are described with a specific example.
When a television station makes program subtitles, the following processes can be included:
first, an audio file of a program and a standard manuscript of the program are prepared.
The audio files and standard scripts may typically have a one-to-one correspondence.
Production can then begin.
Step 1, firstly, transcribing the audio file to obtain text data with time codes.
The audio file can be transcribed using an offline engine to obtain text data (also called the recognition result) with time codes corresponding to the audio file. Based on a deep full-sequence convolutional neural network, the transcription can convert long audio segments of more than 5 hours into text data, providing a basis for subsequent processing.
Step 2, matching the recognition result with the standard manuscript
Specifically, in this embodiment of the application, the recognition result is matched with the standard manuscript. Taking the text and sentence breaks of the standard manuscript as the standard, all text and sentence breaks of the recognition result are converted into those of the standard manuscript; the recognition result processed by the preset algorithm is then equivalent to the standard manuscript with time codes attached.
The algorithm of this embodiment can be understood as computing the minimum number of operations required to convert character string A into character string B using character operations, the operations here converting the recognition result into the standard manuscript. In general, the smaller the minimum number of operations between two strings, the more similar they are. If the two strings are equal, their minimum number of operations is 0 (no operation is required).
Assume the character string of the standard manuscript is A and the character string of the recognition result is B. The character string B of the recognition result is converted into the character string A of the standard manuscript under the minimum number of operations, and the time codes carried by the characters of B are attached to the character string A of the standard manuscript.
The specific algorithm may be: compare the full texts, perform addition, deletion, and replacement operations, and select the scheme with the fewest operation steps; the time code information of the recognition result is attached to the standard manuscript under that scheme. If a continuous stretch of text is inconsistent, the pinyin of the recognition result can be compared with the pinyin of the standard manuscript along the pinyin dimension; when a continuous stretch of pinyin matches, the time code information of the recognition result is attached to the corresponding matched text of the standard manuscript.
Let d[i, j] denote the minimum number of steps required to convert string s[1…i] into string t[1…j]. In the most basic cases: when i equals 0, i.e., string s is empty, d[0, j] = j, since j characters must be added to convert s into t; and when j equals 0, i.e., string t is empty, d[i, 0] = i, since i characters must be deleted.
In particular, a two-dimensional array may be used to store the value d [ i, j ].
Next, this embodiment of the application adds a dynamic programming idea on this basis: for s[1…i] to be converted into t[1…j] through a minimum number of add, delete, or replace operations, a minimum number of such operations must already have been performed on a shorter prefix, so that the conversion from s[1…i] to t[1…j] can be completed by performing at most one more operation. The states "before" the current one fall into the following three cases:
1) converting s [1 … i ] to t [1 … j-1] in k operations;
2) converting s [1.. i-1] to t [1.. j ] in k operations;
3) converting s [1 … i-1] to t [1 … j-1] in k steps;
for case 1, the matching is done by simply adding t [ j ] to s [1.. i ] at the end, so that a total of k +1 operations are required.
For case 2, s [ i ] only needs to be removed at the end, and then the k operations are done, so a total of k +1 operations are needed.
In case 3, it is only necessary to replace s[i] with t[j] at the end so that s[1…i] = t[1…j] holds, which requires a total of k+1 operations. If, in case 3, s[i] is exactly equal to t[j], the conversion can be completed using only k operations.
Finally, in order to ensure that the obtained operation times are always the minimum, the embodiment of the present application may select the least consumed one from the above three cases as the minimum operation time required for converting s [1.. i ] into t [1.. j ].
Because fuzzy sounds exist in the audio file, after the minimum number of operations is obtained, the associated words before and after a word at an error position that do not match the record can be further examined based on fuzzy syllables; whether the word is wrong due to a fuzzy sound is determined through pronunciation, so that the number of operations is further corrected.
Specifically, if there is no associated word matching the record, the search and determination need to be performed in order.
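A hedged sketch of fuzzy-syllable comparison follows. The patent does not give the exact rule set, so the pairs below (flat/retroflex initials zh/z, ch/c, sh/s, the n/l initial pair, and front/back nasal finals an/ang, en/eng, in/ing — common Mandarin "fuzzy sounds") are assumptions for illustration only:

```python
# Assumed fuzzy-sound pairs; the patent does not enumerate them.
FUZZY_INITIALS = {"zh": "z", "ch": "c", "sh": "s", "l": "n"}
FUZZY_FINALS = {"ang": "an", "eng": "en", "ing": "in"}

def normalize(syllable):
    """Collapse a pinyin syllable onto a canonical fuzzy form."""
    for k, v in FUZZY_INITIALS.items():
        if syllable.startswith(k):
            syllable = v + syllable[len(k):]
            break
    for k, v in FUZZY_FINALS.items():
        if syllable.endswith(k):
            syllable = syllable[:-len(k)] + v
            break
    return syllable

def fuzzy_equal(a, b):
    """True if two pinyin syllables match up to fuzzy sounds."""
    return normalize(a) == normalize(b)

print(fuzzy_equal("zhang", "zan"))  # True: zh/z and ang/an are fuzzy pairs
```

A mismatched character whose pinyin is fuzzy-equal to that of the manuscript character could then be treated as a fuzzy-syllable error rather than a genuine substitution.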
By way of example:
a (standard manuscript): the people in winter in late year are really big in snow, so called Rui Xue Mega Feng year is good at
B (recognition result): in winter in the past year, the snow is really big, bright, snow and good at home
The first matching method: addition, deletion, and replacement coexist (□ represents a changed position)
A (standard manuscript): the people in winter in late year are really big in snow, so called Rui Xue Mega Feng year is good at
B (recognition result): □ □ winter snowfall □ is □ □ Rui snow in the year after year, and □ is good megahead □
Number of errors: 7
The second matching method: pure replacement (italic font represents the corrected positions)
A (standard manuscript): the people in winter in late year are really big in snow, so called Rui Xue Mega Feng year is good at
B (recognition result): (shown as images in the original publication; every character is replaced)
Number of errors: 25. All characters are replaced, none of the original characters are retained, and no time code can be attached.
This embodiment of the application finally selects the modification scheme with the fewest errors and determines the final result:
7 operation steps; the time codes of the unmodified positions are retained; and the result after the matching processing of this embodiment is output.
step 3, adding time code
In the scheme with the minimum number of operation steps calculated in step 2, the time code information carried by the characters of the recognition result is attached to the standard manuscript. The time code result is as follows (underlined characters are characters with time code information):
A (standard manuscript) and B (recognition result): (rendered in the original with the time-coded characters underlined; characters at unmodified positions retain the time codes of the recognition result, while the □ positions in B carry none)
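The time code attachment of step 3 can be sketched as follows (an illustration, not the patent's code): given the operation steps of the minimum-operation alignment, the per-character time codes of the recognition result B are copied onto the standard manuscript A at the positions the alignment left unmodified. The `ops` format — `(op, index_in_B, index_in_A)` tuples with a "keep" tag for unmodified characters — is an assumption of this sketch, matching what an edit-distance backtrace would produce.

```python
def attach_timecodes(a_text, b_timecodes, ops):
    """Return a list of (char_of_A, timecode_or_None); only characters
    left unmodified by the alignment ("keep" steps) receive a time code."""
    timecoded = [None] * len(a_text)
    for op, bi, ai in ops:
        if op == "keep":                 # character survived unmodified
            timecoded[ai] = b_timecodes[bi]
    return list(zip(a_text, timecoded))

a = "abXc"                               # hypothetical standard manuscript
b_codes = [1.0, 2.0, 3.0]                # time codes of recognition result "abc"
ops = [("keep", 0, 0), ("keep", 1, 1), ("add", None, 2), ("keep", 2, 3)]
print(attach_timecodes(a, b_codes, ops))
# X, inserted from the manuscript, carries no time code
```

Positions without a time code (such as the inserted X above) would be the ones left for manual adjustment in step 4.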
Step 4, manual correction
The program producer can perform overall time-offset correction, detail adjustment, extension of subtitle display duration, optimization of sentence breaks as required for broadcasting, and other operations.
Step 5, outputting the result
SRT and TXT subtitle files are output, and the subtitle content can then be played according to the time code information.
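A small sketch of the SRT output of step 5 (the cue grouping and helper names here are assumptions; the patent only names the file format):

```python
def srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(cues):
    """cues: list of (start_seconds, end_seconds, text); returns SRT text
    with numbered cues separated by blank lines."""
    lines = []
    for idx, (start, end, text) in enumerate(cues, 1):
        lines.append(f"{idx}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(lines)

print(write_srt([(0.0, 2.5, "hello"), (2.5, 5.0, "world")]))
```

The start/end seconds per cue would come from the time codes attached in step 3, after the manual adjustments of step 4.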
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (16)

1. A subtitle generating method is characterized by comprising the following steps:
determining an audio file of a program;
transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file;
matching the text data with the standard manuscript of the program;
and adding the time code information of the text data to the standard manuscript according to the matched result to obtain a subtitle file with the time code information.
2. The method of claim 1, wherein matching the text data to a standard manuscript for the program comprises:
determining the minimum operation times for matching the characters in the text data with the characters in the standard manuscript of the program and the operation steps;
and matching the characters in the text data with the characters in the standard manuscript of the program according to the operation step of the minimum operation times.
3. The method of claim 2, wherein the determining the minimum number of operations and the operation steps for matching the characters in the text data with the characters in the standard manuscript of the program comprises:
determining, for the ith character in the text data, the number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, and the corresponding operation steps; wherein 1 ≤ i ≤ N, the total number of characters of the text data, and 1 ≤ j ≤ M, the total number of characters in the standard manuscript;
adding 1 to the i, and repeatedly executing the previous step until all characters in the text data are traversed;
and determining that the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program is min(d[N, M]), with the operation steps being those corresponding to min(d[N, M]).
4. The method according to claim 3, wherein the determining, for the ith character in the text data, the number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, and the corresponding operation steps, comprises:
if s[1…i] can be converted into t[1…j-1] in k operation steps, determining that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of appending t[j] to s[1…i];
if s[1…i-1] can be converted into t[1…j] in k operation steps, determining that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of removing s[i];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] ≠ t[j], determining that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of replacing s[i] with t[j];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] = t[j], determining that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k, the operation steps comprising the k operation steps;
and determining, as whichever of these cases yields the fewest operations, the minimum number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, together with the corresponding operation steps.
5. The method of claim 2, after determining the minimum number of operations, further comprising:
examining, according to fuzzy syllables, the associated words before and after a word at an error position that do not match the record;
determining whether the word in the wrong position is wrong due to fuzzy syllables through pronunciation;
and when the word at the wrong position is determined to be wrong due to the fuzzy syllables, correcting the word at the wrong position according to the fuzzy syllables.
6. The method of claim 2, wherein the operation steps comprise adding characters, deleting characters, and/or replacing characters at positions in the text data that do not match the characters in the standard manuscript of the program.
7. The method of claim 1, wherein the transcribing the audio file into words to obtain text data with time code information corresponding to the audio file comprises:
recognizing each frame of voice of the audio file into a state sequence;
obtaining a plurality of phonemes according to the state sequence of each frame of voice in the audio file;
generating one or more words from the plurality of phonemes;
matching the one or more words with each frame of voice content to obtain the relative time position of the voice clip corresponding to each word on a time axis;
and determining the time stamp of each word according to the relative time position of the voice clip corresponding to each word on the time axis.
8. A subtitle generating apparatus, comprising:
the audio determining module is used for determining an audio file of the program;
the text generation module is used for transcribing the audio file into characters to obtain text data with time code information corresponding to the audio file;
the matching module is used for matching the text data with the standard manuscript of the program;
and the time code attaching module is used for attaching the time code information of the text data to the standard manuscript according to the matched result to obtain the subtitle file with the time code information.
9. The apparatus of claim 8, wherein the matching module comprises:
an operation determining unit, configured to determine the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program, together with the corresponding operation steps;
and a matching unit, configured to match the characters in the text data with the characters in the standard manuscript of the program according to the operation steps of the minimum number of operations.
10. The apparatus of claim 9, wherein the operation determination unit comprises:
a character operation subunit, configured to determine, for the ith character in the text data, the number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, and the corresponding operation steps, wherein 1 ≤ i ≤ N, the total number of characters of the text data, and 1 ≤ j ≤ M, the total number of characters in the standard manuscript; and to add 1 to i and repeat the previous step until all characters in the text data are traversed;
and an operation determining subunit, configured to determine that the minimum number of operations for matching the characters in the text data with the characters in the standard manuscript of the program is min(d[N, M]), with the operation steps being those corresponding to min(d[N, M]).
11. The apparatus according to claim 10, wherein the character operation subunit is specifically configured to:
if s[1…i] can be converted into t[1…j-1] in k operation steps, determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of appending t[j] to s[1…i];
if s[1…i-1] can be converted into t[1…j] in k operation steps, determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of removing s[i];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] ≠ t[j], determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k+1, the operation steps comprising the k operation steps plus a step of replacing s[i] with t[j];
if s[1…i-1] can be converted into t[1…j-1] in k operation steps and s[i] = t[j], determine that the number of operations for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript is k, the operation steps comprising the k operation steps;
and determine, as whichever of these cases yields the fewest operations, the minimum number of operations min(d[i, j]) for which the character string s[1…i] in the text data equals the character string t[1…j] of the standard manuscript, together with the corresponding operation steps.
12. The apparatus of claim 9, further comprising:
a fuzzy syllable correction module, configured to: after the minimum number of operations is determined, examine, according to fuzzy syllables, the associated words before and after a word at an error position that do not match the record; determine, through pronunciation, whether the word at the error position is wrong due to fuzzy syllables; and, when it is determined that the word at the error position is wrong due to fuzzy syllables, correct the word at the error position according to the fuzzy syllables.
13. The apparatus of claim 9, wherein the operation steps comprise adding characters, deleting characters, and/or replacing characters at positions in the text data that do not match the characters in the standard manuscript of the program.
14. The apparatus of claim 8, wherein the text generation module comprises:
a first processing unit for recognizing each frame of speech of the audio file into a state sequence;
the second processing unit is used for obtaining a plurality of phonemes according to the state sequence of each frame of voice in the audio file;
a third processing unit for generating one or more words from the plurality of phonemes;
the fourth processing unit is used for matching the one or more words with each frame of voice content to obtain the relative time position of the voice clip corresponding to each word on a time axis;
and the fifth processing unit is used for determining the time stamp of each word according to the relative time position of the voice clip corresponding to each word on the time axis.
15. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
16. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 7.
CN201911047803.1A 2019-10-30 2019-10-30 Subtitle generating method and device, computer storage medium and electronic equipment Pending CN110798733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911047803.1A CN110798733A (en) 2019-10-30 2019-10-30 Subtitle generating method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110798733A true CN110798733A (en) 2020-02-14

Family

ID=69442217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911047803.1A Pending CN110798733A (en) 2019-10-30 2019-10-30 Subtitle generating method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110798733A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102801925A (en) * 2012-08-08 2012-11-28 无锡天脉聚源传媒科技有限公司 Method and device for adding and matching captions
CN103559214A (en) * 2013-10-11 2014-02-05 中国农业大学 Method and device for automatically generating video
CN104038804A (en) * 2013-03-05 2014-09-10 三星电子(中国)研发中心 Subtitle synchronization device and subtitle synchronization method based on speech recognition
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAODREAM: "Minimum Edit Distance Algorithm: Edit Distance (classic DP)", CSDN *
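The non-patent citation above refers to the classic minimum edit distance dynamic program (Levenshtein distance), which is the standard technique for aligning recognized speech text against a standard manuscript as this patent family does. A minimal illustrative sketch (not the patent's own implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP minimum edit distance: dp[i][j] is the minimum number of
    single-character insertions, deletions, or substitutions needed to turn
    a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # delete a[i-1]
                dp[i][j - 1] + 1,        # insert b[j-1]
                dp[i - 1][j - 1] + cost,  # match or substitute
            )
    return dp[m][n]
```

Tracing the DP table backwards from `dp[m][n]` recovers the actual character-level alignment, which is how recognized text can be mapped onto manuscript positions for subtitle timing.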

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863043A (en) * 2020-07-29 2020-10-30 Anhui Tingjian Technology Co., Ltd. Audio transfer file generation method, related equipment and readable storage medium
CN111863043B (en) * 2020-07-29 2022-09-23 Anhui Tingjian Technology Co., Ltd. Audio transfer file generation method, related equipment and readable storage medium
CN111970257A (en) * 2020-08-04 2020-11-20 Tencent Technology (Shenzhen) Co., Ltd. Manuscript display control method and device, electronic equipment and storage medium
CN111970257B (en) * 2020-08-04 2022-01-11 Tencent Technology (Shenzhen) Co., Ltd. Manuscript display control method and device, electronic equipment and storage medium
CN113066498A (en) * 2021-03-23 2021-07-02 Shanghai Zhangmen Technology Co., Ltd. Information processing method, apparatus and medium
US11763099B1 (en) 2022-04-27 2023-09-19 VoyagerX, Inc. Providing translated subtitle for video content
US11770590B1 (en) 2022-04-27 2023-09-26 VoyagerX, Inc. Providing subtitle for video content in spoken language
US11947924B2 (en) 2022-04-27 2024-04-02 VoyagerX, Inc. Providing translated subtitle for video content

Similar Documents

Publication Publication Date Title
JP4987623B2 (en) Apparatus and method for interacting with user by voice
CN110798733A (en) Subtitle generating method and device, computer storage medium and electronic equipment
US8155958B2 (en) Speech-to-text system, speech-to-text method, and speech-to-text program
JP5330450B2 (en) Topic-specific models for text formatting and speech recognition
US7668718B2 (en) Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
US20020065653A1 (en) Method and system for the automatic amendment of speech recognition vocabularies
CN110740275B (en) Nonlinear editing system
CN111986656B (en) Teaching video automatic caption processing method and system
EP1430474A1 (en) Correcting a text recognized by speech recognition through comparison of phonetic sequences in the recognized text with a phonetic transcription of a manually input correction word
CN106856091A (en) The automatic broadcasting method and system of a kind of multi-language text
US20130191125A1 (en) Transcription supporting system and transcription supporting method
CN108305611B (en) Text-to-speech method, device, storage medium and computer equipment
CN110870004A (en) Syllable-based automatic speech recognition
CN110781649A (en) Subtitle editing method and device, computer storage medium and electronic equipment
JP5271299B2 (en) Speech recognition apparatus, speech recognition system, and speech recognition program
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
Van Bael et al. Automatic phonetic transcription of large speech corpora
JP4436087B2 (en) Character data correction device, character data correction method, and character data correction program
JP5273844B2 (en) Subtitle shift estimation apparatus, subtitle shift correction apparatus, playback apparatus, and broadcast apparatus
JP2014134640A (en) Transcription device and program
JP2010164918A (en) Speech translation device and method
CN114333759A (en) Model training method, speech synthesis method, apparatus and computer program product
Santos et al. CORAA NURCSP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech
Zahorian et al. Open Source Multi-Language Audio Database for Spoken Language Processing Applications.
US11341961B2 (en) Multi-lingual speech recognition and theme-semanteme analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200214