CN112652329B - Text realignment method and device, electronic equipment and storage medium - Google Patents

Text realignment method and device, electronic equipment and storage medium

Info

Publication number
CN112652329B
Authority
CN
China
Prior art keywords
subtitle
edited
editing
word segmentation
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011248303.7A
Other languages
Chinese (zh)
Other versions
CN112652329A (en)
Inventor
徐文铭
刘敬晖
杨晶生
韩晓
杜春赛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202011248303.7A priority Critical patent/CN112652329B/en
Publication of CN112652329A publication Critical patent/CN112652329A/en
Application granted granted Critical
Publication of CN112652329B publication Critical patent/CN112652329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

The disclosure provides a text realignment method and device, an electronic device, and a storage medium. One embodiment of the method comprises: acquiring a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and an edited subtitle text which correspond to a pre-editing subtitle text; performing word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence; determining, by using a minimum edit distance algorithm, the pre-editing subtitle word in the pre-editing subtitle text that corresponds to each edited subtitle word in the edited subtitle text; and, for each edited subtitle participle in the edited subtitle word segmentation sequence, determining the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words it contains, and determining the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle. This embodiment reduces the complexity of the text realignment operation.

Description

Text realignment method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of voice recognition, in particular to a text realignment method, a text realignment device, electronic equipment and a storage medium.
Background
The alignment of audio and text in Automatic Speech Recognition (ASR) is a technology widely applied in the audio and video field. It mainly matches and aligns, in the time dimension, the recognition text obtained by automatically recognizing speech data with that speech data, so as to obtain subtitles. However, errors often occur during automatic speech recognition, and a user may edit the recognized text; since the edited text may differ from the text before editing, the edited text and the speech data need to be realigned to obtain realigned subtitles.
Disclosure of Invention
The embodiment of the disclosure provides a text realignment method and device, an electronic device and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text realignment method, including: acquiring a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and an edited subtitle text which correspond to a pre-editing subtitle text; performing word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence; determining, by using a minimum edit distance algorithm, the pre-editing subtitle word in the pre-editing subtitle text that corresponds to each edited subtitle word in the edited subtitle text; and, for each edited subtitle participle in the edited subtitle word segmentation sequence, determining the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, and determining the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle.
In some optional embodiments, the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text are generated as follows: carrying out voice recognition on voice data to be recognized to obtain a subtitle text before editing; generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized; and in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
In some optional embodiments, the method further comprises: and generating an edited subtitle word segmentation time sequence corresponding to the edited subtitle word segmentation sequence by using the edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence.
In some optional embodiments, the method further comprises: and generating an edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence and the edited subtitle time sequence.
In some optional embodiments, the method further comprises: and adding the edited subtitles to the target video to obtain the subtitle video after the addition and the editing.
In some optional embodiments, the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text are generated as follows: performing voice recognition on audio data in a target video to obtain a subtitle text before editing; generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized; and in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
In some optional embodiments, the performing a word segmentation process on the edited subtitle text to obtain the edited subtitle word segmentation sequence includes: determining the language type of the edited subtitle text; and performing word segmentation processing on the edited subtitle text according to a word segmentation processing method corresponding to the language type to obtain the edited subtitle word segmentation sequence.
In some optional embodiments, the determining, according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle, and the determining, according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle, the edited subtitle participle time of the edited subtitle participle, include: determining whether the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle belong to the same pre-editing subtitle participle; and, in response to determining that they do, determining the pre-editing subtitle participle time of that pre-editing subtitle participle as the edited subtitle participle time of the edited subtitle participle.
In some optional embodiments, the determining described above further includes: in response to determining that they do not belong to the same pre-editing subtitle participle, determining the pre-editing subtitle participle corresponding to each edited subtitle word in the edited subtitle participle respectively, and determining the edited subtitle participle time of the edited subtitle participle according to the determined pre-editing subtitle participle time of each such pre-editing subtitle participle.
In some optional embodiments, the generating an edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence, and the edited subtitle time sequence includes: for each edited subtitle participle in the edited subtitle word segmentation sequence, determining the text structure information closest to the position of the edited subtitle participle in the edited subtitle text as the punctuation and format information corresponding to the edited subtitle participle; and generating the edited subtitle by using each edited subtitle participle in the edited subtitle word segmentation sequence together with the punctuation, format information, and edited subtitle time corresponding to that participle.
In a second aspect, embodiments of the present disclosure provide a text realignment apparatus, including: an acquisition unit configured to acquire a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and an edited subtitle text which correspond to a pre-editing subtitle text; a word segmentation unit configured to perform word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence; a first determining unit configured to determine, for each edited subtitle word in the edited subtitle text, the pre-editing subtitle word corresponding to the edited subtitle word in the pre-editing subtitle text by using a minimum edit distance algorithm; and a second determining unit configured to determine, for each edited subtitle participle in the edited subtitle word segmentation sequence, the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, and to determine the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle.
In some optional embodiments, the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text are generated as follows: carrying out voice recognition on voice data to be recognized to obtain a subtitle text before editing; generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized; and in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
In some optional embodiments, the apparatus further comprises: a first generating unit configured to generate an edited subtitle word segmentation time series corresponding to the edited subtitle word segmentation sequence using an edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence.
In some optional embodiments, the apparatus further comprises: a second generating unit configured to generate an edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence, and the edited subtitle time sequence.
In some optional embodiments, the apparatus further comprises: and the subtitle adding unit is configured to add the edited subtitles to the target video to obtain an added edited subtitle video.
In some optional embodiments, the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text are generated as follows: performing voice recognition on audio data in a target video to obtain a subtitle text before editing; generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized; and in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
In some optional embodiments, the word segmentation unit is further configured to: determining the language type of the edited subtitle text; and performing word segmentation processing on the edited subtitle text according to a word segmentation processing method corresponding to the language type to obtain the edited subtitle word segmentation sequence.
In some optional embodiments, the second determining unit is further configured to: determine whether the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle belong to the same pre-editing subtitle participle; and, in response to determining that they do, determine the pre-editing subtitle participle time of that pre-editing subtitle participle as the edited subtitle participle time of the edited subtitle participle.
In some optional embodiments, the second determining unit is further configured to: in response to determining that they do not belong to the same pre-editing subtitle participle, determine the pre-editing subtitle participle corresponding to each edited subtitle word in the edited subtitle participle respectively, and determine the edited subtitle participle time of the edited subtitle participle according to the determined pre-editing subtitle participle time of each such pre-editing subtitle participle.
In some optional embodiments, the second generating unit is further configured to: for each edited subtitle participle in the edited subtitle word segmentation sequence, determine the text structure information closest to the position of the edited subtitle participle in the edited subtitle text as the punctuation and format information corresponding to the edited subtitle participle; and generate the edited subtitle by using each edited subtitle participle in the edited subtitle word segmentation sequence together with the punctuation, format information, and edited subtitle time corresponding to that participle.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect.
Existing audio-text realignment strategies are mainly based on supervised ASR acoustic-model algorithms, or require manually annotating the time of the corresponding words or sentences. Alignment algorithms based on an acoustic model require the acoustic model to be trained in advance on a large amount of labeled audio and text, which is a complex operation, while manual labeling incurs a high labor cost.
According to the text realignment method and device, electronic device, and storage medium provided by the present disclosure, each participle in the edited subtitle word segmentation sequence is matched, by means of a minimum edit distance algorithm, with a participle in the pre-editing subtitle word segmentation sequence, and the edited subtitle time corresponding to the edited subtitle participle is generated from the subtitle time of the matched pre-editing subtitle participle. In this way, unsupervised, character-level alignment between the edited subtitle text and the speech is achieved by reusing the existing time information of the original subtitle text, without involving an acoustic model, which simplifies the operation.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a text realignment method according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a text realignment method according to the present disclosure;
FIG. 4 is a schematic structural diagram of one embodiment of a text realignment apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the text realignment methods, apparatuses, electronic devices, and storage media of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a text processing application, a voice recognition application, a short video social application, a web conference application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a sound collecting device (e.g. a microphone), a video collecting device (e.g. a camera), and a display screen, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III, motion Picture Experts Group Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion Picture Experts Group Audio Layer 4), a laptop portable computer, a desktop computer, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the above-listed terminal apparatuses. It may be implemented as multiple software or software modules (e.g., to provide text realignment services), or as a single software or software module. And is not particularly limited herein.
In some cases, the text realignment method provided by the present disclosure may be performed by the terminal devices 101, 102, 103, and accordingly, the text realignment means may be provided in the terminal devices 101, 102, 103. In this case, the system architecture 100 may not include the server 105.
In some cases, the text realignment method provided by the present disclosure may be executed by the terminal devices 101, 102, and 103 and the server 105 together, for example, the steps of "obtaining a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and a post-editing subtitle text corresponding to the pre-editing subtitle text" may be executed by the terminal devices 101, 102, and 103, and the steps of "performing word segmentation processing on the post-editing subtitle text to obtain a post-editing subtitle word segmentation sequence" may be executed by the server 105. The present disclosure is not limited thereto. Accordingly, the text realignment means may be provided in the terminal apparatuses 101, 102, 103 and the server 105, respectively.
In some cases, the text realignment method provided by the present disclosure may be executed by the server 105, and accordingly, the text realignment apparatus may also be disposed in the server 105, and in this case, the system architecture 100 may not include the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a text realignment method according to the present disclosure is shown, the text realignment method including the steps of:
step 201, acquiring a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence and an edited subtitle text corresponding to the pre-editing subtitle text.
In the present embodiment, the execution subject of the text realignment method (for example, the terminal devices 101, 102, 103 shown in fig. 1) may first acquire the pre-editing subtitle text, and the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text corresponding to the pre-editing subtitle text.
Here, the pre-editing subtitle word segmentation sequence may be a word segmentation sequence obtained by performing word segmentation processing on a pre-editing subtitle text. And the pre-editing subtitle word segmentation time sequence may be a sequence corresponding to the pre-editing subtitle word segmentation sequence and used for representing the subtitle start time of each pre-editing subtitle word segmentation in the pre-editing subtitle word segmentation sequence. The pre-editing subtitle word segmentation time may take various forms. For example, the pre-editing subtitle word segmentation time may be represented by a timestamp.
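As a purely illustrative sketch (not specified by the present disclosure), one possible in-memory representation of the pre-editing subtitle word segmentation sequence and its word segmentation time sequence is shown below; the variable names and the use of timestamps in seconds are assumptions made for illustration:

```python
# Illustrative sketch only: one possible representation of the pre-editing
# subtitle word segmentation sequence and its time sequence. Names and values
# are assumptions, not taken from the present disclosure.
pre_edit_segments = ["I", "like", "Xiao Ming", "."]   # pre-editing subtitle participles
pre_edit_times = [0.00, 0.35, 0.80, 1.40]             # start time (seconds) of each participle

# The two sequences are index-aligned: pre_edit_times[i] is the subtitle
# start time of pre_edit_segments[i].
assert len(pre_edit_segments) == len(pre_edit_times)
```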
In some alternative embodiments, the pre-edit subtitle text may be a recognized text obtained by performing automatic speech recognition on the speech data to be recognized. Correspondingly, word segmentation processing can be carried out on the subtitle text before editing to obtain a subtitle word segmentation sequence before editing, and text and audio alignment is carried out on the subtitle word segmentation sequence before editing and the voice data to be recognized to obtain a subtitle time sequence before editing. It should be noted that, the time sequence of the subtitle before editing obtained by aligning the text and the audio based on the subtitle word segmentation sequence before editing and the voice data to be recognized may be implemented in various ways, which is not limited herein. For example, an acoustic model-based approach or a manual labeling-based approach may be employed. Based on the optional implementation mode, automatic voice recognition can be performed on the voice to be recognized to obtain a pre-editing subtitle text, a pre-editing subtitle word segmentation sequence and a pre-editing subtitle time sequence, and on the basis, the steps 201 to 204 are executed, so that the edited subtitle text can be realigned to the voice data to be recognized, and the matching degree between the edited subtitle and the voice data to be recognized is improved.
Here, the edited subtitle text may be a text obtained by manually editing and modifying the pre-editing subtitle text. Alternatively, the edited subtitle text may be a text processed by a Natural Language Processing (NLP) model.
Step 202, performing word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence.
In this embodiment, the execution main body may perform word segmentation processing on the edited subtitle text acquired in step 201 by using various implementation manners, so as to obtain an edited subtitle word segmentation sequence. For example, a word segmentation method based on string matching, a word segmentation method based on understanding, or a word segmentation method based on statistics, etc. may be employed.
In some alternative embodiments, step 202 may be performed as follows:
first, the language type of the edited subtitle text is determined.
In practice, when the edited subtitle text includes two or more language types, the language type with the higher priority of the two or more language types may be determined as the language type of the edited subtitle text according to a preset language type priority rule. For example, when both chinese and english are included in the edited subtitle text, chinese may be determined as the language type of the edited subtitle text.
And then, performing word segmentation processing on the edited subtitle text according to a word segmentation processing method corresponding to the language type to obtain an edited subtitle word segmentation sequence.
In practice, because words of different language types are expressed in different ways, the word segmentation method may differ accordingly. Therefore, after the language type is determined, the edited subtitle text can be segmented with the corresponding word segmentation method to obtain the edited subtitle word segmentation sequence. For example, the third-party Chinese word segmentation library jieba for Python may be used to segment edited subtitle text of the Chinese type. Edited subtitle text of the Japanese type can be segmented with MeCab (a Japanese morphological analysis system developed by Taku Kudo at the Nara Institute of Science and Technology). For edited subtitle text of the English type, whitespace-based segmentation can be used.
As an example, "i like xiao ming for the edited subtitle text. "the edited caption word segmentation sequence can be obtained by word segmentation", i/i like/small/bright/. "
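As a further illustration of this optional implementation, the sketch below selects a word segmentation method by language type; it assumes jieba for Chinese and whitespace splitting for English, and the helper name is hypothetical:

```python
import jieba  # the third-party Chinese word segmentation library mentioned above

def segment_edited_text(text: str, language: str) -> list:
    """Hypothetical helper: segment the edited subtitle text by language type."""
    if language == "zh":
        # Chinese: jieba.lcut returns the segmentation as a list of tokens
        return jieba.lcut(text)
    if language == "en":
        # English: whitespace-based segmentation
        return text.split()
    raise ValueError(f"unsupported language type: {language}")

# Example usage
print(segment_edited_text("我喜欢小明。", "zh"))        # e.g. ['我', '喜欢', '小明', '。']
print(segment_edited_text("I like Xiao Ming .", "en"))  # ['I', 'like', 'Xiao', 'Ming', '.']
```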
According to the optional implementation mode, the method and the device can realize targeted word segmentation of the edited text, and improve the word segmentation accuracy.
And step 203, determining, in the pre-editing subtitle text, the pre-editing subtitle word corresponding to each edited subtitle word in the edited subtitle text by using a minimum edit distance algorithm.
In order to realign the edited subtitle text to the original speech corresponding to the pre-editing subtitle text, i.e., to determine the subtitle time of each word in the edited subtitle text, the applicant found through practical research and analysis that the pre-editing subtitle text already has a corresponding pre-editing subtitle word segmentation time sequence. If the edited subtitle text can be realigned by reusing this pre-editing subtitle word segmentation time sequence, the edited subtitle text can be realigned to the original speech corresponding to the pre-editing subtitle text, and the amount and complexity of computation are greatly reduced compared with realignment using an acoustic model. However, to reuse the pre-editing subtitle word segmentation time sequence, the edited text and the pre-editing text first need to be aligned with each other; the applicant therefore realizes this alignment of the edited text and the pre-editing text with the minimum edit distance method.
The minimum edit distance between two strings, also called the Levenshtein distance, is the minimum number of edit operations required to transform one string into the other. The permitted edit operations include replacing one character with another, inserting a character, and deleting a character. The minimum edit distance can be computed with various dynamic programming algorithms; such an algorithm essentially finds a recursive description of the relationship between the edited subtitle text and the pre-editing subtitle text, so the correspondence between each edited subtitle word in the edited subtitle text and each pre-editing subtitle word in the pre-editing subtitle text can be obtained from the minimum edit distance algorithm.
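To make the dynamic programming concrete, the following is a minimal sketch of one possible realisation (an assumption for illustration, not necessarily the exact algorithm of the present disclosure). It fills the Levenshtein table and backtraces it to obtain, for every character of the edited subtitle text, the pre-editing character it corresponds to:

```python
def align_by_min_edit_distance(pre_text: str, post_text: str):
    """Sketch: minimum edit distance (Levenshtein) alignment.

    Returns, for each character position j of the edited (post-editing) text,
    the index of the pre-editing character it is matched or substituted with,
    or None if that character was inserted during editing.
    """
    m, n = len(pre_text), len(post_text)
    # dp[i][j] = minimum edit distance between pre_text[:i] and post_text[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pre_text[i - 1] == post_text[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete pre_text[i-1]
                           dp[i][j - 1] + 1,         # insert post_text[j-1]
                           dp[i - 1][j - 1] + cost)  # match / substitute

    # Backtrace to recover the character-level correspondence.
    mapping = [None] * n
    i, j = m, n
    while i > 0 and j > 0:
        cost = 0 if pre_text[i - 1] == post_text[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            mapping[j - 1] = i - 1    # matched or substituted character
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                    # character deleted from the pre-editing text
        else:
            j -= 1                    # character inserted during editing (stays None)
    return mapping
```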
It should be noted that what counts as a pre-editing subtitle word in the pre-editing subtitle text and as an edited subtitle word in the edited subtitle text may differ across language types. For example, for Chinese and Japanese, each character in the pre-editing subtitle text is a pre-editing subtitle word, and correspondingly each character in the edited subtitle text is an edited subtitle word. For English, each word in the pre-editing subtitle text is taken as a pre-editing subtitle word, and correspondingly each word in the edited subtitle text is taken as an edited subtitle word.
Step 204, for each edited subtitle participle in the edited subtitle word segmentation sequence, determining the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, and determining the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle.
In practice, in most cases the pre-editing subtitle words corresponding to the edited subtitle words included in an edited subtitle participle all belong to the same pre-editing subtitle participle. In that case, this pre-editing subtitle participle can be determined as the pre-editing subtitle participle corresponding to the edited subtitle participle, and its pre-editing subtitle participle time can be determined as the edited subtitle participle time of the edited subtitle participle. Here, the edited subtitle participle time of an edited subtitle participle may be used to represent the start time of that participle.
When the pre-editing subtitle words corresponding to the edited subtitle words included in an edited subtitle participle belong to different pre-editing subtitle participles, those different pre-editing subtitle participles are, in practice, at least two participles in the pre-editing subtitle word segmentation sequence (for example, at least two consecutive pre-editing subtitle participles). In this case, the execution body may determine the pre-editing subtitle participle corresponding to each edited subtitle word in the edited subtitle participle, and determine the edited subtitle participle time of the edited subtitle participle from the pre-editing subtitle participle times of the determined participles. For example, if the at least two pre-editing subtitle participles are denoted W1, W2, …, Wn in their order of appearance in the pre-editing subtitle word segmentation sequence, with W1 first and Wn last, the edited subtitle participle time of the edited subtitle participle may be determined as the pre-editing subtitle participle time of any one of W1, W2, …, Wn. For example, it may be determined as the middle one of the pre-editing subtitle participle times of W1, W2, …, Wn. For another example, it may be determined as the average of the pre-editing subtitle participle times of W1, W2, …, Wn. That is, the edited subtitle participle time is obtained as a compromise over the subtitle times of the corresponding pre-editing subtitle participles, so that the edited subtitle participle can be better aligned with the original speech. It should be noted that the pre-editing subtitle participle time and the edited subtitle participle time here are, respectively, the start time of the pre-editing subtitle participle and the start time of the edited subtitle participle.
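Building on the alignment sketch above, the hypothetical helper below shows one way to derive the edited subtitle participle times. It assumes the edited participles, joined together, reproduce the character string that was aligned; the averaging rule for participles spanning several pre-editing participles and the fallback for purely inserted text are assumptions of this sketch:

```python
def assign_post_edit_times(post_segments, post_char_to_pre_char,
                           pre_char_to_segment, pre_segment_times):
    """Sketch: derive each edited participle's start time from pre-editing times.

    post_char_to_pre_char : per edited character, index of the matched pre-editing
                            character (or None), e.g. from align_by_min_edit_distance.
    pre_char_to_segment   : per pre-editing character, index of the pre-editing
                            participle that contains it.
    pre_segment_times     : start time of each pre-editing participle.
    Assumes ''.join(post_segments) is the string that was aligned.
    """
    times, char_pos = [], 0
    for segment in post_segments:
        # Pre-editing participles touched by the characters of this edited participle.
        pre_ids = []
        for k in range(char_pos, char_pos + len(segment)):
            pre_char = post_char_to_pre_char[k]
            if pre_char is not None:
                pre_ids.append(pre_char_to_segment[pre_char])
        char_pos += len(segment)

        if not pre_ids:
            # Purely inserted text: reuse the previous time (a fallback assumed here).
            times.append(times[-1] if times else pre_segment_times[0])
        elif len(set(pre_ids)) == 1:
            # All characters map into one pre-editing participle: copy its start time.
            times.append(pre_segment_times[pre_ids[0]])
        else:
            # Spans several pre-editing participles: e.g. average their start times.
            ts = [pre_segment_times[i] for i in sorted(set(pre_ids))]
            times.append(sum(ts) / len(ts))
    return times
```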
According to the text realignment method provided by the embodiment of the present disclosure, each participle in the edited subtitle word segmentation sequence is matched, by means of a minimum edit distance algorithm, with a participle in the pre-editing subtitle word segmentation sequence, and the edited subtitle time corresponding to the edited subtitle participle is generated from the subtitle time of the matched pre-editing subtitle participle. In this way, unsupervised, character-level alignment between the edited subtitle text and the speech is achieved by reusing the existing time information of the original subtitle text, without involving an acoustic model, which simplifies the operation.
With continued reference to fig. 3, a flow 300 of yet another embodiment of a text realignment method according to the present disclosure is shown. The text realignment method comprises the following steps:
step 301, acquiring audio data in the target video as voice data to be recognized.
In this embodiment, the execution subject of the text realignment method (e.g., the terminal devices 101, 102, 103 shown in fig. 1) may locally acquire the target video.
For example, the execution main body may acquire video data in real time from a video acquisition device (e.g., a camera) in data communication with the execution main body every preset time period (e.g., two seconds), and then may take the video data acquired in real time as a target video.
For another example, the execution main body may acquire a locally stored video as the target video.
In this embodiment, the execution subject of the text realignment method (e.g., the server 105 shown in fig. 1) may also acquire the target video remotely from other electronic devices (e.g., the terminal devices 101, 102, 103 shown in fig. 1) connected to it through a network.
Then, the executing body may acquire the audio data in the target video as the voice data to be recognized.
And step 302, performing voice recognition on voice data to be recognized to obtain a subtitle text before editing.
Here, various existing or future developed voice recognition methods may be adopted to perform voice recognition on the voice data to be recognized acquired in step 301 to obtain the subtitle text before editing, and the specific voice recognition method adopted in the present application is not specifically limited.
And step 303, generating a pre-editing subtitle word segmentation sequence and a pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized.
The execution main body may first perform word segmentation processing on the pre-editing subtitle text to obtain a pre-editing subtitle word segmentation sequence, and perform text and audio alignment on the pre-editing subtitle word segmentation sequence and the to-be-recognized voice data to obtain a pre-editing subtitle time sequence. It should be noted that, the time sequence of the subtitle before editing obtained by aligning the text and the audio based on the subtitle word segmentation sequence before editing and the voice data to be recognized may be implemented in various ways, which is not limited herein. For example, an acoustic model-based approach or a manual labeling-based approach may be employed. The pre-editing subtitle word segmentation time sequence may be a sequence corresponding to the pre-editing subtitle word segmentation sequence and used for representing the subtitle start time of each pre-editing subtitle word segmentation in the pre-editing subtitle word segmentation sequence. The pre-editing subtitle word segmentation time may take various forms. For example, the pre-editing subtitle word segmentation time may be represented by a timestamp.
Step 304, presenting the pre-editing subtitle text.
In this embodiment, the execution subject may present the pre-editing subtitle text obtained in step 302. For example, it may be presented on a display in data communication with the execution body described above.
Step 305, in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the post-editing subtitle text.
Here, the execution body may provide an interface for a user to perform an editing operation on the pre-editing subtitle text, and determine the subtitle text after the editing operation as the post-editing subtitle text when the editing operation of the user on the pre-editing subtitle text is monitored.
It should be noted that, the executing entity may directly execute step 304 and step 305 after executing step 302 and then execute step 303, or may execute step 304 and step 305 after executing step 303, and the present application is not limited in particular.
And step 306, performing word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence.
Step 307, determining, by using a minimum edit distance algorithm, for each edited subtitle word in the edited subtitle text, a pre-edited subtitle word corresponding to the edited subtitle word in the pre-edited subtitle text.
Step 308, for each edited subtitle participle in the edited subtitle word segmentation sequence, determining the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, and determining the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle.
In the present embodiment, the specific operations of step 306, step 307, and step 308 and the technical effects thereof are substantially the same as the operations and effects of step 202, step 203, and step 204 in the embodiment shown in fig. 2, and are not repeated herein.
Step 309, generating an edited subtitle word segmentation time sequence corresponding to the edited subtitle word segmentation sequence by using the edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence.
The edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence is obtained through the step 308, and the edited subtitle word segmentation time sequence corresponding to the edited subtitle word segmentation sequence can be generated by using the edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence according to the appearance sequence of the edited subtitle word segmentation in the edited subtitle word segmentation sequence.
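Continuing the hypothetical names of the earlier sketches, the edited subtitle word segmentation time sequence is then simply the list of per-participle start times collected in order of appearance:

```python
# Sketch (hypothetical names): chain the earlier helpers end to end.
# pre_text / post_text are assumed to hold the pre-editing and edited subtitle
# texts, with ''.join(pre_edit_segments) == pre_text for the indexing below.
pre_char_to_segment = [seg_idx
                       for seg_idx, seg in enumerate(pre_edit_segments)
                       for _ in seg]                                    # participle index per character
post_segments = segment_edited_text(post_text, language)               # step 306
char_mapping = align_by_min_edit_distance(pre_text, post_text)         # step 307
post_edit_times = assign_post_edit_times(post_segments, char_mapping,
                                         pre_char_to_segment, pre_edit_times)  # steps 308-309
```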
And 310, generating an edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence and the edited subtitle time sequence.
In this embodiment, the execution main body may generate the edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence, and the edited subtitle time sequence in various implementations.
In practice, in addition to the participles of the edited subtitle word segmentation sequence, the edited subtitle text often carries various text structure information, which may include punctuation and format information (e.g., paragraph structure information). In order to restore the characters, words, and text structure information of the edited subtitle text in the edited subtitle, the execution body may, when performing the word segmentation processing of step 306 or at this point, determine, for each edited subtitle participle in the edited subtitle word segmentation sequence, the text structure information closest to the position of that participle in the edited subtitle text as the text structure information corresponding to that participle, and then generate the edited subtitle by using each edited subtitle participle in the edited subtitle word segmentation sequence together with its corresponding text structure information and edited subtitle time. The edited subtitle corresponding to the edited subtitle text is thus completely restored: it contains all characters, words, and text structure information (such as punctuation and format information) of the edited subtitle text together with the corresponding subtitle times.
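As one possible sketch of this step (the helper name and the way the structure information is located are assumptions), the punctuation and other characters that follow each edited participle in the edited subtitle text, up to the next participle, can be attached to that participle together with its time:

```python
def attach_text_structure(post_text, post_segments, post_times):
    """Sketch: attach the closest following text structure information
    (punctuation, spaces, line breaks) to each edited subtitle participle.
    Assumes the participles appear in post_text in order; names are illustrative.
    """
    # Locate each participle in the edited subtitle text.
    positions, cursor = [], 0
    for segment in post_segments:
        begin = post_text.index(segment, cursor)
        positions.append(begin)
        cursor = begin + len(segment)

    entries = []
    for i, (segment, start) in enumerate(zip(post_segments, post_times)):
        end = positions[i] + len(segment)
        nxt = positions[i + 1] if i + 1 < len(positions) else len(post_text)
        structure = post_text[end:nxt]   # text structure info closest to this participle
        entries.append({"text": segment + structure, "start": start})
    return entries
```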
And 311, adding the edited subtitles to the target video to obtain the subtitle video after editing is added.
The edited subtitle has already been obtained in step 310. Here, the execution subject may add the characters, words, and text structure information in the edited subtitle to the corresponding times in the target video, according to the edited subtitle times of the corresponding edited subtitle participles, so as to obtain the video with the edited subtitle added. In this way, the subtitle text edited by the user is realigned and added to the target video, so that the subtitle and the video content of the target video match each other better.
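For illustration only (the present disclosure does not prescribe a subtitle container or format), the realigned result could, for example, be serialised as a standard SRT file and then muxed into or burned onto the target video with an external tool such as ffmpeg; the sentence-level grouping of cues and the approximation of end times by the next cue's start time are assumptions of this sketch:

```python
def to_srt(entries, video_duration):
    """Sketch: render the edited subtitle entries (from attach_text_structure)
    as SRT text. End times are approximated by the next cue's start time."""
    def ts(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Group participles into one cue per sentence, using the attached punctuation.
    cues, current, start = [], "", None
    for entry in entries:
        if start is None:
            start = entry["start"]
        current += entry["text"]
        if current.rstrip().endswith(("。", "！", "？", ".", "!", "?")):
            cues.append((start, current.strip()))
            current, start = "", None
    if current:
        cues.append((start, current.strip()))

    blocks = []
    for i, (begin, text) in enumerate(cues):
        end = cues[i + 1][0] if i + 1 < len(cues) else video_duration
        blocks.append(f"{i + 1}\n{ts(begin)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)
```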
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the process 300 of the text realignment method in this embodiment highlights that the user can edit the subtitle text obtained after performing voice recognition on the voice data in the target video, and realign the edited subtitle text to the target video and add the subtitle text to the target video. Therefore, the scheme described in this embodiment can achieve realignment of the subtitle text edited by the user to the target video without adopting a complex acoustic model or performing alignment through manual annotation, so that the matching degree of the target video and the subtitle content is improved, the labor cost is reduced compared with manual annotation realignment, the computational complexity is reduced compared with a method based on an acoustic model, and the user experience is improved.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a text realignment apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the text realignment apparatus 400 of the present embodiment includes: an acquisition unit 401, a word segmentation unit 402, a first determining unit 403, and a second determining unit 404. The acquisition unit 401 is configured to acquire a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and an edited subtitle text which correspond to a pre-editing subtitle text; the word segmentation unit 402 is configured to perform word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence; the first determining unit 403 is configured to determine, for each edited subtitle word in the edited subtitle text, the pre-editing subtitle word corresponding to the edited subtitle word in the pre-editing subtitle text by using a minimum edit distance algorithm; and the second determining unit 404 is configured to determine, for each edited subtitle participle in the edited subtitle word segmentation sequence, the pre-editing subtitle participle in the pre-editing subtitle word segmentation sequence that corresponds to the edited subtitle participle according to the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle, and to determine the edited subtitle participle time of the edited subtitle participle according to the pre-editing subtitle participle time of the determined pre-editing subtitle participle.
In this embodiment, specific processes of the obtaining unit 401, the word segmentation unit 402, the first determination unit 403, and the second determination unit 404 of the text realignment apparatus 400 and technical effects brought by the specific processes may respectively refer to relevant descriptions of step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional embodiments, the apparatus 400 may further include: a first generating unit 405 configured to generate an edited subtitle word segmentation time series corresponding to the edited subtitle word segmentation series by using an edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation series.
In some optional embodiments, the apparatus 400 may further include: a second generating unit 406 configured to generate an edited subtitle based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence, and the edited subtitle time sequence.
In some optional embodiments, the apparatus may further include: and a subtitle adding unit 407 configured to add the edited subtitle to the target video, resulting in an added edited subtitle video.
In some optional embodiments, the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the post-editing subtitle text may be generated as follows: performing voice recognition on audio data in a target video to obtain a subtitle text before editing; generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle time sequence based on the pre-editing subtitle text and the voice data to be recognized; and in response to detecting the editing operation of the user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
In some optional embodiments, the word segmentation unit 402 may be further configured to: determining the language type of the edited subtitle text; and performing word segmentation processing on the edited subtitle text according to a word segmentation processing method corresponding to the language type to obtain the edited subtitle word segmentation sequence.
In some optional embodiments, the second determining unit 404 may be further configured to: determine whether the pre-editing subtitle words corresponding to the edited subtitle words included in the edited subtitle participle belong to the same pre-editing subtitle participle; and, in response to determining that they do, determine the pre-editing subtitle participle time of that pre-editing subtitle participle as the edited subtitle participle time of the edited subtitle participle.
In some optional embodiments, the second determining unit 404 may be further configured to: in response to determining that they do not belong to the same pre-editing subtitle participle, determine the pre-editing subtitle participle corresponding to each edited subtitle word in the edited subtitle participle respectively, and determine the edited subtitle participle time of the edited subtitle participle according to the determined pre-editing subtitle participle time of each such pre-editing subtitle participle.
In some optional embodiments, the second generating unit 406 may be further configured to: for each edited subtitle participle in the edited subtitle word segmentation sequence, determine the text structure information closest to the position of the edited subtitle participle in the edited subtitle text as the punctuation and format information corresponding to the edited subtitle participle; and generate the edited subtitle by using each edited subtitle participle in the edited subtitle word segmentation sequence together with the punctuation, format information, and edited subtitle time corresponding to that participle.
It should be noted that, for details of implementation and technical effects of each unit in the text realignment apparatus provided in the embodiment of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing the electronic device of the present disclosure is shown. The computer system 500 shown in fig. 5 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer system 500 may include a processing device (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage device 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer system 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, and the like; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the computer system 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer system 500 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the text realignment method as shown in the embodiment shown in fig. 2 and its alternative embodiments, and/or the text realignment method as shown in the embodiment shown in fig. 3 and its alternative embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Here, the name of a unit does not constitute a limitation of the unit itself in some cases, and for example, the acquisition unit may also be described as a "unit that acquires a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and a post-editing subtitle text corresponding to the pre-editing subtitle text".
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (13)

1. A text realignment method, comprising:
acquiring a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence and an edited subtitle text which correspond to the pre-editing subtitle text;
performing word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence;
for each edited subtitle word in the edited subtitle text, determining, by using a minimum edit distance algorithm, a pre-editing subtitle word in the pre-editing subtitle text corresponding to the edited subtitle word;
and for each edited subtitle word segmentation in the edited subtitle word segmentation sequence, determining, in the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation corresponding to the edited subtitle word segmentation according to the pre-editing subtitle word corresponding to each edited subtitle word included in the edited subtitle word segmentation, judging whether the edited subtitle words included in the edited subtitle word segmentation correspond to the same determined pre-editing subtitle word segmentation, and determining the edited subtitle word segmentation time of the edited subtitle word segmentation according to the judgment result and the pre-editing subtitle word segmentation time of the determined pre-editing subtitle word segmentation.
2. The method of claim 1, wherein the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the edited subtitle text are generated by:
performing voice recognition on voice data to be recognized to obtain the pre-editing subtitle text;
generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle word segmentation time sequence based on the pre-editing subtitle text and the voice data to be recognized;
and in response to detecting an editing operation of a user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
3. The method of claim 1, wherein the method further comprises:
and generating an edited subtitle word segmentation time sequence corresponding to the edited subtitle word segmentation sequence by using the edited subtitle word segmentation time corresponding to each edited subtitle word segmentation in the edited subtitle word segmentation sequence.
4. The method of claim 1, wherein the method further comprises:
and generating the edited subtitles based on the text structure information of the edited subtitle text, the edited subtitle word segmentation sequence and the edited subtitle time sequence.
5. The method of claim 4, wherein the method further comprises:
and adding the edited subtitles to a target video to obtain an edited subtitle video.
6. The method of claim 1, wherein the pre-editing subtitle text, the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation time sequence, and the edited subtitle text are generated by:
performing voice recognition on audio data in a target video to obtain the pre-editing subtitle text;
generating the pre-editing subtitle word segmentation sequence and the pre-editing subtitle word segmentation time sequence based on the pre-editing subtitle text and the voice data to be recognized, wherein the voice data to be recognized is the audio data in the target video;
and in response to detecting an editing operation of a user on the pre-editing subtitle text, determining the subtitle text after the editing operation as the edited subtitle text.
7. The method according to any one of claims 1 to 6, wherein the performing a word segmentation process on the edited subtitle text to obtain the edited subtitle word segmentation sequence includes:
determining the language type of the edited subtitle text;
and performing word segmentation processing on the edited subtitle text according to a word segmentation processing method corresponding to the language type to obtain an edited subtitle word segmentation sequence.
8. The method according to any one of claims 1 to 6, wherein the determining, in the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation corresponding to the edited subtitle word segmentation according to the pre-editing subtitle word corresponding to each edited subtitle word included in the edited subtitle word segmentation, judging whether the edited subtitle words included in the edited subtitle word segmentation correspond to the same determined pre-editing subtitle word segmentation, and determining the edited subtitle word segmentation time of the edited subtitle word segmentation according to the judgment result and the pre-editing subtitle word segmentation time of the determined pre-editing subtitle word segmentation includes:
judging whether the edited subtitle words included in the edited subtitle word segmentation correspond to the same determined pre-editing subtitle word segmentation;
and in response to determining that they do, determining the pre-editing subtitle word segmentation time of the determined pre-editing subtitle word segmentation as the edited subtitle word segmentation time of the edited subtitle word segmentation.
9. The method according to claim 8, wherein the determining, in the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation corresponding to the edited subtitle word segmentation according to the pre-editing subtitle word corresponding to each edited subtitle word included in the edited subtitle word segmentation, judging whether the edited subtitle words included in the edited subtitle word segmentation correspond to the same determined pre-editing subtitle word segmentation, and determining the edited subtitle word segmentation time of the edited subtitle word segmentation according to the judgment result and the pre-editing subtitle word segmentation time of the determined pre-editing subtitle word segmentation further comprises:
and in response to determining that they do not, respectively determining the pre-editing subtitle word segmentation corresponding to each edited subtitle word included in the edited subtitle word segmentation, and determining the edited subtitle word segmentation time of the edited subtitle word segmentation according to the pre-editing subtitle word segmentation time of each determined pre-editing subtitle word segmentation.
10. The method of any of claims 4-6, wherein the generating an edited subtitle based on text structure information of the edited subtitle text, the edited subtitle word segmentation sequence, and the edited subtitle time sequence comprises:
for each edited subtitle word segmentation in the edited subtitle word segmentation sequence, determining, as the punctuation mark and position information corresponding to the edited subtitle word segmentation, the text structure information whose position in the edited subtitle text is closest to the edited subtitle word segmentation;
and generating the edited subtitles by using each edited subtitle word segmentation in the edited subtitle word segmentation sequence together with the punctuation mark, the format information, and the edited subtitle time corresponding to the edited subtitle word segmentation.
11. A text realignment apparatus, comprising:
an acquisition unit configured to acquire a pre-editing subtitle word segmentation sequence, a pre-editing subtitle word segmentation time sequence, and an edited subtitle text corresponding to the pre-editing subtitle text;
the word segmentation unit is configured to perform word segmentation processing on the edited subtitle text to obtain an edited subtitle word segmentation sequence;
a first determining unit configured to determine, for each edited subtitle word in the edited subtitle text, a pre-editing subtitle word corresponding to the edited subtitle word in the pre-editing subtitle text by using a minimum edit distance algorithm;
a second determining unit configured to, for each edited subtitle word segmentation in the edited subtitle word segmentation sequence, determine, in the pre-editing subtitle word segmentation sequence, the pre-editing subtitle word segmentation corresponding to the edited subtitle word segmentation according to the pre-editing subtitle word corresponding to each edited subtitle word included in the edited subtitle word segmentation, judge whether the edited subtitle words included in the edited subtitle word segmentation correspond to the same determined pre-editing subtitle word segmentation, and determine the edited subtitle word segmentation time of the edited subtitle word segmentation according to the judgment result and the pre-editing subtitle word segmentation time of the determined pre-editing subtitle word segmentation.
12. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-10.
CN202011248303.7A 2020-11-10 2020-11-10 Text realignment method and device, electronic equipment and storage medium Active CN112652329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011248303.7A CN112652329B (en) 2020-11-10 2020-11-10 Text realignment method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011248303.7A CN112652329B (en) 2020-11-10 2020-11-10 Text realignment method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112652329A CN112652329A (en) 2021-04-13
CN112652329B (en) 2022-03-18

Family

ID=75346941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011248303.7A Active CN112652329B (en) 2020-11-10 2020-11-10 Text realignment method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112652329B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761865A (en) * 2021-08-30 2021-12-07 Beijing Zitiao Network Technology Co Ltd Sound and text realignment and information presentation method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244022B (en) * 2015-09-28 2019-10-18 科大讯飞股份有限公司 Audio-video method for generating captions and device
CN105931641B (en) * 2016-05-25 2020-11-10 腾讯科技(深圳)有限公司 Subtitle data generation method and device

Also Published As

Publication number Publication date
CN112652329A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112115706B (en) Text processing method and device, electronic equipment and medium
US11308942B2 (en) Method and apparatus for operating smart terminal
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN109241286B (en) Method and device for generating text
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN111986655B (en) Audio content identification method, device, equipment and computer readable medium
CN110534085B (en) Method and apparatus for generating information
CN109582825B (en) Method and apparatus for generating information
WO2020052069A1 (en) Method and apparatus for word segmentation
CN107680584B (en) Method and device for segmenting audio
CN110138654B (en) Method and apparatus for processing speech
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN111078849A (en) Method and apparatus for outputting information
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN111027333A (en) Chapter translation method and device
WO2022237448A1 (en) Method and device for generating speech recognition training set
CN111027332B (en) Method and device for generating translation model
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113923479A (en) Audio and video editing method and device
US10910014B2 (en) Method and apparatus for generating video
CN111126078B (en) Translation method and device
CN113221554A (en) Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant