CN114143613A - Video subtitle time alignment method, system and storage medium


Info

Publication number
CN114143613A
Authority
CN
China
Prior art keywords
paragraph
ocr recognition
recognition result
sentence
substring
Prior art date
Legal status
Granted
Application number
CN202111470116.8A
Other languages
Chinese (zh)
Other versions
CN114143613B (en)
Inventor
程梓益
Current Assignee
Beijing Moviebook Science and Technology Co., Ltd.
Original Assignee
Beijing Moviebook Science and Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science and Technology Co., Ltd.
Priority to CN202111470116.8A
Publication of CN114143613A
Application granted
Publication of CN114143613B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8547 Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The application discloses a video subtitle time alignment method, system, and storage medium. The method first obtains an original video with subtitles and a description text whose content corresponds to the subtitles in the original video; crops the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set; inputs the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results; matches the OCR recognition results with each paragraph of the description text through a common substring algorithm to determine the first sentence and the last sentence of each paragraph among the OCR recognition results; and determines the duration of each paragraph of the description text in the original video according to the timestamps corresponding to its first sentence and last sentence. The technical scheme provided by the embodiments of the application thus improves the accuracy of time matching between video subtitles and the description text.

Description

Video subtitle time alignment method, system and storage medium
Technical Field
The present invention relates to the field of multimedia technologies, and in particular, to a method, a system, and a storage medium for video subtitle time alignment.
Background
With the continuous development of internet and multimedia technology, video has become a popular information carrier. To better present video content, subtitles corresponding to the video are generally displayed while a user watches it, and a description text corresponding to the subtitles often exists as well; this description text is typically divided into several, or even more than ten, paragraphs.
In the prior art, when paragraphs of the description text are time-matched with video subtitles, a common approach is to use OCR to recognize the characters of the current frame, record the current time, and then match the result against the corresponding text. However, this approach cannot complete the task automatically because of wrongly written characters, uncommon words, and interference from the video background.
Disclosure of Invention
Based on this, the embodiments of the present application provide a method, a system, and a storage medium for video subtitle time alignment, which can improve the accuracy of time matching between a video subtitle and a description text.
In a first aspect, a method for video subtitle time alignment is provided, the method including:
acquiring an original video with subtitles and a description text, wherein the content of the description text corresponds to the content of the subtitles in the original video;
cropping the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set, wherein each image in the set carries its corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results;
matching the OCR recognition results with each paragraph of the description text through a common substring algorithm, and determining the first sentence and the last sentence of each paragraph among the OCR recognition results;
and determining the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
Optionally, matching the OCR recognition results with each paragraph of the description text through a common substring algorithm and determining the first sentence of each paragraph among the OCR recognition results includes:
comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings, and selecting the first substring from them, wherein the first substring denotes the first continuous common substring;
when the first substring falls within the starting character range of the target paragraph, performing a character comparison between the OCR recognition result corresponding to the first substring and the characters in the starting character range;
when the start position of the substring obtained by the character comparison is smaller than the first-sentence threshold, taking the currently compared OCR recognition result as matching the first sentence of the target paragraph;
and traversing each paragraph of the description text to determine the first sentence of each paragraph among the OCR recognition results.
Optionally, after comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings and selecting the first substring, the method further includes:
when the first substring falls within the ending character range of the target paragraph, taking the timestamp of the OCR recognition result corresponding to the first substring as the start time of the paragraph following the target paragraph.
Optionally, matching the OCR recognition results with each paragraph of the description text through a common substring algorithm and determining the last sentence of each paragraph among the OCR recognition results includes:
comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings, and selecting the tail substring from them, wherein the tail substring denotes the last continuous common substring;
when the tail substring falls within the ending character range of the target paragraph, performing a character comparison between the OCR recognition result corresponding to the tail substring and the characters in the ending character range;
when the distance between the end position of the substring obtained by the character comparison and the end of the paragraph is smaller than the last-sentence threshold, taking the currently compared OCR recognition result as matching the last sentence of the target paragraph;
and traversing each paragraph of the description text to determine the last sentence of each paragraph among the OCR recognition results.
Optionally, after comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings and selecting the tail substring, the method further includes:
when the tail substring falls within the starting character range of the target paragraph, taking the timestamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
Optionally, inputting the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results includes:
checking the OCR recognition results, and keeping for matching and storage only those results that contain Chinese and whose confidence is greater than a preset threshold.
Optionally, determining the duration of the description text in the original video according to the timestamps corresponding to its first sentence and last sentence further includes:
when the timestamp corresponding to the first sentence overlaps the timestamp corresponding to the last sentence, taking the duration obtained after merging the time ranges as the output result.
Optionally, the description text includes wrongly written words and/or uncommon words.
In a second aspect, a video subtitle time alignment system is provided, the system including:
an acquisition module, configured to acquire an original video with subtitles and a description text, wherein the content of the description text corresponds to the content of the subtitles in the original video;
a cropping module, configured to crop the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set, wherein each image in the set carries its corresponding timestamp in the original video;
a recognition module, configured to input the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results;
a matching module, configured to match the OCR recognition results with each paragraph of the description text through a common substring algorithm to determine the first sentence and the last sentence of each paragraph among the OCR recognition results;
and a determination module, configured to determine the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the video subtitle time alignment method according to any implementation of the first aspect.
According to the technical scheme provided by the embodiments of the application, an original video with subtitles and a description text are obtained, wherein the content of the description text corresponds to the content of the subtitles in the original video; the subtitle region of the original video is cropped at a preset frame-sampling interval to obtain a subtitle region image set; the subtitle region image set is input into an OCR model for recognition to obtain time-stamped OCR recognition results; the OCR recognition results are matched with each paragraph of the description text through a common substring algorithm to determine the first sentence and the last sentence of each paragraph; and the duration of each paragraph of the description text in the original video is determined according to the timestamps corresponding to its first sentence and last sentence. The technical scheme thus overcomes the time-matching problems caused by wrongly written characters, uncommon words, and video background interference, and improves the accuracy of time matching between video subtitles and the description text.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other embodiments from them without inventive effort.
Fig. 1 is a flowchart illustrating steps of a video subtitle time alignment method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an original video with subtitles according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a descriptive text provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a subtitle region image including wrongly written characters according to an embodiment of the present application;
FIG. 5 is a flowchart of steps provided in an alternative embodiment of the present application;
fig. 6 is a block diagram of a video subtitle time alignment system according to an embodiment of the present application.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It should be understood that the described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
To facilitate understanding of the present embodiment, a detailed description will be first given of a video subtitle time alignment method disclosed in the embodiments of the present application.
First, the application scenario of the embodiments of the present application is introduced: given a video and a description text whose content corresponds to the subtitles in the video, where the text contains wrongly written characters and has been divided into paragraphs of about 200 characters each, the goal is to automatically mark the duration of each paragraph of text in the video.
Referring to fig. 1, a flowchart of a video subtitle time alignment method provided by an embodiment of the present application is shown, where the method may include the following steps:
step 101, obtaining an original video with subtitles and a description text.
Wherein the content of the descriptive text corresponds to the content of the subtitles in the original video.
In the embodiment of the present application, an original video with subtitles and a description text are obtained, as shown in fig. 2 and fig. 3; the text content corresponds to the subtitles of the video, where LF denotes a line break displayed by the text editor. As shown in fig. 4, the text contains many wrongly written characters; for example, the character rendered as "mark" in the text should correspond to the character rendered as "peak" in the subtitles. Because of such wrongly written characters, matching with existing text-matching methods may fail.
Step 102, cropping the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set.
Step 103, inputting the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results.
Each image in the subtitle region image set carries its corresponding timestamp in the original video, and the preset frame-sampling interval may be one second.
In the embodiment of the application, a complete original video with subtitles is input; one frame is taken every second, the subtitle region of each frame is cropped and fed into OCR recognition, each OCR output is checked for containing Chinese with a confidence greater than 0.99, and all historical OCR results are saved for de-duplication.
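As a concrete illustration, the following minimal Python sketch samples one frame per second with OpenCV, crops an assumed subtitle band (the bottom 15% of the frame is an assumption, not a value fixed by this application), and filters the OCR output; run_ocr is a hypothetical stand-in for any OCR engine returning a (text, confidence) pair.

import re
import cv2  # OpenCV for video decoding

def contains_chinese(text):
    # True if the text contains at least one CJK character
    return re.search(r'[\u4e00-\u9fff]', text) is not None

def extract_subtitle_texts(video_path, run_ocr, conf_threshold=0.99):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if metadata is missing
    step = max(1, int(round(fps)))           # one sampled frame per second
    results, seen = [], set()
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            h = frame.shape[0]
            band = frame[int(h * 0.85):, :]  # assumed subtitle band: bottom 15%
            text, conf = run_ocr(band)
            timestamp = frame_idx / fps
            # keep only Chinese text above the confidence threshold, and
            # de-duplicate against all historical OCR results
            if text and conf > conf_threshold and contains_chinese(text) and text not in seen:
                seen.add(text)
                results.append((timestamp, text))
        frame_idx += 1
    cap.release()
    return results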
Step 104, matching the OCR recognition results with each paragraph of the description text through a common substring algorithm, and determining the first sentence and the last sentence of each paragraph among the OCR recognition results.
In the embodiment of the application, each OCR recognition result is matched against the current text paragraph; it must be checked whether the OCR result appears in the text, and its specific position must be determined.
The principle of the common substring algorithm is as follows: given a string A and a string B, each character of A is compared in turn with the characters of B to find all continuous common substrings. For example, for input A = 'ACBCDC' and B = 'ACGSBCDEF', the output is 'AC' and 'BCD'.
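To make this concrete, the following Python sketch finds all maximal continuous common substrings via a table of common suffix lengths, ordered by their position in A; keeping only substrings of at least two characters is an assumption made here so that the output matches the example above.

def common_substrings(a, b, min_len=2):
    found = []
    n, m = len(a), len(b)
    # dp[i][j] = length of the common suffix of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            run = dp[i][j]
            # keep only maximal runs: the match must not extend any further right
            if run >= min_len and (i == n or j == m or a[i] != b[j]):
                found.append((i - run, a[i - run:i]))
    found.sort()                      # order by start position in A
    out, seen = [], set()             # de-duplicate while preserving order
    for pos, s in found:
        if (pos, s) not in seen:
            seen.add((pos, s))
            out.append(s)
    return out

print(common_substrings('ACBCDC', 'ACGSBCDEF'))  # ['AC', 'BCD']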
In the embodiment of the application, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and the first substring, i.e., the first continuous common substring, is selected. When the first substring falls within the starting character range of the target paragraph, a character comparison is performed between the corresponding OCR recognition result and the characters in the starting character range; when the start position of the resulting substring is smaller than the first-sentence threshold, the currently compared OCR recognition result is taken as matching the first sentence of the target paragraph. Each paragraph of the description text is traversed, and the first sentence of each paragraph is determined among the OCR recognition results (a sketch of this check is given below).
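A minimal sketch of the first-sentence check, reusing the common_substrings helper above; the 25-character starting range and the threshold of 4 are taken from the alternative embodiment described later, and taking "first" to mean first by position in the OCR text is an assumption.

def matches_first_sentence(ocr_text, paragraph, window=25, threshold=4):
    # step 1: the first common substring must fall in the paragraph's start window
    subs = common_substrings(ocr_text, paragraph)
    if not subs:
        return False
    pos = paragraph.find(subs[0])
    if pos < 0 or pos >= window:
        return False
    # step 2: re-compare against the start window only; the first common
    # substring must begin within `threshold` characters of the window start
    head = paragraph[:window]
    head_subs = common_substrings(ocr_text, head)
    return bool(head_subs) and head.find(head_subs[0]) < threshold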
In an optional embodiment, after the OCR recognition result is compared with the target paragraph to find all continuous common substrings and the first substring is selected, if the first substring falls within the ending character range of the target paragraph, the timestamp of the corresponding OCR recognition result is taken as the start time of the paragraph following the target paragraph.
In the embodiment of the application, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and the tail substring, i.e., the last continuous common substring, is selected. When the tail substring falls within the ending character range of the target paragraph, a character comparison is performed between the corresponding OCR recognition result and the characters in the ending character range; when the distance between the end position of the resulting substring and the end of the paragraph is smaller than the last-sentence threshold, the currently compared OCR recognition result is taken as matching the last sentence of the target paragraph. Each paragraph of the description text is traversed, and the last sentence of each paragraph is determined among the OCR recognition results.
In an optional embodiment, after the OCR recognition result is compared with the target paragraph to find all continuous common substrings and the tail substring is selected, if the tail substring falls within the starting character range of the target paragraph, the timestamp of the corresponding OCR recognition result is taken as the end time of the paragraph preceding the target paragraph. A sketch of the symmetric last-sentence check follows.
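This sketch mirrors the first-sentence check; the 25-character ending range and the threshold of 4 are again taken from the alternative embodiment described later.

def matches_last_sentence(ocr_text, paragraph, window=25, threshold=4):
    # step 1: the last common substring must fall in the paragraph's end window
    subs = common_substrings(ocr_text, paragraph)
    if not subs:
        return False
    pos = paragraph.rfind(subs[-1])
    if pos < 0 or pos < len(paragraph) - window:
        return False
    # step 2: re-compare against the end window only; the last common substring
    # must end within `threshold` characters of the paragraph's end
    tail = paragraph[-window:]
    tail_subs = common_substrings(ocr_text, tail)
    if not tail_subs:
        return False
    end = tail.rfind(tail_subs[-1]) + len(tail_subs[-1])
    return len(tail) - end < threshold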
Step 105, determining the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
When the timestamp corresponding to the first sentence overlaps the timestamp corresponding to the last sentence, the duration obtained after merging the time ranges is taken as the output result.
As shown in fig. 5, the flow of a video subtitle matching method based on the common substring algorithm according to an alternative embodiment of the present application is given below, where the starting character range and the ending character range are both set to 25 characters, and the first-sentence threshold and the last-sentence threshold are both set to 4 characters:
(1) Input the complete video, take one frame per second, crop the subtitle region of each frame, and feed it into OCR recognition. Check whether each OCR output contains Chinese with a confidence greater than 0.99, and save all historical OCR results for de-duplication.
(2) Match the OCR recognition result against the current text paragraph: check whether the OCR result appears in the text and determine its specific position. A common substring algorithm is used here for matching.
(3) From the output of the common substring algorithm, take only the first and the last substring, and look up the positions of these two common substrings in the text. If neither falls within the first 25 characters or the last 25 characters of the text, the OCR result is considered useless and is discarded.
(4) If the first substring falls within the first 25 characters of the text, the OCR result is considered potentially useful. Compute the common substrings of the OCR result and those 25 characters, take the first common substring, and if its start position within the 25 characters is less than 4, the OCR result is considered to match the first sentence of the text.
(5) Similarly to (4), if the last substring falls within the last 25 characters of the text, the OCR result is considered potentially useful. Compute the common substrings of the OCR result and those 25 characters, take the last common substring, and if the distance from its end position to the end of the text is less than 4, the OCR result is considered to match the last sentence of the text.
(6) On the basis of (4) and (5), if the first sentence of a text paragraph is matched, record the current time as its start time; if its last sentence is matched, read the next paragraph. If the first sentence of the next paragraph is matched, the previous paragraph is considered finished, and the current time is recorded as both the end time of the previous paragraph and the start time of the current one.
(7) Finally, as post-processing, merge completely repeated contents and merge the multiple time ranges corresponding to the same text to obtain the final output result (a sketch of this driver loop and post-processing follows the list).
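Under the same assumptions as the earlier sketches, steps (6) and (7) might be combined into a driver loop like the following, which reuses the matchers defined above and merges duplicate or overlapping time ranges in post-processing; all names here are illustrative rather than taken from the patent.

def merge_ranges(ranges):
    # merge duplicate and overlapping (start, end) ranges into one duration each
    merged = []
    for start, end in sorted(set(ranges)):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def align_paragraphs(ocr_results, paragraphs):
    # ocr_results: list of (timestamp, text); paragraphs: ordered description text
    durations = {}                       # paragraph index -> list of (start, end)
    idx, start = 0, None
    for ts, text in ocr_results:
        if idx >= len(paragraphs):
            break
        if start is None and matches_first_sentence(text, paragraphs[idx]):
            start = ts                   # first sentence matched: paragraph begins
        next_starts = (idx + 1 < len(paragraphs)
                       and matches_first_sentence(text, paragraphs[idx + 1]))
        if start is not None and (matches_last_sentence(text, paragraphs[idx]) or next_starts):
            # last sentence matched, or the next paragraph has begun: close this one
            durations.setdefault(idx, []).append((start, ts))
            idx += 1
            start = ts if next_starts else None
    return {i: merge_ranges(rs) for i, rs in durations.items()}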
In conclusion, the method accomplishes the video subtitle time alignment task based on a common substring algorithm; it has high robustness and handles well the cases where OCR is disturbed by the video background, wrongly written characters, and uncommon words.
Referring to fig. 6, a block diagram of a video subtitle time alignment system 200 according to an embodiment of the present application is shown. As shown in fig. 6, the system 200 may include: an acquisition module 201, a cropping module 202, a recognition module 203, a matching module 204, and a determination module 205.
The acquisition module 201 is configured to acquire an original video with subtitles and a description text, where the content of the description text corresponds to the content of the subtitles in the original video;
the cropping module 202 is configured to crop the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set, where each image in the set carries its corresponding timestamp in the original video;
the recognition module 203 is configured to input the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results;
the matching module 204 is configured to match the OCR recognition results with each paragraph of the description text through a common substring algorithm to determine the first sentence and the last sentence of each paragraph among the OCR recognition results;
and the determination module 205 is configured to determine the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
For specific limitations of the video subtitle time alignment system, reference may be made to the above limitations of the video subtitle time alignment method, which are not repeated here. The modules in the video subtitle time alignment system may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor of the computer device, or stored in software in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the steps of the above video subtitle time alignment method.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for video subtitle time alignment, the method comprising:
acquiring an original video with subtitles and a description text, wherein the content of the description text corresponds to the content of the subtitles in the original video;
cropping the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set, wherein each image in the set carries its corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results;
matching the OCR recognition results with each paragraph of the description text through a common substring algorithm, and determining the first sentence and the last sentence of each paragraph among the OCR recognition results;
and determining the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
2. The method of claim 1, wherein matching the OCR recognition results with each paragraph of the description text through a common substring algorithm and determining the first sentence of each paragraph among the OCR recognition results comprises:
comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings, and selecting the first substring from them, wherein the first substring denotes the first continuous common substring;
when the first substring falls within the starting character range of the target paragraph, performing a character comparison between the OCR recognition result corresponding to the first substring and the characters in the starting character range;
when the start position of the substring obtained by the character comparison is smaller than the first-sentence threshold, taking the currently compared OCR recognition result as matching the first sentence of the target paragraph;
and traversing each paragraph of the description text to determine the first sentence of each paragraph among the OCR recognition results.
3. The method of claim 2, wherein after comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings and selecting the first substring, the method further comprises:
when the first substring falls within the ending character range of the target paragraph, taking the timestamp of the OCR recognition result corresponding to the first substring as the start time of the paragraph following the target paragraph.
4. The method of claim 1, wherein matching the OCR recognition results with each paragraph of the description text through a common substring algorithm and determining the last sentence of each paragraph among the OCR recognition results comprises:
comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings, and selecting the tail substring from them, wherein the tail substring denotes the last continuous common substring;
when the tail substring falls within the ending character range of the target paragraph, performing a character comparison between the OCR recognition result corresponding to the tail substring and the characters in the ending character range;
when the distance between the end position of the substring obtained by the character comparison and the end of the paragraph is smaller than the last-sentence threshold, taking the currently compared OCR recognition result as matching the last sentence of the target paragraph;
and traversing each paragraph of the description text to determine the last sentence of each paragraph among the OCR recognition results.
5. The method of claim 4, wherein after comparing the OCR recognition result with the target paragraph character by character to find all continuous common substrings and selecting the tail substring, the method further comprises:
when the tail substring falls within the starting character range of the target paragraph, taking the timestamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
6. The method of claim 1, wherein inputting the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results comprises:
checking the OCR recognition results, and keeping for matching and storage only those results that contain Chinese and whose confidence is greater than a preset threshold.
7. The method of claim 1, wherein determining the duration of the description text in the original video according to the timestamps corresponding to its first sentence and last sentence further comprises:
when the timestamp corresponding to the first sentence overlaps the timestamp corresponding to the last sentence, taking the duration obtained after merging the time ranges as the output result.
8. The method according to claim 1, wherein the description text includes wrongly written words and/or uncommon words.
9. A video subtitle time alignment system, the system comprising:
an acquisition module, configured to acquire an original video with subtitles and a description text, wherein the content of the description text corresponds to the content of the subtitles in the original video;
a cropping module, configured to crop the subtitle region of the original video at a preset frame-sampling interval to obtain a subtitle region image set, wherein each image in the set carries its corresponding timestamp in the original video;
a recognition module, configured to input the subtitle region image set into an OCR model for recognition to obtain time-stamped OCR recognition results;
a matching module, configured to match the OCR recognition results with each paragraph of the description text through a common substring algorithm to determine the first sentence and the last sentence of each paragraph among the OCR recognition results;
and a determination module, configured to determine the duration of each paragraph of the description text in the original video according to the timestamps corresponding to the first sentence and the last sentence of each paragraph.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the video subtitle time alignment method according to any one of claims 1 to 8.
CN202111470116.8A 2021-12-03 2021-12-03 Video subtitle time alignment method, system and storage medium Active CN114143613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470116.8A CN114143613B (en) 2021-12-03 2021-12-03 Video subtitle time alignment method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470116.8A CN114143613B (en) 2021-12-03 2021-12-03 Video subtitle time alignment method, system and storage medium

Publications (2)

Publication Number Publication Date
CN114143613A (en) 2022-03-04
CN114143613B CN114143613B (en) 2023-07-21

Family

ID=80387606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470116.8A Active CN114143613B (en) 2021-12-03 2021-12-03 Video subtitle time alignment method, system and storage medium

Country Status (1)

Country Link
CN (1) CN114143613B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833638A (en) * 2012-07-26 2012-12-19 北京数视宇通技术有限公司 Automatic video segmentation and annotation method and system based on caption information
CN104978961A (en) * 2015-05-25 2015-10-14 腾讯科技(深圳)有限公司 Audio processing method, device and terminal
CN104980790A (en) * 2015-06-30 2015-10-14 北京奇艺世纪科技有限公司 Voice subtitle generating method and apparatus, and playing method and apparatus
CN108268539A (en) * 2016-12-31 2018-07-10 上海交通大学 Video matching system based on text analyzing
CN109803173A (en) * 2017-11-16 2019-05-24 腾讯科技(深圳)有限公司 A kind of video transcoding method, device and storage equipment
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN113052169A (en) * 2021-03-15 2021-06-29 北京小米移动软件有限公司 Video subtitle recognition method, device, medium, and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu-Cheng Yin, Ze-Yu Zuo, et al.: "Text Detection, Tracking and Recognition in Video: A Comprehensive Survey", IEEE Transactions on Image Processing, p. 2752 *
Li Juan: "Text Detection and Extraction Algorithms for Images and Video", China Excellent Master's Theses Full-text Database (Electronic Journal) *

Also Published As

Publication number Publication date
CN114143613B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN112995754B (en) Subtitle quality detection method and device, computer equipment and storage medium
CN113052169A (en) Video subtitle recognition method, device, medium, and electronic device
US11776248B2 (en) Systems and methods for automated document image orientation correction
CN111813998B (en) Video data processing method, device, equipment and storage medium
CN108989875B (en) Method and device for generating bullet screen file
CN112995749A (en) Method, device and equipment for processing video subtitles and storage medium
JP2008077454A (en) Title extraction device, image reading device, title extraction method, and title extraction program
CN111222409A (en) Vehicle brand labeling method, device and system
US20220189174A1 (en) A method and system for matching clips with videos via media analysis
CN109919017B (en) Face recognition optimization method, device, computer equipment and storage medium
CN111368061B (en) Short text filtering method, device, medium and computer equipment
CN109656474B (en) Data storage method and device, computer equipment and storage medium
CN114143613B (en) Video subtitle time alignment method, system and storage medium
CN109657210B (en) Text accuracy rate calculation method and device based on semantic analysis and computer equipment
CN115686455A (en) Application development method, device and equipment based on spreadsheet and storage medium
CN114298060A (en) Subtitle translation quality detection method, device, equipment and medium
CN114222193B (en) Video subtitle time alignment model training method and system
CN113449655A (en) Method and device for recognizing cover image, storage medium and recognition equipment
CN109525890B (en) MV subtitle transplanting method and device based on subtitle recognition
CN112463791A (en) Nuclear power station document data acquisition method and device, computer equipment and storage medium
CN108133214B (en) Information search method based on picture correction and mobile terminal
CN112417847A (en) News content safety monitoring method, system, device and storage medium
CN109710904B (en) Text accuracy rate calculation method and device based on semantic analysis and computer equipment
CN110717091B (en) Entry data expansion method and device based on face recognition
CN102591852A (en) Automatic typesetting method and automatic typesetting system for patent images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A video subtitle time alignment method, system, and storage medium

Effective date of registration: 20231113

Granted publication date: 20230721

Pledgee: Shanghai Pudong Development Bank Co.,Ltd. Xuhui sub branch

Pledgor: Beijing Moviebook Science and Technology Co., Ltd.; Beijing Qingmou Management Consulting Co., Ltd.; Shanghai Yingpu Technology Co., Ltd.

Registration number: Y2023310000727
