CN114222193A - Video subtitle time alignment model training method and system - Google Patents

Video subtitle time alignment model training method and system

Info

Publication number
CN114222193A
CN114222193A (application CN202111470819.0A)
Authority
CN
China
Prior art keywords: recognition result, OCR recognition, paragraph, text, description text
Prior art date
Legal status
Granted
Application number
CN202111470819.0A
Other languages
Chinese (zh)
Other versions
CN114222193B (en)
Inventor
程梓益
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd
Priority to CN202111470819.0A
Publication of CN114222193A
Application granted
Publication of CN114222193B
Current legal status: Active
Anticipated expiration

Classifications

    • H04N 21/4884: Data services, e.g. news ticker, for displaying subtitles
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and system for training a video subtitle time alignment model. The method first obtains an original video set with subtitles and a corresponding set of description texts. The original videos are then matched against their description texts one by one using a common substring algorithm to determine the OCR recognition result corresponding to each paragraph of the description texts. A data set is formed from each text paragraph and its corresponding OCR recognition result and is labeled to obtain a training set. Finally, a video subtitle time alignment model based on text semantic similarity matching is constructed and trained with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the application thus solves the video subtitle time-matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.

Description

Video subtitle time alignment model training method and system
Technical Field
The invention relates to the technical field of multimedia, in particular to a method and a system for training a video subtitle time alignment model.
Background
With the continuous development of internet and multimedia technology, video has become a popular information carrier. To present video content better, subtitles are usually displayed while a user watches a video, and a description text corresponding to the video subtitles usually exists as well; this description text is generally divided into several, or even more than ten, paragraphs.
In the prior art, when the paragraphs of a description text are time-matched with the video subtitles, a common approach is to recognize the characters of the current video frame with OCR, record the current time, and then match the recognized text with the corresponding paragraph. However, because of wrongly written characters, uncommon characters and interference from the video background, this common approach cannot complete the task automatically.
Disclosure of Invention
Based on this, the embodiment of the application provides a video subtitle time alignment model training method and system, which can improve the accuracy of time matching between a video subtitle and a description text.
In a first aspect, a method for training a video subtitle time alignment model is provided, where the method includes:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
Optionally, training the video subtitle time alignment model by using the training set to obtain a trained video subtitle time alignment model, including:
inputting each text segment and an OCR recognition result into the deep language model respectively, and processing to obtain a first text vector and a second text vector;
and splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
Optionally, the deep language model includes at least a BERT-Chinese model or an ERNIE model.
Optionally, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set, including:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
Optionally, matching the OCR recognition result with each paragraph of the description text by using a common substring algorithm, and determining a first sentence of each paragraph of the OCR recognition result, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the starting position of the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by the current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the first substring, the method further includes:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
Optionally, matching the OCR recognition result with each paragraph of the description text by a common substring algorithm, and determining a tail sentence of the OCR recognition result in each paragraph, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the distance between the end position of the substring obtained by character comparison and the end of the target paragraph is smaller than the tail sentence threshold value, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the tail substring, the method further includes:
and when the tail substring is in the range of the initial characters in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
Optionally, the description text includes wrongly written words and/or uncommon words.
In a second aspect, a video subtitle time alignment model training system is provided, which includes:
the acquisition module is used for acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
In the technical solution provided by the embodiments of the present application, an original video set with subtitles and a description text set are first obtained, where the original video set includes a plurality of original videos, each original video corresponds to one description text in the description text set, and the subtitle content of each original video corresponds to the content of its description text. The original video set is then matched against the corresponding description text set one by one through a common substring algorithm to determine the OCR recognition result corresponding to each paragraph of the description texts. A data set is formed from each text paragraph and its corresponding OCR recognition result and is labeled to obtain a training set. Finally, a video subtitle time alignment model based on text semantic similarity matching is constructed and trained with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the present application therefore solves the video subtitle time-matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be apparent that the drawings in the following description are merely exemplary, and that other drawings can be derived from the provided drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart illustrating steps of a method for training a video subtitle time alignment model according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a result of a common substring algorithm provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a video subtitle area according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating tagging of a data set according to an embodiment of the present application;
fig. 5 is a diagram of a BERT model structure provided in an embodiment of the present application;
fig. 6 is a diagram of a video subtitle time alignment model structure according to an embodiment of the present application;
fig. 7 is a block diagram of a video subtitle time alignment model training system according to an embodiment of the present application.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It should be understood that the described embodiments are merely a part of the embodiments of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In actual tests, aligning time by matching with the existing common substring algorithm achieves only about 50% accuracy (inputting 100 videos ultimately yields only 50 usable results) because the texts contain too many wrongly written characters; the present application therefore provides an improved method that can, in principle, obtain a better result.
In this technical solution, a batch of videos and their corresponding description texts are first processed with the common substring method, the results are manually reviewed and failed cases are removed, and each text paragraph together with the video OCR content corresponding to it is then used as a data set to train a text similarity model.
To facilitate understanding of the present embodiment, a detailed description is first given of a training method for a video subtitle time alignment model disclosed in the embodiments of the present application. Referring to fig. 1, a flowchart of a method for training a video subtitle time alignment model according to an embodiment of the present application is shown, where the method may include the following steps:
step 101, obtaining an original video set with subtitles and a description text set.
The original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding descriptive text in the set of descriptive text.
In this embodiment, a batch of videos (more than 100) and their corresponding description texts are input. The text content corresponds to the subtitles in the videos, but the texts contain wrongly written characters, for example a "mark" in a description text that should correspond to a "peak" in the subtitles, and matching with an existing text matching method may fail because of such characters. Each text is divided into paragraphs according to punctuation marks, and the number of characters in each paragraph is not fixed.
Step 102, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining an OCR (optical character recognition) recognition result corresponding to each paragraph in the description text set.
The OCR recognition result is used to represent the subtitle content in the original video set. The batch of videos is processed with the common substring method, and after failed cases are removed by manual review, the duration corresponding to each text paragraph is obtained, as shown in fig. 2.
Specifically, each original video and the corresponding description text are matched through a common substring algorithm to obtain OCR recognition results of each paragraph in the description text as follows:
step 1021, acquiring the original video and the corresponding description text.
Wherein the content of the description text corresponds to the content of the subtitles in the original video;
and step 1022, intercepting the subtitle region in the original video according to the preset frame taking interval time to obtain a subtitle region image set.
Step 1023, inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp.
The subtitle region image set includes a corresponding timestamp in the original video, and the preset frame taking interval time may be one second.
In this embodiment, a complete original video with subtitles is input and one frame is taken every second; as shown in fig. 3, the subtitle area of each frame is cropped and fed into OCR recognition. Each OCR output is checked for whether it contains Chinese and whether its confidence is greater than 0.99, and all historical OCR results are saved for deduplication.
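A minimal sketch of this frame-sampling and filtering step is given below; the use of OpenCV, the fixed crop ratio and the ocr_recognize() placeholder are illustrative assumptions, not details taken from the patent:

import re
import cv2  # OpenCV, used here only to decode video frames

def ocr_recognize(image):
    """Hypothetical OCR hook; plug in any engine that returns (text, confidence)."""
    raise NotImplementedError

def extract_subtitle_ocr(video_path, crop_top=0.8):
    """Sample one frame per second, crop the subtitle strip, keep confident Chinese OCR results."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25
    results, seen = [], set()
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % fps == 0:                          # one frame every second
            h = frame.shape[0]
            strip = frame[int(h * crop_top):, :]          # assumed subtitle region at the bottom
            text, conf = ocr_recognize(strip)
            # keep results containing Chinese with confidence > 0.99, deduplicated
            if conf > 0.99 and re.search(r"[\u4e00-\u9fff]", text) and text not in seen:
                seen.add(text)
                results.append((frame_idx / fps, text))   # (timestamp in seconds, text)
        frame_idx += 1
    cap.release()
    return results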
Step 1024, matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining the head sentence and the tail sentence of the OCR recognition result in each paragraph.
In this embodiment, the OCR recognition result is matched against the current text paragraph; it is necessary to check whether the OCR result appears in the text and to determine its specific position.
The common substring algorithm works as follows: given a character string A and a character string B, each character in A is compared with the characters in B in turn to find all continuous common substrings. For example, with input A = 'ACCCDC' and B = 'ACGSBCDEF', the output is 'AC' and 'BCD'.
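The following is a minimal sketch of such a common substring search; the function name, the min_len parameter and the example strings in the final comment are illustrative assumptions rather than the exact routine used in this application:

def common_substrings(a, b, min_len=2):
    """Return every maximal run of characters shared by a and b, in the order they occur in a."""
    found = []
    i = 0
    while i < len(a):
        best = ""
        # longest substring of a starting at position i that also occurs somewhere in b
        for j in range(len(a), i, -1):
            if a[i:j] in b:
                best = a[i:j]
                break
        if len(best) >= min_len:
            found.append(best)
            i += len(best)
        else:
            i += 1
    return found

# illustrative call: common_substrings("ACXBCDY", "ACGSBCDEF") returns ['AC', 'BCD']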
In this embodiment, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and the head substring, i.e. the first continuous common substring, is selected. When the head substring lies within the starting character range of the target paragraph, the OCR recognition result corresponding to the head substring is compared character by character with the characters in that starting range; when the starting position of the substring obtained by this comparison is smaller than the first sentence threshold, the OCR recognition result is taken as matching the first sentence of the target paragraph. Each paragraph of the description text is traversed in this way to determine the first sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the head substring has been selected, if the head substring lies within the ending character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the head substring is used as the start time of the paragraph following the target paragraph.
In this embodiment, the OCR recognition result is likewise compared character by character with the target paragraph to find all continuous common substrings, and the tail substring, i.e. the last continuous common substring, is selected. When the tail substring lies within the ending character range of the target paragraph, the OCR recognition result corresponding to the tail substring is compared character by character with the characters in that ending range; when the distance between the end position of the substring obtained by this comparison and the end of the paragraph is smaller than the tail sentence threshold, the OCR recognition result is taken as matching the tail sentence of the target paragraph. Each paragraph of the description text is traversed in this way to determine the tail sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the tail substring has been selected, if the tail substring lies within the starting character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the tail substring is used as the end time of the paragraph preceding the target paragraph.
Step 1025, determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
When the time range corresponding to the first sentence overlaps the time range corresponding to the last sentence, the merged time range is taken as the output result.
The flow of a video subtitle matching method based on the common substring algorithm according to an alternative embodiment of the present application is given below. In this configuration the starting character range and the ending character range are both 25 characters, and the first sentence threshold and the tail sentence threshold are both 4 characters (a code sketch of the head-sentence check follows this list):
(1) A complete video is input, one frame is taken every second, the subtitle area of each frame is cropped and fed into OCR recognition; each OCR output is checked for whether it contains Chinese and whether its confidence is greater than 0.99, and all historical OCR results are saved for deduplication.
(2) The OCR recognition result is matched against the current text paragraph: it is necessary to check whether the OCR result appears in the text and to determine its specific position. The common substring algorithm is used here for matching.
(3) For the output of the common substring algorithm, only the first and the last substrings are taken, and the positions of these two common substrings in the text are searched; if neither lies within the first 25 characters or the last 25 characters of the text, the OCR result is considered useless and is discarded.
(4) If the first substring lies within the first 25 characters of the text, the OCR result is considered possibly useful; a further common substring comparison is performed between the OCR result and the first 25 characters of the text, the first common substring is taken, and if its starting position within those 25 characters is less than 4, the OCR result is considered to match the first sentence of the text.
(5) Similarly to (4), if the last substring lies within the last 25 characters of the text, the OCR result is considered possibly useful; a further common substring comparison is performed between the OCR result and the last 25 characters of the text, the last common substring is taken, and if the distance from its end position to the end of the text is less than 4, the OCR result is considered to match the last sentence of the text.
(6) On the basis of (4) and (5), if the first sentence of a text is matched, the current time is recorded as its start time; if the last sentence of the text is matched, the next text is read. If the beginning of the second text is matched, the previous text is considered finished, and the current time is recorded as the end time of the previous text and also as the start time of the current text.
(7) The final result is post-processed: completely repeated contents are merged, and the multiple time ranges corresponding to the same text are merged to obtain the output result.
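A minimal sketch of the head-sentence check in steps (3) and (4) above is given here; the function name and the reuse of the common_substrings() helper from the earlier sketch are illustrative assumptions, with the 25-character window and 4-character threshold of this embodiment:

START_WINDOW = 25    # starting character range of a paragraph
HEAD_THRESHOLD = 4   # first-sentence position threshold

def matches_paragraph_head(ocr_text, paragraph):
    subs = common_substrings(ocr_text, paragraph)
    if not subs:
        return False
    # step (3): the first common substring must fall inside the opening window
    if paragraph.find(subs[0]) >= START_WINDOW:
        return False
    # step (4): compare again against the opening window only
    head = paragraph[:START_WINDOW]
    head_subs = common_substrings(ocr_text, head)
    # accept when the match starts within the first few characters of the window
    return bool(head_subs) and head.find(head_subs[0]) < HEAD_THRESHOLD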
And 103, forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set.
In this embodiment, the specific labeling process is as follows: each text paragraph and the video OCR content corresponding to it are made into a semantic similarity data set with three columns, where the first column is the original text paragraph, the second column is the OCR recognition result, and the third column is the label, 0 or 1, with 0 meaning dissimilar and 1 meaning similar; the columns are separated by tab characters.
Concretely, for an original text paragraph, all OCR recognition results within its duration are concatenated as one sentence and placed in the second column with label 1 in the third column; the two text paragraphs immediately before and after that duration are paired with the same OCR results and labeled 0. For each text paragraph, the labeled data shown in fig. 4 are thus obtained.
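The labeling step can be sketched as follows; make_pairs(), write_tsv() and the shape of their inputs are illustrative assumptions for building the tab-separated, three-column data set described above:

def make_pairs(paragraphs, ocr_per_paragraph):
    """paragraphs: original text paragraphs; ocr_per_paragraph: OCR text gathered within each paragraph's duration."""
    rows = []
    for idx, ocr_text in enumerate(ocr_per_paragraph):
        rows.append((paragraphs[idx], ocr_text, 1))        # OCR within the duration: similar
        for j in (idx - 1, idx + 1):                       # neighboring paragraphs: dissimilar
            if 0 <= j < len(paragraphs):
                rows.append((paragraphs[j], ocr_text, 0))
    return rows

def write_tsv(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        for text, ocr, label in rows:
            f.write(f"{text}\t{ocr}\t{label}\n")           # column 1: paragraph, 2: OCR, 3: label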
Step 104, constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain a trained video subtitle time alignment model.
In this embodiment, once the data set is available it can be used to train the model; any Chinese pre-trained deep language model can be used, such as BERT-Chinese or ERNIE. Taking BERT as an example, the model structure is shown in fig. 5.
Here [CLS] marks the beginning of each sentence, Tok1...TokN represent the characters of the sentence, and E1...En are the encoding vectors of the characters; each encoding vector is the sum of a character encoding and a position encoding, where the character encoding can use existing word2vec vectors or can simply be initialized to random vectors.
The position encoding PE is calculated as:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos is the position index (for example, 0 ≤ pos ≤ 255 if the input text length is 256), d_model is the vector dimension of the model, and i is a dimension index (for example, 0 ≤ i ≤ 511 when d_model = 512).
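A short numeric sketch of this position encoding, assuming the standard Transformer sinusoidal formulation that the pos, i and d_model definitions above describe:

import numpy as np

def position_encoding(max_len, d_model):
    """Sinusoidal position encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                            # position index 0..max_len-1
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

pe = position_encoding(256, 512)   # e.g. input length 256, model dimension 512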
In the BERT output, C is the vector representing the whole sentence and T1...Tn are the vectors corresponding to Tok1...TokN; here only the sentence vector C, which represents the whole sentence, is needed.
Wherein the process of each round of training specifically comprises:
inputting each text paragraph and the corresponding OCR recognition result into the deep language model respectively, and processing them to obtain a first text vector and a second text vector; splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
Specifically, as shown in fig. 6, sentence A and sentence B are input to obtain vector u for sentence A and vector v for sentence B; the two vectors are spliced and fed into a classifier, which is a multilayer perceptron that takes the spliced sentence vector as input and outputs 0 or 1. The model is trained by back propagation until the final output is consistent with the labeled result. A model trained in this way can replace the existing common substring algorithm and obtains better accuracy on video subtitle time alignment.
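A hedged sketch of such a sentence-pair model is given below; the Hugging Face bert-base-chinese checkpoint, the use of the pooled [CLS] vector and the MLP sizes are illustrative assumptions, not details fixed by the patent:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SubtitleAlignModel(nn.Module):
    def __init__(self, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        self.mlp = nn.Sequential(               # classifier over the concatenated sentence pair
            nn.Linear(hidden * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, inputs_a, inputs_b):
        u = self.bert(**inputs_a).pooler_output  # sentence-A vector
        v = self.bert(**inputs_b).pooler_output  # sentence-B vector
        return self.mlp(torch.cat([u, v], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = SubtitleAlignModel()
a = tokenizer("段落文本", return_tensors="pt")
b = tokenizer("OCR识别结果", return_tensors="pt")
logits = model(a, b)   # train with cross-entropy against the 0/1 labels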
Referring to fig. 7, a block diagram of a video subtitle time alignment model training system 200 according to an embodiment of the present application is shown. As shown in fig. 7, the system 200 may include: the system comprises an acquisition module 201, a determination module 202, a construction module 203 and a training module 204.
An obtaining module 201, configured to obtain an original video set with subtitles and a description text set, where the original video set includes multiple original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module 202 is configured to match the original video set with the corresponding description text set sequentially through a common substring algorithm, and determine an OCR recognition result corresponding to each paragraph in the description text set; OCR recognition results are used for representing the subtitle content in the original video set;
the building module 203 is configured to form a data set according to each segment of text and an OCR recognition result corresponding to the segment of text, and label the data set to obtain a training set;
the training module 204 is configured to construct a video subtitle time alignment model based on text semantic similarity matching, and train the video subtitle time alignment model by using a training set to obtain a trained video subtitle time alignment model.
For specific limitations of the video subtitle time alignment model training system, reference may be made to the above limitations of the video subtitle time alignment model training method, and details are not repeated here. All or part of the modules in the video subtitle time alignment model training system can be realized by software, by hardware, or by a combination of both. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for training a video subtitle time alignment model, the method comprising:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
2. The method of claim 1, wherein training the video subtitle time alignment model using the training set to obtain a trained video subtitle time alignment model comprises:
inputting each text segment and an OCR recognition result into the deep language model respectively, and processing to obtain a first text vector and a second text vector;
and splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
3. The method of claim 2, wherein the deep language model comprises at least a BERT-Chinese model or an ERNIE model.
4. The method of claim 1, wherein the step of sequentially matching the original video set with the corresponding description text set by a common substring algorithm to determine the OCR recognition result corresponding to each paragraph in the description text set comprises:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
5. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a first sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the starting position of the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by the current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
6. The method of claim 5, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the first substring, further comprising:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
7. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a final sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the distance between the end position of the substring obtained by character comparison and the end of the target paragraph is smaller than the tail sentence threshold value, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
8. The method of claim 7, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the tail substring, further comprising:
and when the tail substring is in the range of the initial characters in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
9. The method according to claim 1, wherein the description text includes wrongly written words and/or uncommon words.
10. A video subtitle temporal alignment model training system, the system comprising:
the acquisition module is used for acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
CN202111470819.0A 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system Active CN114222193B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470819.0A CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470819.0A CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Publications (2)

Publication Number Publication Date
CN114222193A true CN114222193A (en) 2022-03-22
CN114222193B CN114222193B (en) 2024-01-05

Family

ID=80699646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470819.0A Active CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Country Status (1)

Country Link
CN (1) CN114222193B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060253439A1 (en) * 2005-05-09 2006-11-09 Liwei Ren Matching engine for querying relevant documents
CN105338419A (en) * 2015-10-29 2016-02-17 网易传媒科技(北京)有限公司 Subtitle collection generating method and apparatus
US20190037168A1 (en) * 2016-03-15 2019-01-31 Sony Corporation Transmission device, transmission method, reception device and reception method
CN106210840A (en) * 2016-06-29 2016-12-07 网易传媒科技(北京)有限公司 A kind of text display method and equipment
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN108259963A (en) * 2018-03-19 2018-07-06 成都星环科技有限公司 A kind of TV ends player
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
US20200322570A1 (en) * 2019-04-08 2020-10-08 Baidu.Com Times Technology (Beijing) Co., Ltd. Method and apparatus for aligning paragraph and video
US20200320308A1 (en) * 2019-04-08 2020-10-08 Nedelco, Incorporated Identifying and tracking words in a video recording of captioning session
CN111193878A (en) * 2020-01-03 2020-05-22 北京字节跳动网络技术有限公司 Multimedia text information processing method, device, medium and electronic equipment
CN111309200A (en) * 2020-01-17 2020-06-19 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for determining extended reading content
CN112084788A (en) * 2020-08-19 2020-12-15 北京影谱科技股份有限公司 Automatic marking method and system for implicit emotional tendency of image captions
CN111931775A (en) * 2020-09-28 2020-11-13 成都索贝数码科技股份有限公司 Method, system, computer device and storage medium for automatically acquiring news headlines
CN113033190A (en) * 2021-04-19 2021-06-25 北京有竹居网络技术有限公司 Subtitle generating method, device, medium and electronic equipment
CN113191133A (en) * 2021-04-21 2021-07-30 北京邮电大学 Audio text alignment method and system based on Doc2Vec
CN113159034A (en) * 2021-04-23 2021-07-23 杭州电子科技大学 Method and system for automatically generating subtitles by using short video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王新 (Wang Xin): "Similarity analysis method for teaching videos and exercises", China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN114222193B (en) 2024-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant