CN114222193A - Video subtitle time alignment model training method and system - Google Patents
Video subtitle time alignment model training method and system
- Publication number
- CN114222193A (application number CN202111470819.0A)
- Authority
- CN
- China
- Prior art keywords
- recognition result
- ocr recognition
- paragraph
- text
- description text
- Prior art date
- Legal status
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a method and a system for training a video subtitle time alignment model. The method comprises: first obtaining an original video set with subtitles and a corresponding description text set; matching the original video set with the corresponding description text set in sequence through a common substring algorithm to determine the OCR recognition result corresponding to each paragraph in the description text set; forming a data set from each text segment and the OCR recognition result corresponding to that segment, and labeling the data set to obtain a training set; and constructing a video subtitle time alignment model based on text semantic similarity matching and training it with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the application therefore solves the subtitle time matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.
Description
Technical Field
The invention relates to the technical field of multimedia, in particular to a method and a system for training a video subtitle time alignment model.
Background
With the continuous development of internet and multimedia technologies, video has become popular among many users as an information carrier. To better present video content, the subtitles corresponding to a video are generally displayed while the user watches it, and a description text corresponding to the video subtitles usually also exists; this description text, however, is generally divided into several, or even more than ten, text segments.
In the prior art, when the paragraphs of a description text are time-matched with video subtitles, a common approach is to recognize the characters of the current video frame with OCR, record the current time, and then match it with the corresponding text. Because of wrongly written characters, uncommon characters and video background interference, however, this approach cannot complete the task automatically.
Disclosure of Invention
Based on this, the embodiments of the application provide a video subtitle time alignment model training method and system, which can improve the accuracy of time matching between video subtitles and description text.
In a first aspect, a method for training a video subtitle time alignment model is provided, where the method includes:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
Optionally, training the video subtitle time alignment model by using the training set to obtain a trained video subtitle time alignment model, including:
inputting each text segment and the corresponding OCR recognition result into a deep language model respectively, and processing them to obtain a first text vector and a second text vector;
and concatenating the first text vector and the second text vector, inputting the concatenated vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output and the labeling result is smaller than a preset threshold.
Optionally, the deep language model includes at least a BERT-Chinese model or an ERNIE model.
Optionally, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set, including:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
Optionally, matching the OCR recognition result with each paragraph of the description text by using a common substring algorithm, and determining a first sentence of each paragraph of the OCR recognition result, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the first substring, the method further includes:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
Optionally, matching the OCR recognition result with each paragraph of the description text by a common substring algorithm, and determining a tail sentence of the OCR recognition result in each paragraph, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the substring obtained by the character comparison is smaller than the tail sentence threshold, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the tail substring, the method further includes:
and when the tail substring is in the range of the initial characters in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of a section on the target paragraph.
Optionally, the description text includes wrongly written words and/or uncommon words.
In a second aspect, a video subtitle time alignment model training system is provided, which includes:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an original video set with subtitles and a description text set, the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
In the technical scheme provided by the embodiments of the application, an original video set with subtitles and a description text set are first obtained, wherein the original video set comprises a plurality of original videos, each original video corresponds to one description text in the description text set, and the content of the subtitles in each original video corresponds to the content of the corresponding description text. The original video set is then matched with the corresponding description text set in sequence through a common substring algorithm to determine the OCR recognition result corresponding to each paragraph in the description text set; a data set is formed from each text segment and the OCR recognition result corresponding to that segment, and the data set is labeled to obtain a training set; finally, a video subtitle time alignment model based on text semantic similarity matching is constructed and trained with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the application therefore solves the subtitle time matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be apparent that the drawings in the following description are merely exemplary, and that other drawings can be obtained from them by those of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart illustrating steps of a method for training a video subtitle time alignment model according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a result of a common substring algorithm provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a video subtitle area according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating tagging of a data set according to an embodiment of the present application;
fig. 5 is a diagram of a BERT model structure provided in an embodiment of the present application;
fig. 6 is a diagram of a video subtitle time alignment model structure according to an embodiment of the present application;
fig. 7 is a block diagram of a video subtitle time alignment model training system according to an embodiment of the present application.
Detailed Description
The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It should be understood that the described embodiments are merely a part of the embodiments of the invention and are not intended to limit the invention to the particular forms disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Because too many wrongly written characters exist in the description texts, the existing method of aligning time by matching with a common substring algorithm only achieves about 50% accuracy in actual tests (when 100 videos are input, only 50 finally yield results). The application therefore provides an improved method that can, in principle, achieve a better result.
According to the technical scheme, a batch of videos and their corresponding description texts are first processed with the common substring method, the results are then reviewed manually and the failed cases are removed, and each text segment together with the video OCR content corresponding to that segment is used as a data set to train a text similarity model.
To facilitate understanding of the present embodiment, a detailed description is first given of a training method for a video subtitle time alignment model disclosed in the embodiments of the present application. Referring to fig. 1, a flowchart of a method for training a video subtitle time alignment model according to an embodiment of the present application is shown, where the method may include the following steps:
Step 101, acquiring an original video set with subtitles and a description text set. The original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in each original video corresponds to the content of the corresponding description text in the description text set.
In the embodiment of the application, a batch of videos (more than 100) and their corresponding description texts are input. The text content corresponds to the subtitles in the videos, but wrong characters exist in the texts, such as "marks" in a description text where the subtitles read "peaks"; when matching is performed with existing text matching methods, it may fail because of these wrong characters. The text is divided into segments according to punctuation marks, and the number of characters in each segment is not fixed.
Step 102, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining the OCR (optical character recognition) result corresponding to each paragraph in the description text set.
The OCR recognition result is used to represent the subtitle content in the original video set. The batch of videos is processed with the common substring method, and after the failed cases are removed through manual review, the duration corresponding to each text segment is obtained, as shown in fig. 2.
Specifically, each original video and its corresponding description text are matched through the common substring algorithm, and the OCR recognition result of each paragraph in the description text is obtained as follows:
step 1021, acquiring the original video and the corresponding description text.
Wherein the content of the description text corresponds to the content of the subtitles in the original video;
and step 1022, intercepting the subtitle region in the original video according to the preset frame taking interval time to obtain a subtitle region image set.
Step 1023, inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with time stamps.
The subtitle region image set includes a corresponding timestamp in the original video, and the preset frame taking interval time may be one second.
In the embodiment of the application, a complete original video with subtitles is input and one frame is taken every second, as shown in fig. 3. The subtitle area of each frame is cropped and fed into OCR recognition; each OCR output is checked for whether it contains Chinese and whether its confidence is greater than 0.99, and all historical OCR results are saved for deduplication.
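A minimal sketch of this frame sampling and filtering step is given below, assuming OpenCV for frame grabbing; the run_ocr() helper, the bottom-strip crop box and the parameter names are hypothetical stand-ins, since the patent does not name a specific OCR engine or crop geometry.

```python
import cv2

def run_ocr(image):
    """Hypothetical stand-in: return (text, confidence) for a cropped subtitle image."""
    raise NotImplementedError("plug in the OCR engine of your choice here")

def extract_subtitle_ocr(video_path, crop_top=0.85, min_conf=0.99):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0           # fall back if FPS metadata is missing
    step = int(round(fps))                            # take one frame per second
    results, seen, frame_idx = [], set(), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            h = frame.shape[0]
            subtitle_area = frame[int(h * crop_top):h, :]   # bottom strip as the subtitle region
            text, conf = run_ocr(subtitle_area)
            timestamp = frame_idx / fps
            # keep only confident results that contain Chinese, and de-duplicate against history
            if conf > min_conf and any('\u4e00' <= ch <= '\u9fff' for ch in text) and text not in seen:
                seen.add(text)
                results.append((timestamp, text))
        frame_idx += 1
    cap.release()
    return results                                    # list of (timestamp in seconds, OCR text)
```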
Step 1024, matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining the head sentence and the tail sentence of the OCR recognition result in each paragraph.
In the embodiment of the application, the OCR recognition result is matched with the current text paragraph; it is necessary to check whether the OCR result appears in the text and to determine its specific position.
The principle of the common substring algorithm is as follows: given an input character string A and a character string B, each character in A is compared with the characters in B in turn to find all continuous common substrings. For example, for input A = 'ACBCDC' and B = 'ACGSBCDEF', the output is 'AC' and 'BCD'.
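A short sketch of this idea follows. The patent does not spell out the exact search procedure, so this greedy variant (longest match at each position, minimum length 2) is an assumption that reproduces the example above.

```python
def common_substrings(a: str, b: str, min_len: int = 2):
    """Find continuous substrings of `a` that also occur in `b`, scanning `a` left to right."""
    found, i = [], 0
    while i < len(a):
        best = ""
        for j in range(len(a), i, -1):        # try the longest candidate starting at i first
            if j - i >= min_len and a[i:j] in b:
                best = a[i:j]
                break
        if best:
            found.append(best)
            i += len(best)                    # continue after the matched block
        else:
            i += 1
    return found

# common_substrings("ACBCDC", "ACGSBCDEF") -> ['AC', 'BCD']
```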
In the embodiment of the application, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and a first substring is selected, the first substring being used to represent the first continuous common substring. When the first substring is within the starting character range of the target paragraph, the OCR recognition result corresponding to the first substring is compared with the characters in the starting character range; when the substring obtained by this character comparison is smaller than the first sentence threshold, the substring obtained by the current character comparison is taken as the first sentence in the target paragraph. Each paragraph of the description text is traversed in this way to determine the first sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the first substring has been selected, when the first substring is within the ending character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the first substring is used as the start time of the next paragraph following the target paragraph.
In the embodiment of the application, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and a tail substring is selected, the tail substring being used to represent the last continuous common substring. When the tail substring is within the ending character range of the target paragraph, the OCR recognition result corresponding to the tail substring is compared with the characters in the ending character range; when the substring obtained by this character comparison is smaller than the tail sentence threshold, the substring obtained by the current character comparison is taken as the tail sentence in the target paragraph. Each paragraph of the description text is traversed in this way to determine the tail sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the tail substring has been selected, when the tail substring is within the starting character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the tail substring is used as the end time of the previous paragraph of the target paragraph.
Step 1025, determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the head sentence and the tail sentence of each paragraph, and matching the duration with the OCR recognition results carrying time stamps.
When the time range corresponding to the first sentence overlaps the time range corresponding to the last sentence, the duration obtained after the time ranges are merged is taken as the output result.
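A small helper sketching this merging of overlapping time ranges is shown below; the (start, end) tuples are assumed to be timestamps in seconds.

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) time ranges into non-overlapping durations."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:             # overlaps the previous range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# merge_ranges([(12.0, 30.0), (28.0, 41.0)]) -> [(12.0, 41.0)]
```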
The following provides the flow of a video subtitle matching method based on the common substring algorithm according to an alternative embodiment of the present application, where the algorithm is configured so that the starting character range and the ending character range are both 25 characters and the first sentence threshold and the last sentence threshold are both 4 characters (a code sketch of steps (3) to (6) is given after the list):
(1) Input a complete video, take one frame every second, crop the subtitle area of each frame and feed it into OCR recognition; check whether each OCR output contains Chinese and has a confidence greater than 0.99, and store all historical OCR results for deduplication.
(2) Matching the OCR recognition result with the current text paragraph requires checking whether the OCR result appears in the text and determining its specific position; a common substring algorithm is used here for the matching.
(3) For the output of the common substring algorithm, only the first and the last substrings are taken, and the positions of these two common substrings in the text are looked up; if neither is within the first 25 characters or the last 25 characters of the text, the OCR result is considered useless and is discarded.
(4) If the first substring is within the first 25 characters of the text, the OCR result is considered possibly useful; the common substrings of the OCR result and the first 25 characters of the text are computed again and the first common substring is taken, and if its starting position within those 25 characters is less than 4, the OCR result is considered to match the first sentence of the text.
(5) Similarly to (4), if the last substring is within the last 25 characters of the text, the OCR result is considered possibly useful; the common substrings of the OCR result and the last 25 characters of the text are computed again and the last common substring is taken, and if its end is less than 4 characters from the end of the text, the OCR result is considered to match the last sentence of the text.
(6) On the basis of (4) and (5), if the first sentence of the text is matched, the current time is recorded as the start time; if the last sentence of the text is matched, the next text segment is read; and if the beginning of the next text segment is matched, the previous text segment is considered to have ended, and the current time is recorded as the end time of the previous text segment, which is also the start time of the current text segment.
(7) For the final result, a post-processing step is performed: completely repeated contents are merged, and the multiple time ranges corresponding to the same text segment are merged to obtain the output result.
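The sketch below illustrates steps (3) to (6) for one OCR result against one text paragraph, reusing the common_substrings() helper from the earlier sketch. The window size (25) and the position thresholds (4) follow the configuration stated above; the exact index conventions are assumptions, since the patent describes them only in prose.

```python
WINDOW, THRESHOLD = 25, 4

def match_paragraph(ocr_text: str, paragraph: str):
    """Return 'first_sentence', 'last_sentence', or None for one OCR result and one paragraph."""
    subs = common_substrings(ocr_text, paragraph)     # helper from the earlier sketch
    if not subs:
        return None                                   # the OCR result does not appear in this paragraph
    head, tail = subs[0], subs[-1]
    head_window, tail_window = paragraph[:WINDOW], paragraph[-WINDOW:]

    # step (4): the first common substring must start near the beginning of the paragraph
    if head in head_window:
        head_subs = common_substrings(ocr_text, head_window)
        if head_subs and head_window.find(head_subs[0]) < THRESHOLD:
            return "first_sentence"

    # step (5): the last common substring must end near the end of the paragraph
    if tail in tail_window:
        tail_subs = common_substrings(ocr_text, tail_window)
        if tail_subs:
            end_pos = tail_window.rfind(tail_subs[-1]) + len(tail_subs[-1])
            if len(tail_window) - end_pos < THRESHOLD:
                return "last_sentence"

    return None                                       # step (3): neither boundary matched, discard
```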
And 103, forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set.
In the embodiment of the application, the specific labeling process is to make each text segment and the video OCR content corresponding to that segment into a semantic similarity data set. The data set is divided into three columns: the first column is the original text segment, the second column is the OCR recognition result, and the third column is the label, which is 0 or 1, where 0 represents dissimilar and 1 represents similar; the columns are separated by tab characters.
The labeling process for a particular data set is as follows: for an original text segment, all OCR recognition results within its duration are placed as one sentence in the second column and labeled 1 in the third column, and the two text segments before and after the duration of this text segment are then labeled 0, so that for each text segment the labeled data shown in fig. 4 can be obtained.
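A sketch of this labeling scheme is given below, assuming that `segments` is a list of (text_segment, ocr_results_within_its_duration) pairs in paragraph order; pairing each segment with the OCR content of its two neighbouring segments as negative examples is an interpretation of the description above, and the TSV layout follows the three-column format it specifies.

```python
def build_similarity_dataset(segments):
    """Build (text, ocr_text, label) rows: own-duration OCR -> 1, neighbouring segments' OCR -> 0."""
    rows = []
    for i, (text, ocr_results) in enumerate(segments):
        rows.append((text, "".join(ocr_results), 1))          # OCR lines within this segment's duration
        for j in (i - 1, i + 1):                              # the two neighbouring segments
            if 0 <= j < len(segments):
                rows.append((text, "".join(segments[j][1]), 0))
    return rows

def write_tsv(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        for text, ocr, label in rows:
            f.write(f"{text}\t{ocr}\t{label}\n")              # three tab-separated columns
```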
And 104, constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using a training set to obtain a trained video subtitle time alignment model.
In the embodiment of the present application, once the data set is available, it can be used to train the model; any deep language model pre-trained on Chinese can be used, such as BERT-Chinese or ERNIE. Taking BERT as an example, the model structure is shown in fig. 5.
Here [CLS] marks the beginning of each sentence, Tok1 to TokN represent all the characters in a sentence, and E1 to En are the encoding vectors of the characters. Each encoding vector is the sum of a character encoding and a position encoding; the character encoding can use existing word2vec vectors or can simply be randomly initialized.
The position-coding PE is calculated by:
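In the published text the formula itself appears only as an image. Consistent with the variables defined in the next paragraph, the standard sinusoidal form (stated here as an assumption, since the image is not reproduced) would be:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$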
wherein pos represents the position index; for example, if the length of the input text is 256, then 0 ≤ pos ≤ 255. d_model represents the vector dimension of the model and i represents a dimension index; for example, when the model vector dimension d_model is 512, 0 ≤ i ≤ 511.
In the BERT output, C is the vector representing the whole sentence and T1 to Tn are the vectors corresponding to Tok1 to TokN; only the sentence-level vector C is needed here.
Wherein the process of each round of training specifically comprises:
inputting each text segment and the corresponding OCR recognition result into the deep language model respectively, and processing them to obtain a first text vector and a second text vector; concatenating the first text vector and the second text vector, inputting the concatenated vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output and the labeling result is smaller than a preset threshold.
Specifically, as shown in fig. 6, a sentence A and a sentence B are input to obtain a vector u for sentence A and a vector v for sentence B; the two vectors are concatenated and input into a classifier, which is a multilayer perceptron that takes the concatenated sentence vectors and outputs 0 or 1. The model is trained through back propagation so that the final output is consistent with the labeling result. Once trained in this way, the model can replace the existing common substring algorithm and achieve better accuracy in video subtitle time alignment.
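A minimal sketch of this similarity model is given below, assuming the Hugging Face Transformers library and the public "bert-base-chinese" checkpoint; the class name, hidden size and training step are illustrative assumptions, since the patent only requires some Chinese pre-trained deep language model followed by a multilayer perceptron over the concatenated sentence vectors.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SubtitleAligner(nn.Module):
    def __init__(self, model_name="bert-base-chinese", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        dim = self.encoder.config.hidden_size
        # multilayer perceptron over the concatenated sentence vectors [u; v]
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def encode(self, batch):
        out = self.encoder(**batch)
        return out.last_hidden_state[:, 0]             # the [CLS] vector C representing the whole sentence

    def forward(self, text_batch, ocr_batch):
        u = self.encode(text_batch)                    # first text vector (description text segment)
        v = self.encode(ocr_batch)                     # second text vector (OCR recognition result)
        return self.mlp(torch.cat([u, v], dim=-1))     # logits for labels 0 (dissimilar) / 1 (similar)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = SubtitleAligner()
# one illustrative training step on a toy pair; in practice iterate over the labeled training set
text = tokenizer(["示例描述文本段落"], return_tensors="pt", padding=True, truncation=True)
ocr = tokenizer(["示例OCR识别结果"], return_tensors="pt", padding=True, truncation=True)
loss = nn.CrossEntropyLoss()(model(text, ocr), torch.tensor([1]))
loss.backward()
```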
Referring to fig. 7, a block diagram of a video subtitle time alignment model training system 200 according to an embodiment of the present application is shown. As shown in fig. 7, the system 200 may include: the system comprises an acquisition module 201, a determination module 202, a construction module 203 and a training module 204.
An obtaining module 201, configured to obtain an original video set with subtitles and a description text set, where the original video set includes multiple original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module 202 is configured to match the original video set with the corresponding description text set sequentially through a common substring algorithm, and determine an OCR recognition result corresponding to each paragraph in the description text set; OCR recognition results are used for representing the subtitle content in the original video set;
the building module 203 is configured to form a data set according to each segment of text and an OCR recognition result corresponding to the segment of text, and label the data set to obtain a training set;
the training module 204 is configured to construct a video subtitle time alignment model based on text semantic similarity matching, and train the video subtitle time alignment model by using a training set to obtain a trained video subtitle time alignment model.
For specific limitations of the video subtitle time alignment model training system, reference may be made to the above limitations of the video subtitle time alignment model training method, which are not repeated here. All or part of the modules in the video subtitle time alignment model training system can be implemented in software, in hardware, or in a combination of the two. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but all such combinations should be considered to be within the scope of this specification as long as there is no contradiction between them.
The above-mentioned embodiments only express several implementations of the present application; although their description is relatively specific and detailed, they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method for training a video subtitle time alignment model, the method comprising:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
2. The method of claim 1, wherein training the video caption time alignment model using the training set to obtain a trained video caption time alignment model comprises:
inputting each text segment and the corresponding OCR recognition result into a deep language model respectively, and processing them to obtain a first text vector and a second text vector;
and concatenating the first text vector and the second text vector, inputting the concatenated vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output and the labeling result is smaller than a preset threshold.
3. The method of claim 2, wherein the deep language model includes at least a BERT-Chinese model or an ERNIE model.
4. The method of claim 1, wherein the step of sequentially matching the original video set with the corresponding description text set by a common substring algorithm to determine the OCR recognition result corresponding to each paragraph in the description text set comprises:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
5. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a first sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
6. The method of claim 5, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the first substring, further comprising:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
7. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a final sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the substring obtained by the character comparison is smaller than the tail sentence threshold, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
8. The method of claim 7, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the tail substring, further comprising:
and when the tail substring is within the starting character range in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of the previous paragraph of the target paragraph.
9. The method according to claim 1, wherein the description text includes wrongly written words and/or uncommon words.
10. A video subtitle temporal alignment model training system, the system comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an original video set with subtitles and a description text set, the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111470819.0A CN114222193B (en) | 2021-12-03 | 2021-12-03 | Video subtitle time alignment model training method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114222193A true CN114222193A (en) | 2022-03-22 |
CN114222193B CN114222193B (en) | 2024-01-05 |
Family
ID=80699646
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111470819.0A Active CN114222193B (en) | 2021-12-03 | 2021-12-03 | Video subtitle time alignment model training method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114222193B (en) |
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060253439A1 (en) * | 2005-05-09 | 2006-11-09 | Liwei Ren | Matching engine for querying relevant documents |
CN105338419A (en) * | 2015-10-29 | 2016-02-17 | 网易传媒科技(北京)有限公司 | Subtitle collection generating method and apparatus |
US20190037168A1 (en) * | 2016-03-15 | 2019-01-31 | Sony Corporation | Transmission device, transmission method, reception device and reception method |
CN106210840A (en) * | 2016-06-29 | 2016-12-07 | 网易传媒科技(北京)有限公司 | A kind of text display method and equipment |
CN106604125A (en) * | 2016-12-29 | 2017-04-26 | 北京奇艺世纪科技有限公司 | Video subtitle determining method and video subtitle determining device |
CN108259963A (en) * | 2018-03-19 | 2018-07-06 | 成都星环科技有限公司 | A kind of TV ends player |
CN108683924A (en) * | 2018-05-30 | 2018-10-19 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus of video processing |
US20200322570A1 (en) * | 2019-04-08 | 2020-10-08 | Baidu.Com Times Technology (Beijing) Co., Ltd. | Method and apparatus for aligning paragraph and video |
US20200320308A1 (en) * | 2019-04-08 | 2020-10-08 | Nedelco, Incorporated | Identifying and tracking words in a video recording of captioning session |
CN111193878A (en) * | 2020-01-03 | 2020-05-22 | 北京字节跳动网络技术有限公司 | Multimedia text information processing method, device, medium and electronic equipment |
CN111309200A (en) * | 2020-01-17 | 2020-06-19 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and storage medium for determining extended reading content |
CN112084788A (en) * | 2020-08-19 | 2020-12-15 | 北京影谱科技股份有限公司 | Automatic marking method and system for implicit emotional tendency of image captions |
CN111931775A (en) * | 2020-09-28 | 2020-11-13 | 成都索贝数码科技股份有限公司 | Method, system, computer device and storage medium for automatically acquiring news headlines |
CN113033190A (en) * | 2021-04-19 | 2021-06-25 | 北京有竹居网络技术有限公司 | Subtitle generating method, device, medium and electronic equipment |
CN113191133A (en) * | 2021-04-21 | 2021-07-30 | 北京邮电大学 | Audio text alignment method and system based on Doc2Vec |
CN113159034A (en) * | 2021-04-23 | 2021-07-23 | 杭州电子科技大学 | Method and system for automatically generating subtitles by using short video |
Non-Patent Citations (1)
- Wang Xin (王新), "教学视频与习题的相似度分析方法" [A similarity analysis method for teaching videos and exercises], China Master's Theses Full-text Database, Information Science and Technology Series.
Also Published As
Publication number | Publication date |
---|---|
CN114222193B (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11776267B2 (en) | Intelligent cataloging method for all-media news based on multi-modal information fusion understanding | |
CN110245259B (en) | Video labeling method and device based on knowledge graph and computer readable medium | |
CN111582241B (en) | Video subtitle recognition method, device, equipment and storage medium | |
CN107305541B (en) | Method and device for segmenting speech recognition text | |
CN110083741B (en) | Character-oriented video abstract extraction method based on text and image combined modeling | |
US8433136B2 (en) | Tagging video using character recognition and propagation | |
CN107341143B (en) | Sentence continuity judgment method and device and electronic equipment | |
CN106980664B (en) | Bilingual comparable corpus mining method and device | |
CN103761261A (en) | Voice recognition based media search method and device | |
US20200364463A1 (en) | Intelligently generating digital note compilations from digital video | |
CN112382295B (en) | Speech recognition method, device, equipment and readable storage medium | |
Tensmeyer et al. | Training full-page handwritten text recognition models without annotated line breaks | |
CN104182381A (en) | character input method and system | |
CN114357206A (en) | Education video color subtitle generation method and system based on semantic analysis | |
CN113992944A (en) | Video cataloging method, device, equipment, system and medium | |
CN111695054A (en) | Text processing method and device, information extraction method and system, and medium | |
AlMousa et al. | Nlp-enriched automatic video segmentation | |
US8170289B1 (en) | Hierarchical alignment of character sequences representing text of same source | |
CN117290542A (en) | Video question-answering method, computer device and storage medium | |
Soykan et al. | A comprehensive gold standard and benchmark for comics text detection and recognition | |
Rasheed et al. | A deep learning-based method for Turkish text detection from videos | |
CN114222193B (en) | Video subtitle time alignment model training method and system | |
CN115396690A (en) | Audio and text combination method and device, electronic equipment and storage medium | |
CN116229313A (en) | Label construction model generation method and device, electronic equipment and storage medium | |
CN116011443A (en) | File element information identification method and device based on artificial intelligence |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |