CN114222193A - Video subtitle time alignment model training method and system - Google Patents

Video subtitle time alignment model training method and system

Info

Publication number
CN114222193A
CN114222193A (application CN202111470819.0A)
Authority
CN
China
Prior art keywords: recognition result, OCR recognition, paragraph, text, description text
Prior art date
Legal status
Granted
Application number
CN202111470819.0A
Other languages
Chinese (zh)
Other versions
CN114222193B (en)
Inventor
程梓益
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd
Priority to CN202111470819.0A
Publication of CN114222193A
Application granted
Publication of CN114222193B
Current legal status: Active
Anticipated expiration

Classifications

    • H04N 21/4884: Data services, e.g. news ticker, for displaying subtitles
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and system for training a video subtitle time alignment model. The method first obtains an original video set with subtitles and a corresponding set of description texts. The original videos are then matched against their description texts one by one using a common substring algorithm to determine the OCR recognition result corresponding to each paragraph of the description texts. A data set is formed from each text paragraph and its corresponding OCR recognition result and is labeled to obtain a training set. Finally, a video subtitle time alignment model based on text semantic similarity matching is constructed and trained with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the application thus solves the video subtitle time-matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.

Description

Video subtitle time alignment model training method and system
Technical Field
The invention relates to the technical field of multimedia, in particular to a method and a system for training a video subtitle time alignment model.
Background
With the continuous development of internet and multimedia technology, video has become a popular information carrier. To present video content better, subtitles are usually displayed while a user watches a video, and a description text corresponding to the video subtitles usually exists as well; this description text is generally divided into several, or even more than ten, paragraphs.
In the prior art, when the paragraphs of a description text are time-matched with the video subtitles, a common approach is to recognize the characters of the current video frame with OCR, record the current time, and then match the recognized text with the corresponding paragraph. However, because of wrongly written characters, uncommon characters and interference from the video background, this common approach cannot complete the task automatically.
Disclosure of Invention
Based on this, the embodiment of the application provides a video subtitle time alignment model training method and system, which can improve the accuracy of time matching between a video subtitle and a description text.
In a first aspect, a method for training a video subtitle time alignment model is provided, where the method includes:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
Optionally, training the video subtitle time alignment model by using the training set to obtain a trained video subtitle time alignment model, including:
inputting each text segment and an OCR recognition result into the deep language model respectively, and processing to obtain a first text vector and a second text vector;
and splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
Optionally, the deep language model includes at least a BERT-Chinese model or an ERNIE model.
Optionally, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set, including:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
Optionally, matching the OCR recognition result with each paragraph of the description text by using a common substring algorithm, and determining a first sentence of each paragraph of the OCR recognition result, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the starting position of the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by the current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the first substring, the method further includes:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
Optionally, matching the OCR recognition result with each paragraph of the description text by a common substring algorithm, and determining a tail sentence of the OCR recognition result in each paragraph, including:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the distance between the end position of the substring obtained by character comparison and the end of the target paragraph is smaller than the tail sentence threshold value, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
Optionally, the character comparison between the OCR recognition result and the target paragraph to find out all the continuous common substrings, and after selecting the tail substring, the method further includes:
and when the tail substring is in the range of the initial characters in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
Optionally, the description text includes wrongly written words and/or uncommon words.
In a second aspect, a video subtitle time alignment model training system is provided, which includes:
the acquisition module is used for acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
In the technical solution provided by the embodiments of the present application, an original video set with subtitles and a description text set are first obtained, where the original video set includes a plurality of original videos, each original video corresponds to one description text in the description text set, and the subtitle content of each original video corresponds to the content of its description text. The original video set is then matched against the corresponding description text set one by one through a common substring algorithm to determine the OCR recognition result corresponding to each paragraph of the description texts. A data set is formed from each text paragraph and its corresponding OCR recognition result and is labeled to obtain a training set. Finally, a video subtitle time alignment model based on text semantic similarity matching is constructed and trained with the training set to obtain the trained model. The video subtitle time alignment model provided by the embodiments of the present application therefore solves the video subtitle time-matching problems caused by wrongly written characters, uncommon characters and video background interference, and is more accurate than the existing common substring algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be apparent that the drawings in the following description are merely exemplary, and that other drawings can be derived from the provided drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart illustrating steps of a method for training a video subtitle time alignment model according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a result of a common substring algorithm provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a video subtitle area according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating tagging of a data set according to an embodiment of the present application;
fig. 5 is a diagram of a BERT model structure provided in an embodiment of the present application;
fig. 6 is a diagram of a video subtitle time alignment model structure according to an embodiment of the present application;
fig. 7 is a block diagram of a video subtitle time alignment model training system according to an embodiment of the present application.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It should be understood that the described embodiments are merely a part of the embodiments of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In actual tests, aligning time by matching with the existing common substring algorithm achieves only about 50% accuracy (inputting 100 videos ultimately yields only 50 usable results) because the texts contain too many wrongly written characters; the present application therefore provides an improved method that can, in principle, obtain a better result.
In this technical solution, a batch of videos and their corresponding description texts are first processed with the common substring method, the results are manually reviewed and failed cases are removed, and each text paragraph together with the video OCR content corresponding to it is then used as a data set to train a text similarity model.
To facilitate understanding of the present embodiment, a detailed description is first given of a training method for a video subtitle time alignment model disclosed in the embodiments of the present application. Referring to fig. 1, a flowchart of a method for training a video subtitle time alignment model according to an embodiment of the present application is shown, where the method may include the following steps:
step 101, obtaining an original video set with subtitles and a description text set.
The original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding descriptive text in the set of descriptive text.
In this embodiment, a batch of videos (more than 100) and their corresponding description texts are input. The text content corresponds to the subtitles in the videos, but the texts contain wrongly written characters, for example a "mark" in a description text that should correspond to a "peak" in the subtitles, and matching with an existing text matching method may fail because of such characters. Each text is divided into paragraphs according to punctuation marks, and the number of characters in each paragraph is not fixed.
Step 102, matching the original video set with the corresponding description text set sequentially through a common substring algorithm, and determining an OCR (optical character recognition) recognition result corresponding to each paragraph in the description text set.
The OCR recognition result is used to represent the subtitle content in the original video set. The batch of videos is processed with the common substring method, and after failed cases are removed by manual review, the duration corresponding to each text paragraph is obtained, as shown in fig. 2.
Specifically, each original video and the corresponding description text are matched through a common substring algorithm to obtain OCR recognition results of each paragraph in the description text as follows:
step 1021, acquiring the original video and the corresponding description text.
Wherein the content of the description text corresponds to the content of the subtitles in the original video;
and step 1022, intercepting the subtitle region in the original video according to the preset frame taking interval time to obtain a subtitle region image set.
Step 1023, inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp.
The subtitle region image set includes a corresponding timestamp in the original video, and the preset frame taking interval time may be one second.
In this embodiment, a complete original video with subtitles is input and one frame is taken every second; as shown in fig. 3, the subtitle area of each frame is cropped and fed into OCR recognition. Each OCR output is checked for whether it contains Chinese and whether its confidence is greater than 0.99, and all historical OCR results are saved for deduplication.
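A minimal sketch of this frame-sampling and filtering step is given below; the use of OpenCV, the fixed crop ratio and the ocr_recognize() placeholder are illustrative assumptions, not details taken from the patent:

import re
import cv2  # OpenCV, used here only to decode video frames

def ocr_recognize(image):
    """Hypothetical OCR hook; plug in any engine that returns (text, confidence)."""
    raise NotImplementedError

def extract_subtitle_ocr(video_path, crop_top=0.8):
    """Sample one frame per second, crop the subtitle strip, keep confident Chinese OCR results."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25
    results, seen = [], set()
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % fps == 0:                          # one frame every second
            h = frame.shape[0]
            strip = frame[int(h * crop_top):, :]          # assumed subtitle region at the bottom
            text, conf = ocr_recognize(strip)
            # keep results containing Chinese with confidence > 0.99, deduplicated
            if conf > 0.99 and re.search(r"[\u4e00-\u9fff]", text) and text not in seen:
                seen.add(text)
                results.append((frame_idx / fps, text))   # (timestamp in seconds, text)
        frame_idx += 1
    cap.release()
    return results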
Step 1024, matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining the head sentence and the tail sentence of the OCR recognition result in each paragraph.
In this embodiment, the OCR recognition result is matched against the current text paragraph; it is necessary to check whether the OCR result appears in the text and to determine its specific position.
The common substring algorithm works as follows: given a character string A and a character string B, each character in A is compared with the characters in B in turn to find all continuous common substrings. For example, with input A = 'ACCCDC' and B = 'ACGSBCDEF', the output is 'AC' and 'BCD'.
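The following is a minimal sketch of such a common substring search; the function name, the min_len parameter and the example strings in the final comment are illustrative assumptions rather than the exact routine used in this application:

def common_substrings(a, b, min_len=2):
    """Return every maximal run of characters shared by a and b, in the order they occur in a."""
    found = []
    i = 0
    while i < len(a):
        best = ""
        # longest substring of a starting at position i that also occurs somewhere in b
        for j in range(len(a), i, -1):
            if a[i:j] in b:
                best = a[i:j]
                break
        if len(best) >= min_len:
            found.append(best)
            i += len(best)
        else:
            i += 1
    return found

# illustrative call: common_substrings("ACXBCDY", "ACGSBCDEF") returns ['AC', 'BCD']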
In this embodiment, the OCR recognition result is compared character by character with the target paragraph to find all continuous common substrings, and the head substring, i.e. the first continuous common substring, is selected. When the head substring lies within the starting character range of the target paragraph, the OCR recognition result corresponding to the head substring is compared character by character with the characters in that starting range; when the starting position of the substring obtained by this comparison is smaller than the first sentence threshold, the OCR recognition result is taken as matching the first sentence of the target paragraph. Each paragraph of the description text is traversed in this way to determine the first sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the head substring has been selected, if the head substring lies within the ending character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the head substring is used as the start time of the paragraph following the target paragraph.
In this embodiment, the OCR recognition result is likewise compared character by character with the target paragraph to find all continuous common substrings, and the tail substring, i.e. the last continuous common substring, is selected. When the tail substring lies within the ending character range of the target paragraph, the OCR recognition result corresponding to the tail substring is compared character by character with the characters in that ending range; when the distance between the end position of the substring obtained by this comparison and the end of the paragraph is smaller than the tail sentence threshold, the OCR recognition result is taken as matching the tail sentence of the target paragraph. Each paragraph of the description text is traversed in this way to determine the tail sentence of the OCR recognition result in each paragraph.
In an optional embodiment, after the OCR recognition result has been compared with the target paragraph to find all continuous common substrings and the tail substring has been selected, if the tail substring lies within the starting character range of the target paragraph, the time stamp of the OCR recognition result corresponding to the tail substring is used as the end time of the paragraph preceding the target paragraph.
Step 1025, determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
When the time range corresponding to the first sentence overlaps the time range corresponding to the last sentence, the merged time range is taken as the output result.
The flow of a video subtitle matching method based on the common substring algorithm according to an alternative embodiment of the present application is given below. In this configuration the starting character range and the ending character range are both 25 characters, and the first sentence threshold and the tail sentence threshold are both 4 characters (a code sketch of the head-sentence check follows this list):
(1) A complete video is input, one frame is taken every second, the subtitle area of each frame is cropped and fed into OCR recognition; each OCR output is checked for whether it contains Chinese and whether its confidence is greater than 0.99, and all historical OCR results are saved for deduplication.
(2) The OCR recognition result is matched against the current text paragraph: it is necessary to check whether the OCR result appears in the text and to determine its specific position. The common substring algorithm is used here for matching.
(3) For the output of the common substring algorithm, only the first and the last substrings are taken, and the positions of these two common substrings in the text are searched; if neither lies within the first 25 characters or the last 25 characters of the text, the OCR result is considered useless and is discarded.
(4) If the first substring lies within the first 25 characters of the text, the OCR result is considered possibly useful; a further common substring comparison is performed between the OCR result and the first 25 characters of the text, the first common substring is taken, and if its starting position within those 25 characters is less than 4, the OCR result is considered to match the first sentence of the text.
(5) Similarly to (4), if the last substring lies within the last 25 characters of the text, the OCR result is considered possibly useful; a further common substring comparison is performed between the OCR result and the last 25 characters of the text, the last common substring is taken, and if the distance from its end position to the end of the text is less than 4, the OCR result is considered to match the last sentence of the text.
(6) On the basis of (4) and (5), if the first sentence of a text is matched, the current time is recorded as its start time; if the last sentence of the text is matched, the next text is read. If the beginning of the second text is matched, the previous text is considered finished, and the current time is recorded as the end time of the previous text and also as the start time of the current text.
(7) The final result is post-processed: completely repeated contents are merged, and the multiple time ranges corresponding to the same text are merged to obtain the output result.
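A minimal sketch of the head-sentence check in steps (3) and (4) above is given here; the function name and the reuse of the common_substrings() helper from the earlier sketch are illustrative assumptions, with the 25-character window and 4-character threshold of this embodiment:

START_WINDOW = 25    # starting character range of a paragraph
HEAD_THRESHOLD = 4   # first-sentence position threshold

def matches_paragraph_head(ocr_text, paragraph):
    subs = common_substrings(ocr_text, paragraph)
    if not subs:
        return False
    # step (3): the first common substring must fall inside the opening window
    if paragraph.find(subs[0]) >= START_WINDOW:
        return False
    # step (4): compare again against the opening window only
    head = paragraph[:START_WINDOW]
    head_subs = common_substrings(ocr_text, head)
    # accept when the match starts within the first few characters of the window
    return bool(head_subs) and head.find(head_subs[0]) < HEAD_THRESHOLD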
And 103, forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set.
In this embodiment, the specific labeling process is as follows: each text paragraph and the video OCR content corresponding to it are made into a semantic similarity data set with three columns, where the first column is the original text paragraph, the second column is the OCR recognition result, and the third column is the label, 0 or 1, with 0 meaning dissimilar and 1 meaning similar; the columns are separated by tab characters.
Concretely, for an original text paragraph, all OCR recognition results within its duration are concatenated as one sentence and placed in the second column with label 1 in the third column; the two text paragraphs immediately before and after that duration are paired with the same OCR results and labeled 0. For each text paragraph, the labeled data shown in fig. 4 are thus obtained.
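The labeling step can be sketched as follows; make_pairs(), write_tsv() and the shape of their inputs are illustrative assumptions for building the tab-separated, three-column data set described above:

def make_pairs(paragraphs, ocr_per_paragraph):
    """paragraphs: original text paragraphs; ocr_per_paragraph: OCR text gathered within each paragraph's duration."""
    rows = []
    for idx, ocr_text in enumerate(ocr_per_paragraph):
        rows.append((paragraphs[idx], ocr_text, 1))        # OCR within the duration: similar
        for j in (idx - 1, idx + 1):                       # neighboring paragraphs: dissimilar
            if 0 <= j < len(paragraphs):
                rows.append((paragraphs[j], ocr_text, 0))
    return rows

def write_tsv(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        for text, ocr, label in rows:
            f.write(f"{text}\t{ocr}\t{label}\n")           # column 1: paragraph, 2: OCR, 3: label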
Step 104, constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain a trained video subtitle time alignment model.
In this embodiment, once the data set is available it can be used to train the model; any Chinese pre-trained deep language model can be used, such as BERT-Chinese or ERNIE. Taking BERT as an example, the model structure is shown in fig. 5.
Here [CLS] marks the beginning of each sentence, Tok1...TokN represent the characters of the sentence, and E1...En are the encoding vectors of the characters; each encoding vector is the sum of a character encoding and a position encoding, where the character encoding can use existing word2vec vectors or can simply be initialized to random vectors.
The position encoding PE is calculated as:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos is the position index (for example, 0 ≤ pos ≤ 255 if the input text length is 256), d_model is the vector dimension of the model, and i is a dimension index (for example, 0 ≤ i ≤ 511 when d_model = 512).
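A short numeric sketch of this position encoding, assuming the standard Transformer sinusoidal formulation that the pos, i and d_model definitions above describe:

import numpy as np

def position_encoding(max_len, d_model):
    """Sinusoidal position encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                            # position index 0..max_len-1
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

pe = position_encoding(256, 512)   # e.g. input length 256, model dimension 512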
In the BERT output, C is the vector representing the whole sentence and T1...Tn are the vectors corresponding to Tok1...TokN; here only the sentence vector C, which represents the whole sentence, is needed.
Wherein the process of each round of training specifically comprises:
inputting each text paragraph and the corresponding OCR recognition result into the deep language model respectively, and processing them to obtain a first text vector and a second text vector; splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
Specifically, as shown in fig. 6, sentence A and sentence B are input to obtain vector u for sentence A and vector v for sentence B; the two vectors are spliced and fed into a classifier, which is a multilayer perceptron that takes the spliced sentence vector as input and outputs 0 or 1. The model is trained by back propagation until the final output is consistent with the labeled result. A model trained in this way can replace the existing common substring algorithm and obtains better accuracy on video subtitle time alignment.
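A hedged sketch of such a sentence-pair model is given below; the Hugging Face bert-base-chinese checkpoint, the use of the pooled [CLS] vector and the MLP sizes are illustrative assumptions, not details fixed by the patent:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SubtitleAlignModel(nn.Module):
    def __init__(self, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        self.mlp = nn.Sequential(               # classifier over the concatenated sentence pair
            nn.Linear(hidden * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, inputs_a, inputs_b):
        u = self.bert(**inputs_a).pooler_output  # sentence-A vector
        v = self.bert(**inputs_b).pooler_output  # sentence-B vector
        return self.mlp(torch.cat([u, v], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = SubtitleAlignModel()
a = tokenizer("段落文本", return_tensors="pt")
b = tokenizer("OCR识别结果", return_tensors="pt")
logits = model(a, b)   # train with cross-entropy against the 0/1 labels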
Referring to fig. 7, a block diagram of a video subtitle time alignment model training system 200 according to an embodiment of the present application is shown. As shown in fig. 7, the system 200 may include: the system comprises an acquisition module 201, a determination module 202, a construction module 203 and a training module 204.
An obtaining module 201, configured to obtain an original video set with subtitles and a description text set, where the original video set includes multiple original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module 202 is configured to match the original video set with the corresponding description text set sequentially through a common substring algorithm, and determine an OCR recognition result corresponding to each paragraph in the description text set; OCR recognition results are used for representing the subtitle content in the original video set;
the building module 203 is configured to form a data set according to each segment of text and an OCR recognition result corresponding to the segment of text, and label the data set to obtain a training set;
the training module 204 is configured to construct a video subtitle time alignment model based on text semantic similarity matching, and train the video subtitle time alignment model by using a training set to obtain a trained video subtitle time alignment model.
For specific limitations of the video subtitle time alignment model training system, reference may be made to the above limitations of the video subtitle time alignment model training method, and details are not repeated here. All or part of the modules in the video subtitle time alignment model training system can be realized by software, by hardware, or by a combination of both. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for training a video subtitle time alignment model, the method comprising:
acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
matching an original video set with a corresponding description text set sequentially through a common substring algorithm, and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
2. The method of claim 1, wherein training the video subtitle time alignment model using the training set to obtain a trained video subtitle time alignment model comprises:
inputting each text segment and an OCR recognition result into the deep language model respectively, and processing to obtain a first text vector and a second text vector;
and splicing the first text vector and the second text vector, inputting the spliced vector into a multilayer perceptron to obtain the training result of the current round, comparing the training result of the current round with the labeling information, and adjusting the model parameters according to the comparison result; the trained video subtitle time alignment model is obtained when the difference between the model output result and the labeling result is smaller than a preset threshold value.
3. The method of claim 2, wherein the deep language model comprises at least a BERT-Chinese model or an ERNIE model.
4. The method of claim 1, wherein the step of sequentially matching the original video set with the corresponding description text set by a common substring algorithm to determine the OCR recognition result corresponding to each paragraph in the description text set comprises:
acquiring an original video and a corresponding description text, wherein the content of the description text corresponds to the content of subtitles in the original video;
intercepting a subtitle region in the original video according to a preset frame taking interval time to obtain a subtitle region image set, wherein the subtitle region image set comprises a corresponding timestamp in the original video;
inputting the subtitle region image set into an OCR recognition model for OCR recognition to obtain an OCR recognition result with a time stamp;
matching the OCR recognition result with each paragraph of the description text through a common substring algorithm, and determining a head sentence and a tail sentence of the OCR recognition result in each paragraph;
and determining the duration of each paragraph of the description text in the original video according to the time stamps corresponding to the first sentence and the last sentence of each paragraph, and matching the duration with the OCR recognition result with the time stamp.
5. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a first sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a first substring from the common substrings, wherein the first substring is used for representing a first continuous common substring;
when the first sub-string is in a starting character range in the target paragraph, performing character comparison on an OCR recognition result corresponding to the first sub-string and characters in the starting character range;
when the starting position of the substring obtained by character comparison is smaller than the first sentence threshold value, the substring obtained by the current character comparison is used as the first sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the first sentence of each paragraph of the OCR recognition result.
6. The method of claim 5, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the first substring, further comprising:
and when the first substring is in the end character range of the target paragraph, taking the time stamp of the OCR recognition result corresponding to the first substring as the starting time of the next paragraph of the target paragraph.
7. The method of claim 4, wherein matching the OCR recognition result with each paragraph of the description text by a common substring algorithm to determine a final sentence of each paragraph of the OCR recognition result comprises:
comparing the OCR recognition result with the target paragraph to find out all continuous common substrings, and selecting a tail substring from the common substrings, wherein the tail substring is used for representing the last continuous common substring;
when the tail string is in the ending character range in the target paragraph, performing character comparison on the OCR recognition result corresponding to the tail string and characters in the ending character range;
when the distance between the end position of the substring obtained by character comparison and the end of the target paragraph is smaller than the tail sentence threshold value, the substring obtained by the current character comparison is used as the tail sentence in the target paragraph;
and traversing each paragraph of the description text, and determining the tail sentence of the OCR recognition result in each paragraph.
8. The method of claim 7, wherein character-comparing the OCR recognition result with the target paragraph to find all the consecutive common substrings, and after selecting the tail substring, further comprising:
and when the tail substring is in the range of the initial characters in the target paragraph, taking the time stamp of the OCR recognition result corresponding to the tail substring as the end time of the paragraph preceding the target paragraph.
9. The method according to claim 1, wherein the description text includes wrongly written words and/or uncommon words.
10. A video subtitle temporal alignment model training system, the system comprising:
the acquisition module is used for acquiring an original video set with subtitles and a description text set, wherein the original video set comprises a plurality of original videos, and each original video corresponds to one description text in the description text set; the content of the subtitles in the original video corresponds to the content of the corresponding description text in the description text set;
the determining module is used for matching the original video set with the corresponding description text set sequentially through a common substring algorithm and determining an OCR recognition result corresponding to each paragraph in the description text set; the OCR recognition result is used for representing the subtitle content in the original video set;
the construction module is used for forming a data set according to each section of text and an OCR recognition result corresponding to the section of text, and labeling the data set to obtain a training set;
and the training module is used for constructing a video subtitle time alignment model based on text semantic similarity matching, and training the video subtitle time alignment model by using the training set to obtain the trained video subtitle time alignment model.
CN202111470819.0A 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system Active CN114222193B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470819.0A CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470819.0A CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Publications (2)

Publication Number Publication Date
CN114222193A true CN114222193A (en) 2022-03-22
CN114222193B CN114222193B (en) 2024-01-05

Family

ID=80699646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470819.0A Active CN114222193B (en) 2021-12-03 2021-12-03 Video subtitle time alignment model training method and system

Country Status (1)

Country Link
CN (1) CN114222193B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060253439A1 (en) * 2005-05-09 2006-11-09 Liwei Ren Matching engine for querying relevant documents
CN105338419A (en) * 2015-10-29 2016-02-17 网易传媒科技(北京)有限公司 Subtitle collection generating method and apparatus
US20190037168A1 (en) * 2016-03-15 2019-01-31 Sony Corporation Transmission device, transmission method, reception device and reception method
CN106210840A (en) * 2016-06-29 2016-12-07 网易传媒科技(北京)有限公司 A kind of text display method and equipment
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN108259963A (en) * 2018-03-19 2018-07-06 成都星环科技有限公司 A kind of TV ends player
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
US20200322570A1 (en) * 2019-04-08 2020-10-08 Baidu.Com Times Technology (Beijing) Co., Ltd. Method and apparatus for aligning paragraph and video
US20200320308A1 (en) * 2019-04-08 2020-10-08 Nedelco, Incorporated Identifying and tracking words in a video recording of captioning session
CN111193878A (en) * 2020-01-03 2020-05-22 北京字节跳动网络技术有限公司 Multimedia text information processing method, device, medium and electronic equipment
CN111309200A (en) * 2020-01-17 2020-06-19 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for determining extended reading content
CN112084788A (en) * 2020-08-19 2020-12-15 北京影谱科技股份有限公司 Automatic marking method and system for implicit emotional tendency of image captions
CN111931775A (en) * 2020-09-28 2020-11-13 成都索贝数码科技股份有限公司 Method, system, computer device and storage medium for automatically acquiring news headlines
CN113033190A (en) * 2021-04-19 2021-06-25 北京有竹居网络技术有限公司 Subtitle generating method, device, medium and electronic equipment
CN113191133A (en) * 2021-04-21 2021-07-30 北京邮电大学 Audio text alignment method and system based on Doc2Vec
CN113159034A (en) * 2021-04-23 2021-07-23 杭州电子科技大学 Method and system for automatically generating subtitles by using short video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王新 (Wang Xin): "Similarity analysis method for teaching videos and exercises", China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN114222193B (en) 2024-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant