CN116634223A - Subtitle extraction method based on video text merging, filtering and classifying - Google Patents
- Publication number
- CN116634223A CN116634223A CN202310579487.2A CN202310579487A CN116634223A CN 116634223 A CN116634223 A CN 116634223A CN 202310579487 A CN202310579487 A CN 202310579487A CN 116634223 A CN116634223 A CN 116634223A
- Authority
- CN
- China
- Prior art keywords
- text
- text box
- video
- boxes
- subtitle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/454—Content or additional data filtering, e.g. blocking advertisements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/454—Content or additional data filtering, e.g. blocking advertisements
- H04N21/4545—Input to filtering algorithms, e.g. filtering a region of the image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4665—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8126—Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a subtitle extraction method based on merging, filtering and classifying video text. The method extracts frames from a video and recognises the text in every frame with optical character recognition (OCR) to obtain a set of text boxes in the video; the text boxes are then merged and filtered according to text content, text box coordinates, appearance time and similar features; finally, a machine-learning subtitle classification model predicts whether each text box is a subtitle, and the text judged to be of subtitle type, together with its position information, is stored as the subtitle information of the video. Merging and filtering the text boxes preliminarily removes most text that does not belong to the subtitle type, and the machine-learning subtitle classification model further determines the type of each text box. The method needs no predefined subtitle region and can therefore handle the highly variable subtitle positions of present-day Internet videos.
Description
Technical Field
The invention relates to the field of computer technology, in particular to computer vision and machine learning, and specifically to a subtitle extraction method based on merging, filtering and classifying video text.
Background
With the growth of self-media and electronic commerce, the volume of short-video creation has increased dramatically. Extracting the subtitle information from a video allows it to be combined with large semantic-understanding models, adding a text dimension that assists video understanding on top of image and picture understanding. The subtitle information also serves many other scenarios, such as multi-language translation for cross-border distribution, or recognising that a video lacks subtitles so that subtitles can be generated for it.
Subtitles come in different kinds, such as title subtitles, captions and dialogue subtitles. In existing subtitle extraction technology, the common method is to define a subtitle region first and then extract the characters inside that region with optical character recognition; this approach suits videos such as film and television programmes whose subtitle positions follow certain rules. There are also multi-modal subtitle extraction models based on speech recognition combined with optical character recognition. These mainly extract dialogue subtitles and, although they can effectively increase the accuracy of the subtitle text, they are difficult to apply to subtitle types that have no accompanying speech.
Subtitle positions in short videos vary widely, so no fixed template can extract the subtitle information; how to extract multiple types of subtitles with one general method is therefore a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, to solve the problems of pertinence and universality that general subtitle extraction methods face in the short-video field, and to provide a subtitle extraction method based on merging, filtering and classifying video text.
To solve these technical problems, the invention provides a subtitle extraction method based on video text merging, filtering and classifying, characterised by comprising the following steps:
step 1, extracting frames from the video at preset time intervals and performing text detection and recognition on the frame images: detecting the text in all frame images with optical character recognition (OCR) technology to form text boxes, then deriving multi-dimensional features from the text box information and the video time axis, the features including the text content, vertex coordinates and appearance time of each text box, to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging using the text box information within each single frame image and across a plurality of continuous frame images to obtain a second text box set, updating the vertex coordinates of the merged text boxes, and generating a duration list from the occurrence times of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions using their multidimensional features;
step 4, for the multidimensional features of the text boxes in the filtered second text box set, extracting a feature vector for each text box from its time-domain and space-domain information, for training a machine learning algorithm;
step 5, inputting the feature vectors of the second text box set into a machine-learning subtitle classification model for training and for judging whether each text box is a subtitle;
and step 6, taking the text boxes judged to be subtitles as a third text box set, and setting the third text box set as the subtitle information of the video.
In the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, for more than one text box with the same appearance time, processing the text boxes from top to bottom in order of their y-axis coordinates; computing from the vertex coordinates the midpoint coordinates of each text box's height and width; applying a preset decision rule based on these midpoint coordinates and merging the different text boxes that satisfy it; updating the text content and vertex coordinates of the merged new text box, and keeping the height of the original text box as the character height of the merged text;
and step 2-2, for the text boxes of all frames in a single video, computing the text similarity and text box intersection-over-union of adjacent frames, and merging the text boxes that satisfy a preset intersection-over-union threshold.
In the step 3, the method for filtering the second text box set includes:
step 3-1, deleting the text boxes with the duration exceeding a preset duration threshold from the second text box set according to the duration list;
deleting the text boxes exceeding a preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates;
calculating the inclination angle of the text box according to the vertex coordinates of the text box, and deleting the text box with the inclination angle larger than a preset inclination threshold value from the second text box set;
presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
In the step 4, determining the classification characteristic of each text box includes:
calculating a median of durations of individual characters in all text boxes in the second set of text boxes;
calculating, for each text box, the absolute value of the difference between its single-character duration and that median as feature one;
calculating a median by using the character heights of all text boxes in the second text box set, and normalizing the median according to the video pixel heights;
calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all the text boxes to obtain a second characteristic;
calculating a thermodynamic diagram of a text region according to the coordinate position and duration of the text boxes, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all text boxes in the second text box set;
calculating the absolute value of the difference between the thermodynamic average value of each text box and the thermodynamic average value median of the whole video, and taking the absolute value as a third characteristic;
and combining the first feature, the second feature and the third feature into a feature vector of each text box.
In the step 5, the caption classification model is based on an adaboost algorithm, and the construction method of the caption classification model comprises the following steps:
collecting a plurality of videos for training and, after the processing of steps 1-4, extracting classification features from the text boxes in the second text box set, the three-dimensional feature vector of each text box serving as one training sample;
labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
In the step 2-1, the decision rule includes:
calculating a first difference absolute value of the heights of the current text box and the text box to be combined, wherein the first difference absolute value is smaller than a preset first pixel value;
calculating a second difference absolute value of a maximum value of the y-axis coordinates of the current text box and a minimum coordinate value of the y-axis of the text box to be combined, wherein the second difference absolute value is smaller than a preset second pixel value;
and calculating a third difference absolute value of the midpoint coordinates of the widths of the current text box and the text box to be combined, wherein the third difference absolute value is smaller than a preset third pixel value.
In the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
The beneficial effects of this scheme: the method detects, recognises, merges, filters and classifies the text information in video frame images, in particular through text merging and filtering, classification feature extraction and construction of a subtitle classification model, and thereby achieves automatic detection and extraction of the subtitle information in videos.
Drawings
FIG. 1 is a schematic flow diagram of a method of an exemplary embodiment of the present invention;
FIG. 2 is a schematic view of OCR recognition results according to an exemplary embodiment of the present invention;
FIG. 3 is a diagram of text merge results according to an exemplary embodiment of the present invention;
FIG. 4 is a flow chart of a text box merge method in accordance with an exemplary embodiment of the present invention;
FIG. 5 is a flow chart of a text box filtering method according to an exemplary embodiment of the invention;
FIG. 6 is a schematic diagram of a text box classification feature extraction flow in accordance with an exemplary embodiment of the invention;
FIG. 7 is a thermal schematic diagram of a text box according to an exemplary embodiment of the invention;
fig. 8 is a schematic diagram of a machine learning subtitle classification model construction flow according to an exemplary embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention, examples of which are illustrated in the accompanying drawings, wherein the embodiments described are by way of illustration only and not by way of limitation.
The general method flow diagram of the embodiment of the invention shown in fig. 1 mainly comprises:
and step 1, determining a time interval, extracting frames from the video, and performing text detection and recognition. One frame of image per second may be extracted from the target video, and for each frame of image, text within all video frame images is detected using Optical Character Recognition (OCR) techniques to form a text box, the text box being schematically shown in fig. 2, comprising four differently located text blocks of text 1, text 2, text 3 and text 4 prior to text detection using OCR techniques. OCR techniques first detect the position of each text block in an image and predict a corresponding tilted or non-tilted detection box for each text block. Thus, after text detection, text 1, text 2, text 3 and text 4 with text boxes appear in the image, wherein the text boxes of text 4 are tilted.
And then counting multi-dimensional characteristics according to the text box information and the video time axis information, wherein the multi-dimensional characteristics comprise text content, vertex coordinates and appearance time of each text box, and the multi-dimensional characteristics are used for obtaining a first text box set.
Specifically, the optical character recognition model used can recognize oblique text or irregularly arranged text, and can recognize only a single line of text, and the recognition result includes text content and four vertex coordinates of an irregularly rectangular text box.
Specifically, the appearance time refers to how many seconds this text box appears in the video.
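The frame-sampling side of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `TextBox` is a hypothetical record for one OCR result (the OCR engine itself is not shown), and `sample_frame_indices` computes which frames to extract at a one-second interval.

```python
from dataclasses import dataclass, field

@dataclass
class TextBox:
    """One OCR result: text content, four vertex coordinates, appearance times."""
    text: str
    vertices: list                               # [(x0, y0), (x1, y1), (x2, y2), (x3, y3)]
    seconds: list = field(default_factory=list)  # seconds at which this box appears

def sample_frame_indices(fps: float, total_frames: int, interval_s: float = 1.0):
    """Indices of the frames to extract, one per `interval_s` seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))
```

For a 10-second clip at 30 fps this yields ten frame indices, 0, 30, ..., 270, matching the one-frame-per-second sampling described above.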
And 2, for the text boxes in the first text box set, performing text merging by using the text box information of each text box in a single video frame and among a plurality of continuous frames to obtain a second text box set, updating vertex coordinates of the merged text boxes, and generating a duration list according to the occurrence time of the merged text boxes.
The result of text merging is shown in fig. 3: in fig. 3(1), texts 2 and 3, which have similar positions and the same size, are merged into the same text box.
The text box merging method is shown in fig. 4, and the merging method for obtaining the second text box set mainly comprises the following steps:
For more than one text box with the same appearance time, the text boxes are processed from top to bottom in order of their y-axis coordinates; the midpoint coordinates of each text box's height and width are computed from its vertex coordinates; a preset decision rule based on these midpoint coordinates is applied, and the different text boxes that satisfy it are merged. The text content and vertex coordinates of the merged new text box are updated, and the height of the original text box is kept as the character height of the merged text.
The decision rule requires that all of the following hold simultaneously:
the absolute value of the difference between the heights of the current text box and the candidate text box is smaller than a preset first pixel value of 6 pixels, which ensures that the merged text boxes have similar font sizes;
the absolute value of the difference between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the candidate text box is smaller than a preset second pixel value of 20 pixels, which ensures that the merged boxes are close in the y-axis direction;
the absolute value of the difference between the width midpoints of the current text box and the candidate text box is smaller than a preset third pixel value of 80 pixels, which ensures that the merged boxes are roughly aligned horizontally.
In fig. 3(2), only texts 1, 2 and 3 satisfy all three rules at the same time and may be merged into one text box; text 4 satisfies only the first two rules, so it cannot take part in the merging.
Specifically, since the optical character recognition model can only recognise a single line of text, the text boxes within each video frame are merged first, so that the boxes belonging to the same text content in each second are combined.
Specifically, text boxes belonging to the same text content within a second generally share the same font size, sit at adjacent heights in the picture, and are roughly centre-aligned, so text merging can be performed according to the height, the y-axis coordinates and the width midline of the text boxes respectively.
Specifically, when the text contents are combined, the text contents are combined in sequence from top to bottom according to the y-axis coordinates of the text boxes, so that the correct reading sequence of the text contents can be maintained.
Specifically, when the text boxes are merged, four-point coordinates of the merged text boxes are counted, and vertex coordinates of the merged text boxes are updated.
Specifically, when merging text boxes, the height of the original single-line text box should be kept: it represents the character height of the merged text and is one of the multidimensional features of the merged text box.
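The three-condition decision rule of step 2-1 can be sketched as a single predicate. The 6/20/80-pixel thresholds are the patent's own; the dictionary representation of a box (keys `height`, `y_min`, `y_max`, `x_mid`) is an assumption for illustration.

```python
# Thresholds from the patent: 6, 20 and 80 pixels.
H_THRESH, Y_THRESH, X_THRESH = 6, 20, 80

def can_merge(cur: dict, cand: dict) -> bool:
    """Decision rule of step 2-1: all three conditions must hold simultaneously.
    `cur` sits above `cand` (boxes are visited top to bottom by y coordinate).
    Each box is a dict with keys: height, y_min, y_max, x_mid (width midpoint)."""
    similar_font = abs(cur["height"] - cand["height"]) < H_THRESH  # similar font size
    adjacent_y = abs(cur["y_max"] - cand["y_min"]) < Y_THRESH      # vertically adjacent
    aligned_x = abs(cur["x_mid"] - cand["x_mid"]) < X_THRESH       # roughly centre-aligned
    return similar_font and adjacent_y and aligned_x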
For the text boxes of all frames in a single video, the text similarity and the text box intersection-over-union of adjacent frames are computed, and text boxes meeting a preset intersection-over-union threshold are merged.
Specifically, merging the text boxes of all frames in a single video combines boxes with the same content, so that the duration and position change of each text box can be counted.
Specifically, since the optical character recognition result has a certain error rate, an appropriate text similarity measure and threshold should be selected.
In particular, since different subtitles within a video may have similar text content, text boxes are only merged across adjacent frames, so the values in a box's duration list should be consecutive.
Specifically, a video picture may contain rolling captions, or scene text that was not added in post-production; the positions of these text boxes change over time, so a suitable intersection-over-union threshold should be set, and the offset between the initial and final text box positions recorded.
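The cross-frame merge test of step 2-2 can be sketched as below. The patent does not name a similarity measure or thresholds, so the `SequenceMatcher` ratio and the 0.8/0.5 thresholds here are assumptions; the axis-aligned `(x0, y0, x1, y1)` box format is likewise for illustration only.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # OCR output is noisy, so fuzzy matching is used instead of exact equality.
    return SequenceMatcher(None, a, b).ratio()

def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def should_merge_across_frames(a, b, sim_thresh=0.8, iou_thresh=0.5) -> bool:
    # Merge only boxes in adjacent frames whose text and position both match.
    return (text_similarity(a["text"], b["text"]) >= sim_thresh
            and iou(a["bbox"], b["bbox"]) >= iou_thresh)
```

A lower `iou_thresh` would tolerate rolling captions that drift between frames; recording the positional offset, as the patent suggests, then feeds the later filtering step.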
And 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions by using the multidimensional features of the text boxes.
The text box filtering method is shown in fig. 5, and the filtering method mainly comprises the following steps:
and deleting the text boxes with the duration exceeding the preset duration threshold from the second text box set according to the duration list.
Specifically, persistent text is usually an advertisement, watermark or similar, so text boxes whose duration is too long can be removed via the video's duration list.
And deleting the text boxes exceeding the preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates.
The tilt angle of each text box is calculated from its vertex coordinates, and text boxes whose tilt angle is larger than a preset tilt threshold are deleted from the second text box set.
In particular, since subtitles are mostly non-tilted text, tilted text is very likely not post-production overlay but scene text, as text 4 in fig. 3(1) shows, so part of the background text can be filtered out by the tilt check.
Presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
Specifically, text content that is too short is likely scene text, a false detection of the optical character recognition model, or text with no practical meaning, and should be removed.
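The four filtering rules of step 3 can be combined into one predicate, a minimal sketch rather than the patent's implementation. The threshold values are left as parameters since the patent does not fix them, and the dictionary fields (`duration`, `max_offset`, `vertices`, `text`) are assumed names.

```python
import math

def keep_box(box, *, max_duration_s, max_offset_px, max_tilt_deg, min_chars):
    """Filtering rules of step 3; a box failing any rule is deleted.
    `box` needs: duration, max_offset, vertices [(x, y) x 4, clockwise
    from top-left], text."""
    if box["duration"] > max_duration_s:    # persistent text: ads, watermarks
        return False
    if box["max_offset"] > max_offset_px:   # moving text: rolling captions, scene text
        return False
    (x0, y0), (x1, y1) = box["vertices"][0], box["vertices"][1]
    tilt = abs(math.degrees(math.atan2(y1 - y0, x1 - x0)))  # angle of the top edge
    if tilt > max_tilt_deg:                 # tilted text is rarely a subtitle
        return False
    if len(box["text"]) < min_chars:        # too short: noise or false detection
        return False
    return True
```

Applying `keep_box` with `filter` over the second text box set yields the filtered set that step 4 consumes.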
And 4, extracting the feature vector of each text box based on the time domain and space domain information of the text box for training a machine learning algorithm for the multidimensional features of the text boxes in the filtered second text box set.
As shown in fig. 6, the method for determining the feature vector of each text box in the filtered second text box set mainly includes:
the median duration of the individual characters in all text boxes in the second set of text boxes is calculated.
The absolute value of the difference between the single character duration of each text box and the median of the single character duration of the whole video is calculated as feature one.
Specifically, since every video has its own subtitle pace and presentation style, classification features extracted relative to the median adapt better to variations in video style; the later uses of the median are for the same reason.
The median is calculated using the character heights of all text boxes in the second set of text boxes, and normalized according to the video pixel height.
Specifically, since videos differ in size, the character height is normalised as the text box character height divided by the number of pixels of the video height, which keeps the feature scale the same across different videos.
And calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all the text boxes, and taking the absolute value as a second characteristic.
And calculating a thermodynamic diagram of a text region of the whole video according to the text box positions and the durations, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all the text boxes in the second text box set. As shown in fig. 7, the thermodynamic diagram clearly shows that the frequency difference of characters in the video occurs at each position, wherein the legends with the numbers (1) and (2) are the case of no subtitle in the video, and the legend with the number (3) is the case of subtitle in the video.
Specifically, the text region thermodynamic diagram is calculated as follows: a 100 x 100 two-dimensional matrix is initialized with all values set to 0; the text boxes in the second text box set are then traversed in turn, and a heat value is added at the position where each text box appears, the heat value being the duration of that text box divided by the total duration of all text boxes.
Specifically, the thermodynamic average of the region where a text box is located is the mean value of the corresponding region of the thermodynamic diagram, taken at the position where the text box appears.
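The two steps above can be sketched as follows; the `rect` key (box position as normalized `(x0, y0, x1, y1)` coordinates in `[0, 1]`) and `duration` key are assumptions for illustration:

```python
import numpy as np

GRID = 100  # the description initializes a 100 x 100 heat-map matrix

def build_heatmap(boxes, total_duration):
    # Each box adds its duration share as heat over the grid cells
    # covered by its normalized bounding box.
    heat = np.zeros((GRID, GRID))
    for b in boxes:
        x0, y0, x1, y1 = (int(v * GRID) for v in b["rect"])
        heat[y0:y1, x0:x1] += b["duration"] / total_duration
    return heat

def region_mean(heat, rect):
    # Thermodynamic average of one box: mean heat over its grid region.
    x0, y0, x1, y1 = (int(v * GRID) for v in rect)
    return float(heat[y0:y1, x0:x1].mean())
```

A fixed subtitle region accumulates heat from many boxes over the whole video, so subtitle boxes end up with thermodynamic averages close to the video-wide median while incidental text does not.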
The absolute value of the difference between the thermodynamic average of each text box and the median thermodynamic average of the whole video is calculated as feature three.
Features one, two and three are combined into a vector and used as the classification feature of each text box.
Step 5: the feature vectors of the second text box set are input into a machine-learning-based subtitle classification model, for training and for judging whether each text box is a subtitle.
As shown in fig. 8, the flow of constructing the machine learning subtitle classification model mainly comprises:
A plurality of videos are collected for training, and classification features are extracted from the text boxes in their second text box sets through the processing of steps 1-4; the three-dimensional feature vector of each text box serves as one training sample.
Specifically, videos without subtitles also contain a large amount of non-subtitle text, and therefore also need to participate in training.
Labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
Specifically, several machine learning models can be tried to compare training effects; here a model based on the AdaBoost algorithm is selected, and practice showed its effect to be optimal and clearly superior to classification models based on other algorithms.
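The patent specifies an AdaBoost-based classifier but not a particular implementation. To make the training loop concrete, below is a minimal from-scratch sketch of AdaBoost with decision stumps over the three-dimensional feature vectors (in practice a library implementation such as scikit-learn's `AdaBoostClassifier` would typically be used); labels are assumed to be +1 for subtitle and -1 for non-subtitle:

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    # Minimal AdaBoost: each weak learner is a decision stump that
    # thresholds one feature. y must be in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)          # sample weights
    model = []                        # list of (feature, threshold, sign, alpha)
    for _ in range(rounds):
        best = None
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                for s in (1, -1):
                    pred = np.where(X[:, f] < t, s, -s)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, s, pred)
        err, f, t, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        model.append((f, t, s, alpha))
        w *= np.exp(-alpha * y * pred)          # reweight samples
        w /= w.sum()
    return model

def predict(model, X):
    # Weighted vote of all stumps.
    score = sum(alpha * np.where(X[:, f] < t, s, -s)
                for f, t, s, alpha in model)
    return np.sign(score)
```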
Step 6: the text box set judged to be subtitles is set as a third text box set, and the third text box set is taken as the subtitle information of the video.
In particular, each entry in the third text box set should include the text content, vertex coordinates and duration list of the text box, which facilitates subsequent translation and video understanding work.
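One possible shape for an entry of the third text box set, reflecting the three fields named above (field names are illustrative, not specified by the patent):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SubtitleBox:
    # One entry of the third text box set: text content, vertex
    # coordinates, and a list of (start, end) appearance intervals.
    text: str
    vertices: List[Tuple[float, float]]                     # four (x, y) vertices
    durations: List[Tuple[float, float]] = field(default_factory=list)
```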
The beneficial effects of the scheme are: by detecting, recognizing, merging, filtering and classifying the text information in video frame images, and in particular through the methods of text merging and filtering, classification feature extraction and subtitle classification model construction, the scheme automatically detects and extracts the subtitle information in a video.
The above embodiments do not limit the present invention in any way. Based on the above description, those skilled in the art can make various changes and modifications without departing from the scope of the technical idea of the present invention, and all other improvements and applications made to the above embodiments by equivalent transformation fall within the protection scope of the present invention. The technical scope of the present invention is not limited to the content of the description and must be determined according to the scope of the claims.
Claims (7)
1. A subtitle extraction method based on video text merging, filtering and classifying is characterized by comprising the following steps:
step 1, extracting frames from a video at preset time intervals, performing text detection and recognition on frame images, including detecting texts in all frame images by using an optical character recognition technology to form text boxes, and counting multi-dimensional features according to text box information and video time axis information, wherein the multi-dimensional features comprise text content, vertex coordinates and appearance time of each text box to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging by using text box information of each text box in a single frame image and among a plurality of continuous frame images to obtain a second text box set, updating vertex coordinates of the merged text boxes, and generating a duration list according to occurrence time of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions by using the multidimensional features of the text boxes;
step 4, for the multidimensional features of the text boxes in the filtered second text box set, extracting the feature vector of each text box based on the time-domain and space-domain information of the text box, for training a machine learning algorithm;
step 5, inputting the feature vector of the second text box set into a subtitle classification model based on machine learning for training and judging whether the text box set is a subtitle;
step 6, setting the text box set judged as subtitles as a third text box set, and taking the third text box set as the subtitle information of the video.
2. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 1, wherein in the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, aiming at more than one text box with the same appearance time, sequentially performing text box operation from top to bottom according to the y-axis coordinates of each text box, calculating to obtain the midpoint coordinates of the corresponding height and width of each text box by using vertex coordinates, presetting a judging rule based on the midpoint coordinates of the height and width of the text box, and merging different text boxes conforming to the judging rule; updating the text content and the vertex coordinates of the combined new text box, and keeping the height of the original text box as the character height of the combined text;
and 2-2, calculating the text similarity and the text box intersection ratio of adjacent frames aiming at the text boxes of all frames in a single video, and merging the text boxes meeting a preset intersection ratio threshold.
3. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 2, wherein in the step 3, the method for filtering the second text box set comprises:
step 3-1, deleting the text boxes with the duration exceeding a preset duration threshold from the second text box set according to the duration list;
deleting the text boxes exceeding a preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates;
calculating the inclination angle of the text box according to the vertex coordinates of the text box, and deleting the text box with the inclination angle larger than a preset inclination threshold value from the second text box set;
presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
4. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 3, wherein in the step 4, determining the classification characteristic of each text box comprises:
calculating a median of durations of individual characters in all text boxes in the second set of text boxes;
calculating the absolute value of the difference between the single-character duration of each text box and the median of the single-character durations, as feature one;
calculating a median by using the character heights of all text boxes in the second text box set, and normalizing the median according to the video pixel heights;
calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all text boxes, as feature two;
calculating a thermodynamic diagram of a text region according to the coordinate position and duration of the text boxes, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all text boxes in the second text box set;
calculating the absolute value of the difference between the thermodynamic average value of each text box and the median thermodynamic average value of the whole video, as feature three;
and combining features one, two and three into the feature vector of each text box.
5. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 4, wherein in the step 5, the subtitle classifying model is based on an adaboost algorithm, and the method for constructing the subtitle classifying model comprises the following steps:
collecting a plurality of videos for training, and extracting classification features from text boxes in a second text box set through the processing of the steps 1-4, wherein the three-dimensional feature of each text box is used as a training sample;
labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
6. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 5, wherein in the step 2-1, the decision rule comprises:
calculating a first difference absolute value of the heights of the current text box and the text box to be combined, wherein the first difference absolute value is smaller than a preset first pixel value;
calculating a second difference absolute value of a maximum value of the y-axis coordinates of the current text box and a minimum coordinate value of the y-axis of the text box to be combined, wherein the second difference absolute value is smaller than a preset second pixel value;
and calculating a third difference absolute value of the midpoint coordinates of the widths of the current text box and the text box to be combined, wherein the third difference absolute value is smaller than a preset third pixel value.
7. The method for extracting subtitle based on video text merging, filtering and classifying as defined in claim 6, wherein in the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
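The decision rule of claims 6-7, with the concrete thresholds of 6, 20 and 80 pixels, can be sketched as follows; the box fields (`height`, `y_min`, `y_max`, and `x_mid`, the midpoint of the box width, all in pixels) are hypothetical names for the quantities the claims describe:

```python
H_THRESH = 6    # first pixel value: max character-height difference
Y_THRESH = 20   # second pixel value: max vertical gap between boxes
X_THRESH = 80   # third pixel value: max width-midpoint offset

def should_merge(cur, cand):
    # All three conditions of the decision rule must hold for the
    # current box and the candidate box to be merged.
    return (abs(cur["height"] - cand["height"]) < H_THRESH
            and abs(cur["y_max"] - cand["y_min"]) < Y_THRESH
            and abs(cur["x_mid"] - cand["x_mid"]) < X_THRESH)
```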
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310579487.2A CN116634223A (en) | 2023-05-22 | 2023-05-22 | Subtitle extraction method based on video text merging, filtering and classifying |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116634223A true CN116634223A (en) | 2023-08-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||