CN116634223A - Subtitle extraction method based on video text merging, filtering and classifying - Google Patents

Subtitle extraction method based on video text merging, filtering and classifying

Info

Publication number
CN116634223A
Authority
CN
China
Prior art keywords
text
text box
video
boxes
subtitle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310579487.2A
Other languages
Chinese (zh)
Inventor
贾馥玮
房鹏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd filed Critical Focus Technology Co Ltd
Priority to CN202310579487.2A priority Critical patent/CN116634223A/en
Publication of CN116634223A publication Critical patent/CN116634223A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/454Content or additional data filtering, e.g. blocking advertisements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/454Content or additional data filtering, e.g. blocking advertisements
    • H04N21/4545Input to filtering algorithms, e.g. filtering a region of the image
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4665Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8126Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a subtitle extraction method based on video text merging, filtering and classifying. The method comprises: extracting frames from a video and recognizing the text in all video frames with optical character recognition to obtain a set of text boxes in the video; merging and filtering the text box set according to text content, text box coordinates, text appearance time and other features; predicting whether each text box is a subtitle with a machine-learning subtitle classification model, and storing the text judged to be of the subtitle type, together with its position information, as the subtitle information of the video. By merging and filtering text boxes, the method preliminarily removes most text that does not belong to the subtitle type, and the machine-learning subtitle classification model then determines the type of the remaining text boxes. The method requires no predefined subtitle region and therefore handles the highly variable subtitle positions of present-day Internet video.

Description

Subtitle extraction method based on video text merging, filtering and classifying
Technical Field
The invention relates to the field of computer technology, in particular to computer vision and machine learning, and more particularly to a subtitle extraction method based on video text merging, filtering and classifying.
Background
With the rise of self-media and electronic commerce, the volume of short-video creation has increased dramatically. Extracting the subtitle information in a video allows it to be used together with large semantic-understanding models, adding a text dimension that assists video understanding on top of image- and frame-level understanding. The subtitle information can also serve many other scenarios, such as multi-language translation for cross-border distribution, or detecting videos without subtitles so that subtitles can be generated for them.
Subtitles come in different kinds, such as title subtitles, captions and dialogue subtitles. In existing subtitle extraction techniques, a common approach is to first define a subtitle region and then extract the characters in that specific region with optical character recognition; this approach suits videos whose subtitle positions follow certain rules, such as film and television programmes. There are also multi-modal subtitle extraction models based on speech recognition combined with optical character recognition. Such multi-modal models mainly extract dialogue subtitles; although they can effectively improve the accuracy of the subtitle text, they are hard to apply to subtitle types that have no accompanying speech.
Subtitle positions in short videos vary widely, and no general template can be used to extract the subtitle information, so how to extract multiple types of subtitles with one general method is a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, to resolve the conflict between specificity and generality in subtitle extraction for the short-video domain, and to provide a subtitle extraction method based on video text merging, filtering and classifying.
In order to solve the above technical problem, the invention provides a subtitle extraction method based on video text merging, filtering and classifying, characterized by comprising the following steps:
step 1, extracting frames from a video at a preset time interval and performing text detection and recognition on the frame images, including detecting the text in all frame images with an optical character recognition (OCR) technique to form text boxes, and computing multi-dimensional features from the text box information and the video time axis, the multi-dimensional features comprising the text content, vertex coordinates and appearance time of each text box, to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging using the text box information within each single frame image and across a plurality of consecutive frame images to obtain a second text box set, updating the vertex coordinates of the merged text boxes, and generating a duration list from the appearance times of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions using their multi-dimensional features;
step 4, for the multi-dimensional features of the text boxes in the filtered second text box set, extracting the feature vector of each text box from its temporal and spatial information, for training a machine learning algorithm;
step 5, inputting the feature vectors of the second text box set into a machine-learning-based subtitle classification model for training and for judging whether each text box is a subtitle;
and step 6, taking the text boxes judged to be subtitles as a third text box set, and taking the third text box set as the subtitle information of the video.
In the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, for text boxes with the same appearance time, processing the text boxes in order from top to bottom according to their y-axis coordinates, computing from the vertex coordinates the height of each text box and the midpoint coordinate of its width, presetting a decision rule based on the height and the width midpoint of the text boxes, and merging the different text boxes satisfying the decision rule; updating the text content and the vertex coordinates of the merged new text box, and keeping the height of the original text box as the character height of the merged text;
and step 2-2, for the text boxes of all frames in a single video, calculating the text similarity and the intersection-over-union (IoU) of text boxes in adjacent frames, and merging the text boxes meeting a preset IoU threshold.
In the step 3, the method for filtering the second text box set includes:
step 3-1, deleting from the second text box set, according to the duration list, the text boxes whose duration exceeds a preset duration threshold;
deleting from the second text box set, according to the maximum offset of the vertex coordinates, the text boxes whose offset exceeds a preset offset threshold;
calculating the tilt angle of each text box from its vertex coordinates, and deleting from the second text box set the text boxes whose tilt angle is larger than a preset tilt threshold;
and presetting a character-count threshold, and deleting from the second text box set the text boxes that do not meet the character-count threshold.
In the step 4, determining the classification characteristic of each text box includes:
calculating the median of the per-character durations of all text boxes in the second text box set;
calculating the absolute difference between the per-character duration of each text box and this median as feature one;
calculating the median of the character heights of all text boxes in the second text box set, normalized by the video pixel height;
calculating the absolute difference between the character height of each text box and the median character height of all text boxes as feature two;
calculating a heat map of the text regions from the coordinate positions and durations of the text boxes, obtaining from the heat map the mean heat value of the region where each text box is located, and calculating the median of the mean heat values of all text boxes in the second text box set;
calculating the absolute difference between the mean heat value of each text box and the median mean heat value of the whole video as feature three;
and combining feature one, feature two and feature three into the feature vector of each text box.
In the step 5, the subtitle classification model is based on the AdaBoost algorithm, and the method for constructing the subtitle classification model comprises:
collecting a plurality of videos for training and extracting, through the processing of steps 1-4, the classification features of the text boxes in the second text box set, the three-dimensional feature vector of each text box serving as one training sample;
labelling whether each training sample is a subtitle;
and inputting the training samples, taking the labels as the expected outputs, and training a subtitle classification model for judging the subtitles in a video.
In the step 2-1, the decision rule includes:
calculating the first absolute difference between the heights of the current text box and the text box to be merged, the first absolute difference being smaller than a preset first pixel value;
calculating the second absolute difference between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the text box to be merged, the second absolute difference being smaller than a preset second pixel value;
and calculating the third absolute difference between the width-midpoint coordinates of the current text box and the text box to be merged, the third absolute difference being smaller than a preset third pixel value.
In the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
The beneficial effect of this scheme is as follows: by detecting, recognizing, merging, filtering and classifying the text information in video frame images, in particular through the text merging and filtering methods, the classification-feature extraction and the construction of the subtitle classification model, the invention achieves automatic detection and extraction of the subtitle information in a video.
Drawings
FIG. 1 is a schematic flow diagram of a method of an exemplary embodiment of the present invention;
FIG. 2 is a schematic view of OCR recognition results according to an exemplary embodiment of the present invention;
FIG. 3 is a diagram of text merge results according to an exemplary embodiment of the present invention;
FIG. 4 is a flow chart of a text box merge method in accordance with an exemplary embodiment of the present invention;
FIG. 5 is a flow chart of a text box filtering method according to an exemplary embodiment of the invention;
FIG. 6 is a schematic diagram of a text box classification feature extraction flow in accordance with an exemplary embodiment of the invention;
FIG. 7 is a schematic heat map of text boxes according to an exemplary embodiment of the invention;
FIG. 8 is a schematic diagram of the machine learning subtitle classification model construction flow according to an exemplary embodiment of the present invention.
Detailed Description
The following is a detailed description of embodiments of the invention, examples of which are illustrated in the accompanying drawings; the embodiments described are illustrative only and do not limit the invention.
The overall flow of the method of the embodiment of the invention, shown in fig. 1, mainly comprises:
Step 1, determining a time interval, extracting frames from the video, and performing text detection and recognition. One frame per second may be extracted from the target video. For each frame image, the text in the image is detected using optical character recognition (OCR) to form text boxes. The text boxes are shown schematically in fig. 2: before text detection, the image contains four text blocks at different positions, text 1, text 2, text 3 and text 4. OCR first detects the position of each text block in the image and predicts a corresponding tilted or non-tilted detection box for each block; after text detection, text 1, text 2, text 3 and text 4 therefore appear in the image with text boxes, and the text box of text 4 is tilted.
Multi-dimensional features are then computed from the text box information and the video time axis, comprising the text content, vertex coordinates and appearance time of each text box, which yields the first text box set.
Specifically, the optical character recognition model used can recognize tilted or irregularly arranged text, but only a single line of text at a time; the recognition result comprises the text content and the four vertex coordinates of the (possibly non-axis-aligned) quadrilateral text box.
Specifically, the appearance time refers to the second of the video at which the text box appears.
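As a minimal sketch of step 1 (not the patented implementation itself), the following Python code extracts one frame per second with OpenCV and passes each frame to a hypothetical run_ocr callable that is assumed to return (text, four-vertex quad) pairs; the dictionary keys are illustrative.

    # Sketch of step 1: frame extraction and OCR.  run_ocr is a hypothetical placeholder
    # for any single-line OCR engine returning (text, [(x1,y1),...,(x4,y4)]) pairs.
    import cv2

    def extract_text_boxes(video_path, run_ocr, interval_s=1.0):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(1, int(round(fps * interval_s)))
        boxes, frame_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % step == 0:
                t = frame_idx / fps  # appearance time in seconds
                for text, quad in run_ocr(frame):
                    boxes.append({"text": text, "quad": quad, "time": t})
            frame_idx += 1
        cap.release()
        return boxes  # the "first text box set"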
Step 2, for the text boxes in the first text box set, text merging is performed using the text box information within each single video frame and across a plurality of consecutive frames to obtain a second text box set; the vertex coordinates of the merged text boxes are updated, and a duration list is generated from the appearance times of the merged text boxes.
The text merging result is shown in fig. 3: in fig. 3 (1), text 2 and text 3, which are close in position and identical in size, are merged into the same text box.
The text box merging method is shown in fig. 4. The merging method for obtaining the second text box set mainly comprises the following steps:
For text boxes with the same appearance time, the text boxes are processed in order from top to bottom according to their y-axis coordinates; the height of each text box and the midpoint coordinate of its width are computed from the vertex coordinates; a decision rule based on the height and the width midpoint of the text boxes is preset, and the different text boxes satisfying the decision rule are merged; the text content and the vertex coordinates of the merged new text box are updated, and the height of the original text box is kept as the character height of the merged text.
The decision rule requires that all of the following conditions are satisfied simultaneously:
the first absolute difference, between the heights of the current text box and the text box to be merged, is smaller than a preset first pixel value of 6 pixels; this ensures that the merged text boxes have fonts of similar size;
the second absolute difference, between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the text box to be merged, is smaller than a preset second pixel value of 20 pixels; this ensures that the merged text boxes are close to each other in the y-axis direction;
the third absolute difference, between the width-midpoint coordinates of the current text box and the text box to be merged, is smaller than a preset third pixel value of 80 pixels; this ensures that the merged text boxes are approximately centre-aligned horizontally (a code sketch of these three rules is given after the notes below).
In fig. 3 (2), only text 1, text 2 and text 3 satisfy all three rules at the same time and may be merged into one text box, whereas text 4 satisfies only the first two rules and cannot take part in the merge.
Specifically, because the optical character recognition model can only recognize a single line of text, the text boxes within each video frame are merged first, so that the text boxes belonging to the same piece of text content in each second are combined.
Specifically, text boxes belonging to the same text content in a given second generally share characteristics such as the same font size, vertically adjacent positions in the picture and roughly the same horizontal centre line, so text merging can be performed according to the height, the y-axis coordinates and the width midpoints of the text boxes respectively.
Specifically, when the text contents are merged, they are concatenated in order from top to bottom according to the y-axis coordinates of the text boxes, which preserves the correct reading order of the text.
Specifically, when text boxes are merged, the four vertex coordinates of the merged text box are computed and its vertex coordinates are updated accordingly.
Specifically, when merging text boxes, the height of the original single-line text box should be kept; it represents the character height of the merged text and is one of the multi-dimensional features of the merged text box.
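The following is a minimal sketch of the intra-frame decision rule described above, assuming each text box is stored as a dictionary carrying its quad of four pixel vertices; the helper names (box_height, width_midpoint, should_merge) are illustrative and not part of the disclosure.

    # Sketch of the step 2-1 decision rule (thresholds: 6 px, 20 px, 80 px).
    def box_height(b):
        ys = [p[1] for p in b["quad"]]
        return max(ys) - min(ys)

    def width_midpoint(b):
        xs = [p[0] for p in b["quad"]]
        return (max(xs) + min(xs)) / 2.0

    def should_merge(current, candidate, h_px=6, y_px=20, x_px=80):
        # Rule 1: similar font size (similar box heights).
        if abs(box_height(current) - box_height(candidate)) >= h_px:
            return False
        # Rule 2: vertically adjacent (bottom of current close to top of candidate).
        cur_y_max = max(p[1] for p in current["quad"])
        cand_y_min = min(p[1] for p in candidate["quad"])
        if abs(cur_y_max - cand_y_min) >= y_px:
            return False
        # Rule 3: roughly centre-aligned horizontally.
        return abs(width_midpoint(current) - width_midpoint(candidate)) < x_px

Boxes that pass should_merge are concatenated top to bottom, and the merged box keeps the single-line height as its character height, as described above.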
For the text boxes of all frames in a single video, the text similarity and the intersection-over-union (IoU) of text boxes in adjacent frames are calculated, and the text boxes meeting a preset IoU threshold are merged.
Specifically, the text boxes of all frames in a single video are merged so that text boxes with the same content are combined, and the duration and position change of each text box are recorded.
Specifically, since the optical character recognition result has a certain error rate, an appropriate text-similarity measure and threshold should be chosen.
Specifically, since similar text content may appear in different subtitles within one video, text boxes are only merged across adjacent frames, so the values in their duration list should be consecutive.
Specifically, a video picture may contain rolling captions as well as text that was not added in post-production, and the positions of these text boxes change over time; a suitable IoU threshold should therefore be set, and the initial text box position and the offset of the final text box position should be recorded.
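A sketch of the cross-frame merge criterion, assuming axis-aligned bounding boxes derived from the quads and a simple sequence-ratio text similarity; the similarity and IoU thresholds shown are illustrative, since the disclosure only states that suitable thresholds should be chosen.

    # Sketch of step 2-2: merge text boxes of adjacent frames by text similarity and IoU.
    from difflib import SequenceMatcher

    def aabb(b):  # axis-aligned bounding box of a quad: (x_min, y_min, x_max, y_max)
        xs = [p[0] for p in b["quad"]]; ys = [p[1] for p in b["quad"]]
        return min(xs), min(ys), max(xs), max(ys)

    def iou(a, b):
        ax0, ay0, ax1, ay1 = aabb(a); bx0, by0, bx1, by1 = aabb(b)
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0

    def same_subtitle(prev, cur, sim_thr=0.8, iou_thr=0.5):
        # Text similarity tolerates OCR errors; IoU tolerates small position drift.
        sim = SequenceMatcher(None, prev["text"], cur["text"]).ratio()
        return sim >= sim_thr and iou(prev, cur) >= iou_thr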
Step 3, for the text boxes in the second text box set, the text boxes are filtered according to preset conditions using their multi-dimensional features.
The text box filtering method is shown in fig. 5. The filtering method mainly comprises the following steps:
According to the duration list, the text boxes whose duration exceeds the preset duration threshold are deleted from the second text box set.
Specifically, text that stays on screen for a long time is usually an advertisement, a watermark or similar, so text boxes with an excessively long duration can be removed via the duration list of the video.
According to the maximum offset of the vertex coordinates, the text boxes whose offset exceeds the preset offset threshold are deleted from the second text box set.
The tilt angle of each text box is calculated from its vertex coordinates, and the text boxes whose tilt angle is larger than the preset tilt threshold are deleted from the second text box set.
Specifically, subtitles in a video are mostly non-tilted text; if tilted text appears, it is very likely text that was not added in post-production, such as scene text in the original footage, as shown by text 4 in fig. 3 (1), so part of the background text can be filtered out by the tilt check.
A character-count threshold is preset, and the text boxes that do not meet the character-count threshold are deleted from the second text box set.
Specifically, if the detected text content is too short, it is likely scene text that was not added in post-production, a spurious detection by the optical character recognition model, or text with no practical meaning, and it should be removed.
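A compact sketch of the step-3 filters; all threshold values and dictionary keys (durations, max_offset) are illustrative assumptions, since the disclosure leaves the thresholds to be preset, and the quad is assumed to start with its top edge.

    # Sketch of step 3: filter the second text box set by duration, offset, tilt and length.
    import math

    def tilt_angle_deg(b):
        (x1, y1), (x2, y2) = b["quad"][0], b["quad"][1]  # assumed top edge of the quad
        ang = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        return min(ang, 180.0 - ang)  # tolerate reversed vertex order

    def filter_boxes(boxes, max_duration_s=60.0, max_offset_px=50.0,
                     max_tilt_deg=10.0, min_chars=2):
        kept = []
        for b in boxes:
            if sum(b["durations"]) > max_duration_s:       # resident text: ads, watermarks
                continue
            if b.get("max_offset", 0.0) > max_offset_px:   # rolling or moving text
                continue
            if tilt_angle_deg(b) > max_tilt_deg:           # tilted scene text
                continue
            if len(b["text"]) < min_chars:                 # too short to be a subtitle
                continue
            kept.append(b)
        return kept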
Step 4, for the multi-dimensional features of the text boxes in the filtered second text box set, the feature vector of each text box is extracted from its temporal and spatial information, for training a machine learning algorithm.
As shown in fig. 6, the method for determining the feature vector of each text box in the filtered second text box set mainly comprises:
The median of the per-character durations of all text boxes in the second text box set is calculated.
The absolute difference between the per-character duration of each text box and the per-character duration median of the whole video is calculated as feature one.
Specifically, since every video has its own subtitle pace and presentation style, text box classification features computed relative to the median adapt better to variations in video style; the same reasoning applies to the other median-based features below.
The median of the character heights of all text boxes in the second text box set is calculated and normalized by the video pixel height.
Specifically, since videos come in different sizes, the character height of a text box can be normalized as the character height divided by the number of pixels of the video height, which keeps the feature scale the same across videos.
The absolute difference between the character height of each text box and the median character height of all text boxes is calculated as feature two.
A heat map of the text regions of the whole video is calculated from the text box positions and durations; the mean heat value of the region where each text box is located is obtained from the heat map, and the median of the mean heat values of all text boxes in the second text box set is calculated. As shown in fig. 7, the heat map clearly shows how often characters appear at each position of the video; the legends numbered (1) and (2) correspond to videos without subtitles, and the legend numbered (3) to a video with subtitles.
Specifically, the heat map of the text regions is calculated by initializing a 100 x 100 two-dimensional heat map matrix with all values set to 0, then traversing the text boxes in the second text box set in turn and adding a heat value at the position where each text box appears; the heat value is the duration of the text box divided by the total duration of all text boxes.
Specifically, the mean heat value of the region where a text box is located is the mean of the corresponding heat map region, determined from the position where the text box appears.
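A sketch of the heat map construction and the per-box mean heat value, following the 100 x 100 matrix and the duration-normalised heat values described above; the box dictionary keys are illustrative assumptions.

    # Sketch of the 100 x 100 text-region heat map and the mean heat value per box.
    import numpy as np

    def region_slices(b, video_w, video_h, size):
        xs = [p[0] for p in b["quad"]]; ys = [p[1] for p in b["quad"]]
        c0 = int(min(xs) / video_w * size)
        c1 = max(c0 + 1, int(np.ceil(max(xs) / video_w * size)))
        r0 = int(min(ys) / video_h * size)
        r1 = max(r0 + 1, int(np.ceil(max(ys) / video_h * size)))
        return slice(r0, r1), slice(c0, c1)

    def build_heat_map(boxes, video_w, video_h, size=100):
        heat = np.zeros((size, size), dtype=float)
        total = sum(sum(b["durations"]) for b in boxes) or 1.0
        for b in boxes:
            rs, cs = region_slices(b, video_w, video_h, size)
            heat[rs, cs] += sum(b["durations"]) / total  # box duration / total duration
        return heat

    def mean_heat(box, heat, video_w, video_h):
        rs, cs = region_slices(box, video_w, video_h, heat.shape[0])
        return float(heat[rs, cs].mean())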
The absolute difference between the mean heat value of each text box and the median mean heat value of the whole video is calculated as feature three.
Feature one, feature two and feature three are combined into a vector and used as the classification feature of each text box.
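Putting the three features together, the following is a sketch of the per-box feature vector; the per-character durations, normalised character heights and mean heat values are assumed to be precomputed as described above.

    # Sketch of step 4: assemble the three-dimensional feature vector of each text box.
    # char_durations: box duration divided by character count, per box;
    # char_heights:   character height divided by the video pixel height, per box;
    # mean_heats:     mean heat value of each box's region in the heat map.
    import numpy as np

    def feature_vectors(char_durations, char_heights, mean_heats):
        med_d = np.median(char_durations)
        med_h = np.median(char_heights)
        med_t = np.median(mean_heats)
        return np.array([[abs(d - med_d),    # feature one
                          abs(h - med_h),    # feature two
                          abs(t - med_t)]    # feature three
                         for d, h, t in zip(char_durations, char_heights, mean_heats)])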
Step 5, the feature vectors of the second text box set are input into a machine-learning-based subtitle classification model for training and for judging whether each text box is a subtitle.
Fig. 8 shows the construction flow of the machine learning subtitle classification model, which mainly comprises:
A plurality of videos are collected for training, and the classification features of the text boxes in their second text box sets are extracted through the processing of steps 1-4; the three-dimensional feature vector of each text box serves as one training sample.
In particular, videos without subtitles also contain a relatively large amount of text that was not added in post-production, so they also need to participate in training.
Each training sample is labelled as subtitle or non-subtitle.
The training samples are input, the labels are taken as the expected outputs, and a subtitle classification model for judging the subtitles in a video is trained.
Specifically, several machine learning models can be tried and their training results compared; a model based on the AdaBoost algorithm was selected for training, and practice showed that its combined effect is the best and clearly superior to classification models based on the other algorithms.
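A sketch of the training and inference step using scikit-learn's AdaBoostClassifier as one possible AdaBoost implementation; the hyper-parameters shown are illustrative and not taken from the disclosure.

    # Sketch of step 5: train an AdaBoost subtitle classifier on the 3-D features.
    # X: (n_boxes, 3) feature matrix from the sketch above; y: 1 = subtitle, 0 = not.
    from sklearn.ensemble import AdaBoostClassifier

    def train_subtitle_classifier(X, y):
        clf = AdaBoostClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)
        return clf

    # Inference: keep only the boxes predicted to be subtitles (the third text box set).
    def predict_subtitles(clf, boxes, X):
        return [b for b, is_sub in zip(boxes, clf.predict(X)) if is_sub == 1]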
Step 6, the text boxes judged to be subtitles are taken as a third text box set, and the third text box set is taken as the subtitle information of the video.
In particular, the third text box set should include the text content, vertex coordinates and duration list of each text box, which supports subsequent translation and video-understanding work.
The beneficial effect of this scheme is as follows: by detecting, recognizing, merging, filtering and classifying the text information in video frame images, in particular through the text merging and filtering methods, the classification-feature extraction and the construction of the subtitle classification model, the invention achieves automatic detection and extraction of the subtitle information in a video.
The above embodiments do not limit the invention in any way. Based on the above description, a person skilled in the art can make various changes and modifications without departing from the technical idea of the invention; all other improvements and equivalent transformations of the above embodiments fall within the protection scope of the invention, and the technical scope of the invention is not limited to the content of the description but must be determined according to the scope of the claims.

Claims (7)

1. A subtitle extraction method based on video text merging, filtering and classifying is characterized by comprising the following steps:
step 1, extracting frames from a video at a preset time interval and performing text detection and recognition on the frame images, including detecting the text in all frame images with an optical character recognition technique to form text boxes, and computing multi-dimensional features from the text box information and the video time axis, the multi-dimensional features comprising the text content, vertex coordinates and appearance time of each text box, to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging using the text box information within each single frame image and across a plurality of consecutive frame images to obtain a second text box set, updating the vertex coordinates of the merged text boxes, and generating a duration list from the appearance times of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions using their multi-dimensional features;
step 4, for the multi-dimensional features of the text boxes in the filtered second text box set, extracting the feature vector of each text box from its temporal and spatial information, for training a machine learning algorithm;
step 5, inputting the feature vectors of the second text box set into a machine-learning-based subtitle classification model for training and for judging whether each text box is a subtitle;
and step 6, taking the text boxes judged to be subtitles as a third text box set, and taking the third text box set as the subtitle information of the video.
2. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 1, wherein in the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, for text boxes with the same appearance time, processing the text boxes in order from top to bottom according to their y-axis coordinates, computing from the vertex coordinates the height of each text box and the midpoint coordinate of its width, presetting a decision rule based on the height and the width midpoint of the text boxes, and merging the different text boxes satisfying the decision rule; updating the text content and the vertex coordinates of the merged new text box, and keeping the height of the original text box as the character height of the merged text;
and step 2-2, for the text boxes of all frames in a single video, calculating the text similarity and the intersection-over-union (IoU) of text boxes in adjacent frames, and merging the text boxes meeting a preset IoU threshold.
3. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 2, wherein in the step 3, the method for filtering the second text box set comprises:
step 3-1, deleting from the second text box set, according to the duration list, the text boxes whose duration exceeds a preset duration threshold;
deleting from the second text box set, according to the maximum offset of the vertex coordinates, the text boxes whose offset exceeds a preset offset threshold;
calculating the tilt angle of each text box from its vertex coordinates, and deleting from the second text box set the text boxes whose tilt angle is larger than a preset tilt threshold;
and presetting a character-count threshold, and deleting from the second text box set the text boxes that do not meet the character-count threshold.
4. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 3, wherein in the step 4, determining the classification characteristic of each text box comprises:
calculating the median of the per-character durations of all text boxes in the second text box set;
calculating the absolute difference between the per-character duration of each text box and this median as feature one;
calculating the median of the character heights of all text boxes in the second text box set, normalized by the video pixel height;
calculating the absolute difference between the character height of each text box and the median character height of all text boxes as feature two;
calculating a heat map of the text regions from the coordinate positions and durations of the text boxes, obtaining from the heat map the mean heat value of the region where each text box is located, and calculating the median of the mean heat values of all text boxes in the second text box set;
calculating the absolute difference between the mean heat value of each text box and the median mean heat value of the whole video as feature three;
and combining feature one, feature two and feature three into the feature vector of each text box.
5. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 4, wherein in the step 5, the subtitle classification model is based on the AdaBoost algorithm, and the method for constructing the subtitle classification model comprises the following steps:
collecting a plurality of videos for training and extracting, through the processing of steps 1-4, the classification features of the text boxes in the second text box set, the three-dimensional feature vector of each text box serving as one training sample;
labelling whether each training sample is a subtitle;
and inputting the training samples, taking the labels as the expected outputs, and training a subtitle classification model for judging the subtitles in a video.
6. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 5, wherein in the step 2-1, the decision rule comprises:
calculating the first absolute difference between the heights of the current text box and the text box to be merged, the first absolute difference being smaller than a preset first pixel value;
calculating the second absolute difference between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the text box to be merged, the second absolute difference being smaller than a preset second pixel value;
and calculating the third absolute difference between the width-midpoint coordinates of the current text box and the text box to be merged, the third absolute difference being smaller than a preset third pixel value.
7. The method for extracting subtitle based on video text merging, filtering and classifying as defined in claim 6, wherein in the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
CN202310579487.2A 2023-05-22 2023-05-22 Subtitle extraction method based on video text merging, filtering and classifying Pending CN116634223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310579487.2A CN116634223A (en) 2023-05-22 2023-05-22 Subtitle extraction method based on video text merging, filtering and classifying

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310579487.2A CN116634223A (en) 2023-05-22 2023-05-22 Subtitle extraction method based on video text merging, filtering and classifying

Publications (1)

Publication Number Publication Date
CN116634223A true CN116634223A (en) 2023-08-22

Family

ID=87591361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310579487.2A Pending CN116634223A (en) 2023-05-22 2023-05-22 Subtitle extraction method based on video text merging, filtering and classifying

Country Status (1)

Country Link
CN (1) CN116634223A (en)

Similar Documents

Publication Publication Date Title
US6731788B1 (en) Symbol Classification with shape features applied to neural network
US6614930B1 (en) Video stream classifiable symbol isolation method and system
JP2940936B2 (en) Tablespace identification method
CN101453575B (en) Video subtitle information extracting method
EP2034426A1 (en) Moving image analyzing, method and system
Yang et al. Lecture video indexing and analysis using video ocr technology
CN104298982A (en) Text recognition method and device
JP2006067585A (en) Method and apparatus for specifying position of caption in digital image and extracting thereof
CN110196917B (en) Personalized LOGO format customization method, system and storage medium
Chen et al. Text area detection from video frames
CN111626145A (en) Simple and effective incomplete form identification and page-crossing splicing method
CN114495141A (en) Document paragraph position extraction method, electronic equipment and storage medium
CN107368826A (en) Method and apparatus for text detection
Gui et al. A fast caption detection method for low quality video images
CN110378337B (en) Visual input method and system for drawing identification information of metal cutting tool
CN116634223A (en) Subtitle extraction method based on video text merging, filtering and classifying
JP3544324B2 (en) CHARACTER STRING INFORMATION EXTRACTION DEVICE AND METHOD, AND RECORDING MEDIUM CONTAINING THE METHOD
KR19990047501A (en) How to extract and recognize news video subtitles
Vu et al. Automatic extraction of text regions from document images by multilevel thresholding and k-means clustering
Chen et al. Video-text extraction and recognition
CN113888758B (en) Curved character recognition method and system based on complex scene
Al-Asadi et al. Arabic-text extraction from video images
CN113449713B (en) Method and device for cleaning training data of face detection model
Darahan et al. Real-Time Page Extraction for Document Digitization
Dayananda et al. A Comprehensive Study on Text Detection in Images and Videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination