CN116634223A - Subtitle extraction method based on video text merging, filtering and classifying - Google Patents
- Publication number
- CN116634223A CN116634223A CN202310579487.2A CN202310579487A CN116634223A CN 116634223 A CN116634223 A CN 116634223A CN 202310579487 A CN202310579487 A CN 202310579487A CN 116634223 A CN116634223 A CN 116634223A
- Authority
- CN
- China
- Prior art keywords
- text
- text box
- video
- boxes
- subtitle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/454—Content or additional data filtering, e.g. blocking advertisements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/454—Content or additional data filtering, e.g. blocking advertisements
- H04N21/4545—Input to filtering algorithms, e.g. filtering a region of the image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4665—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8126—Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a subtitle extraction method based on merging, filtering and classifying video text. The method extracts frames from a video and recognises the text in every frame with optical character recognition (OCR) to obtain a set of text boxes in the video; the text boxes are then merged and filtered according to text content, text box coordinates, appearance time and similar features; finally, a machine-learning subtitle classification model predicts whether each text box is a subtitle, and the text judged to be of subtitle type, together with its position information, is stored as the subtitle information of the video. Merging and filtering the text boxes preliminarily removes most text that does not belong to the subtitle type, and the machine-learning subtitle classification model further determines the type of each text box. The method needs no predefined subtitle region and can therefore handle the highly variable subtitle positions of present-day Internet videos.
Description
Technical Field
The invention relates to the field of computer technology, in particular to computer vision and machine learning, and specifically to a subtitle extraction method based on merging, filtering and classifying video text.
Background
With the growth of self-media and electronic commerce, the volume of short-video creation has increased dramatically. Extracting the subtitle information from a video allows it to be combined with large semantic-understanding models, adding a text dimension that assists video understanding on top of image and picture understanding. The subtitle information also serves many other scenarios, such as multi-language translation for cross-border distribution, or recognising that a video lacks subtitles so that subtitles can be generated for it.
Subtitles come in different kinds, such as title subtitles, captions and dialogue subtitles. In existing subtitle extraction technology, the common method is to define a subtitle region first and then extract the characters inside that region with optical character recognition; this approach suits videos such as film and television programmes whose subtitle positions follow certain rules. There are also multi-modal subtitle extraction models based on speech recognition combined with optical character recognition. These mainly extract dialogue subtitles and, although they can effectively increase the accuracy of the subtitle text, they are difficult to apply to subtitle types that have no accompanying speech.
Subtitle positions in short videos vary widely, so no fixed template can extract the subtitle information; how to extract multiple types of subtitles with one general method is therefore a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, to solve the problems of pertinence and universality that general subtitle extraction methods face in the short-video field, and to provide a subtitle extraction method based on merging, filtering and classifying video text.
To solve these technical problems, the invention provides a subtitle extraction method based on video text merging, filtering and classifying, characterised by comprising the following steps:
step 1, extracting frames from the video at preset time intervals and performing text detection and recognition on the frame images: detecting the text in all frame images with optical character recognition (OCR) technology to form text boxes, then deriving multi-dimensional features from the text box information and the video time axis, the features including the text content, vertex coordinates and appearance time of each text box, to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging using the text box information within each single frame image and across a plurality of continuous frame images to obtain a second text box set, updating the vertex coordinates of the merged text boxes, and generating a duration list from the occurrence times of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions using their multidimensional features;
step 4, for the multidimensional features of the text boxes in the filtered second text box set, extracting a feature vector for each text box from its time-domain and space-domain information, for training a machine learning algorithm;
step 5, inputting the feature vectors of the second text box set into a machine-learning subtitle classification model for training and for judging whether each text box is a subtitle;
and step 6, taking the text boxes judged to be subtitles as a third text box set, and setting the third text box set as the subtitle information of the video.
In the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, for more than one text box with the same appearance time, processing the text boxes from top to bottom in order of their y-axis coordinates; computing from the vertex coordinates the midpoint coordinates of each text box's height and width; applying a preset decision rule based on these midpoint coordinates and merging the different text boxes that satisfy it; updating the text content and vertex coordinates of the merged new text box, and keeping the height of the original text box as the character height of the merged text;
and step 2-2, for the text boxes of all frames in a single video, computing the text similarity and text box intersection-over-union of adjacent frames, and merging the text boxes that satisfy a preset intersection-over-union threshold.
In the step 3, the method for filtering the second text box set includes:
step 3-1, deleting the text boxes with the duration exceeding a preset duration threshold from the second text box set according to the duration list;
deleting the text boxes exceeding a preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates;
calculating the inclination angle of the text box according to the vertex coordinates of the text box, and deleting the text box with the inclination angle larger than a preset inclination threshold value from the second text box set;
presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
In the step 4, determining the classification characteristic of each text box includes:
calculating a median of durations of individual characters in all text boxes in the second set of text boxes;
calculating, for each text box, the absolute value of the difference between its single-character duration and that median as feature one;
calculating a median by using the character heights of all text boxes in the second text box set, and normalizing the median according to the video pixel heights;
calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all the text boxes to obtain a second characteristic;
calculating a thermodynamic diagram of a text region according to the coordinate position and duration of the text boxes, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all text boxes in the second text box set;
calculating the absolute value of the difference between the thermodynamic average value of each text box and the thermodynamic average value median of the whole video, and taking the absolute value as a third characteristic;
and combining the first feature, the second feature and the third feature into a feature vector of each text box.
In the step 5, the caption classification model is based on an adaboost algorithm, and the construction method of the caption classification model comprises the following steps:
collecting a plurality of videos for training and, after the processing of steps 1-4, extracting classification features from the text boxes in the second text box set, the three-dimensional feature vector of each text box serving as one training sample;
labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
In the step 2-1, the decision rule includes:
calculating a first difference absolute value of the heights of the current text box and the text box to be combined, wherein the first difference absolute value is smaller than a preset first pixel value;
calculating a second difference absolute value of a maximum value of the y-axis coordinates of the current text box and a minimum coordinate value of the y-axis of the text box to be combined, wherein the second difference absolute value is smaller than a preset second pixel value;
and calculating a third difference absolute value of the midpoint coordinates of the widths of the current text box and the text box to be combined, wherein the third difference absolute value is smaller than a preset third pixel value.
In the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
The beneficial effects of this scheme: the method detects, recognises, merges, filters and classifies the text information in video frame images, in particular through text merging and filtering, classification feature extraction and construction of a subtitle classification model, and thereby achieves automatic detection and extraction of the subtitle information in videos.
Drawings
FIG. 1 is a schematic flow diagram of a method of an exemplary embodiment of the present invention;
FIG. 2 is a schematic view of OCR recognition results according to an exemplary embodiment of the present invention;
FIG. 3 is a diagram of text merge results according to an exemplary embodiment of the present invention;
FIG. 4 is a flow chart of a text box merge method in accordance with an exemplary embodiment of the present invention;
FIG. 5 is a flow chart of a text box filtering method according to an exemplary embodiment of the invention;
FIG. 6 is a schematic diagram of a text box classification feature extraction flow in accordance with an exemplary embodiment of the invention;
FIG. 7 is a thermal schematic diagram of a text box according to an exemplary embodiment of the invention;
fig. 8 is a schematic diagram of a machine learning subtitle classification model construction flow according to an exemplary embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention, examples of which are illustrated in the accompanying drawings, wherein the embodiments described are by way of illustration only and not by way of limitation.
The general method flow diagram of the embodiment of the invention shown in fig. 1 mainly comprises:
and step 1, determining a time interval, extracting frames from the video, and performing text detection and recognition. One frame of image per second may be extracted from the target video, and for each frame of image, text within all video frame images is detected using Optical Character Recognition (OCR) techniques to form a text box, the text box being schematically shown in fig. 2, comprising four differently located text blocks of text 1, text 2, text 3 and text 4 prior to text detection using OCR techniques. OCR techniques first detect the position of each text block in an image and predict a corresponding tilted or non-tilted detection box for each text block. Thus, after text detection, text 1, text 2, text 3 and text 4 with text boxes appear in the image, wherein the text boxes of text 4 are tilted.
And then counting multi-dimensional characteristics according to the text box information and the video time axis information, wherein the multi-dimensional characteristics comprise text content, vertex coordinates and appearance time of each text box, and the multi-dimensional characteristics are used for obtaining a first text box set.
Specifically, the optical character recognition model used can recognize oblique text or irregularly arranged text, and can recognize only a single line of text, and the recognition result includes text content and four vertex coordinates of an irregularly rectangular text box.
Specifically, the appearance time refers to how many seconds this text box appears in the video.
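The frame-sampling side of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `TextBox` is a hypothetical record for one OCR result (the OCR engine itself is not shown), and `sample_frame_indices` computes which frames to extract at a one-second interval.

```python
from dataclasses import dataclass, field

@dataclass
class TextBox:
    """One OCR result: text content, four vertex coordinates, appearance times."""
    text: str
    vertices: list                               # [(x0, y0), (x1, y1), (x2, y2), (x3, y3)]
    seconds: list = field(default_factory=list)  # seconds at which this box appears

def sample_frame_indices(fps: float, total_frames: int, interval_s: float = 1.0):
    """Indices of the frames to extract, one per `interval_s` seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))
```

For a 10-second clip at 30 fps this yields ten frame indices, 0, 30, ..., 270, matching the one-frame-per-second sampling described above.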
And 2, for the text boxes in the first text box set, performing text merging by using the text box information of each text box in a single video frame and among a plurality of continuous frames to obtain a second text box set, updating vertex coordinates of the merged text boxes, and generating a duration list according to the occurrence time of the merged text boxes.
The result of text merging is shown in fig. 3: in fig. 3(1), texts 2 and 3, which have similar positions and the same size, are merged into the same text box.
The text box merging method is shown in fig. 4, and the merging method for obtaining the second text box set mainly comprises the following steps:
For more than one text box with the same appearance time, the text boxes are processed from top to bottom in order of their y-axis coordinates; the midpoint coordinates of each text box's height and width are computed from its vertex coordinates; a preset decision rule based on these midpoint coordinates is applied, and the different text boxes that satisfy it are merged. The text content and vertex coordinates of the merged new text box are updated, and the height of the original text box is kept as the character height of the merged text.
The decision rule requires that all of the following hold simultaneously:
the absolute value of the difference between the heights of the current text box and the candidate text box is smaller than a preset first pixel value of 6 pixels, which ensures that the merged text boxes have similar font sizes;
the absolute value of the difference between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the candidate text box is smaller than a preset second pixel value of 20 pixels, which ensures that the merged boxes are close in the y-axis direction;
the absolute value of the difference between the width midpoints of the current text box and the candidate text box is smaller than a preset third pixel value of 80 pixels, which ensures that the merged boxes are roughly aligned horizontally.
In fig. 3(2), only texts 1, 2 and 3 satisfy all three rules at the same time and may be merged into one text box; text 4 satisfies only the first two rules, so it cannot take part in the merging.
Specifically, since the optical character recognition model can only recognise a single line of text, the text boxes within each video frame are merged first, so that the boxes belonging to the same text content in each second are combined.
Specifically, text boxes belonging to the same text content within a second generally share the same font size, sit at adjacent heights in the picture, and are roughly centre-aligned, so text merging can be performed according to the height, the y-axis coordinates and the width midline of the text boxes respectively.
Specifically, when the text contents are combined, the text contents are combined in sequence from top to bottom according to the y-axis coordinates of the text boxes, so that the correct reading sequence of the text contents can be maintained.
Specifically, when the text boxes are merged, four-point coordinates of the merged text boxes are counted, and vertex coordinates of the merged text boxes are updated.
Specifically, when merging text boxes, the height of the original single-line text box should be kept: it represents the character height of the merged text and is one of the multidimensional features of the merged text box.
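The three-condition decision rule of step 2-1 can be sketched as a single predicate. The 6/20/80-pixel thresholds are the patent's own; the dictionary representation of a box (keys `height`, `y_min`, `y_max`, `x_mid`) is an assumption for illustration.

```python
# Thresholds from the patent: 6, 20 and 80 pixels.
H_THRESH, Y_THRESH, X_THRESH = 6, 20, 80

def can_merge(cur: dict, cand: dict) -> bool:
    """Decision rule of step 2-1: all three conditions must hold simultaneously.
    `cur` sits above `cand` (boxes are visited top to bottom by y coordinate).
    Each box is a dict with keys: height, y_min, y_max, x_mid (width midpoint)."""
    similar_font = abs(cur["height"] - cand["height"]) < H_THRESH  # similar font size
    adjacent_y = abs(cur["y_max"] - cand["y_min"]) < Y_THRESH      # vertically adjacent
    aligned_x = abs(cur["x_mid"] - cand["x_mid"]) < X_THRESH       # roughly centre-aligned
    return similar_font and adjacent_y and aligned_x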
For the text boxes of all frames in a single video, the text similarity and the text box intersection-over-union of adjacent frames are computed, and text boxes meeting a preset intersection-over-union threshold are merged.
Specifically, merging the text boxes of all frames in a single video combines boxes with the same content, so that the duration and position change of each text box can be counted.
Specifically, since the optical character recognition result has a certain error rate, an appropriate text similarity measure and threshold should be selected.
In particular, since different subtitles within a video may have similar text content, text boxes are only merged across adjacent frames, so the values in a box's duration list should be consecutive.
Specifically, a video picture may contain rolling captions, or scene text that was not added in post-production; the positions of these text boxes change over time, so a suitable intersection-over-union threshold should be set, and the offset between the initial and final text box positions recorded.
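The cross-frame merge test of step 2-2 can be sketched as below. The patent does not name a similarity measure or thresholds, so the `SequenceMatcher` ratio and the 0.8/0.5 thresholds here are assumptions; the axis-aligned `(x0, y0, x1, y1)` box format is likewise for illustration only.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # OCR output is noisy, so fuzzy matching is used instead of exact equality.
    return SequenceMatcher(None, a, b).ratio()

def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def should_merge_across_frames(a, b, sim_thresh=0.8, iou_thresh=0.5) -> bool:
    # Merge only boxes in adjacent frames whose text and position both match.
    return (text_similarity(a["text"], b["text"]) >= sim_thresh
            and iou(a["bbox"], b["bbox"]) >= iou_thresh)
```

A lower `iou_thresh` would tolerate rolling captions that drift between frames; recording the positional offset, as the patent suggests, then feeds the later filtering step.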
And 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions by using the multidimensional features of the text boxes.
The text box filtering method is shown in fig. 5, and the filtering method mainly comprises the following steps:
and deleting the text boxes with the duration exceeding the preset duration threshold from the second text box set according to the duration list.
Specifically, persistent text is usually an advertisement, watermark or similar, so text boxes whose duration is too long can be removed via the video's duration list.
And deleting the text boxes exceeding the preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates.
The tilt angle of each text box is calculated from its vertex coordinates, and text boxes whose tilt angle is larger than a preset tilt threshold are deleted from the second text box set.
In particular, since subtitles are mostly non-tilted text, tilted text is very likely not post-production overlay but scene text, as text 4 in fig. 3(1) shows, so part of the background text can be filtered out by the tilt check.
Presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
Specifically, text content that is too short is likely scene text, a false detection of the optical character recognition model, or text with no practical meaning, and should be removed.
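The four filtering rules of step 3 can be combined into one predicate, a minimal sketch rather than the patent's implementation. The threshold values are left as parameters since the patent does not fix them, and the dictionary fields (`duration`, `max_offset`, `vertices`, `text`) are assumed names.

```python
import math

def keep_box(box, *, max_duration_s, max_offset_px, max_tilt_deg, min_chars):
    """Filtering rules of step 3; a box failing any rule is deleted.
    `box` needs: duration, max_offset, vertices [(x, y) x 4, clockwise
    from top-left], text."""
    if box["duration"] > max_duration_s:    # persistent text: ads, watermarks
        return False
    if box["max_offset"] > max_offset_px:   # moving text: rolling captions, scene text
        return False
    (x0, y0), (x1, y1) = box["vertices"][0], box["vertices"][1]
    tilt = abs(math.degrees(math.atan2(y1 - y0, x1 - x0)))  # angle of the top edge
    if tilt > max_tilt_deg:                 # tilted text is rarely a subtitle
        return False
    if len(box["text"]) < min_chars:        # too short: noise or false detection
        return False
    return True
```

Applying `keep_box` with `filter` over the second text box set yields the filtered set that step 4 consumes.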
And 4, extracting the feature vector of each text box based on the time domain and space domain information of the text box for training a machine learning algorithm for the multidimensional features of the text boxes in the filtered second text box set.
As shown in fig. 6, the method for determining the feature vector of each text box in the filtered second text box set mainly includes:
the median duration of the individual characters in all text boxes in the second set of text boxes is calculated.
The absolute value of the difference between the single character duration of each text box and the median of the single character duration of the whole video is calculated as feature one.
Specifically, since every video has its own subtitle pace and presentation style, classification features extracted relative to the median adapt better to variations in video style; the later uses of the median are for the same reason.
The median is calculated using the character heights of all text boxes in the second set of text boxes, and normalized according to the video pixel height.
Specifically, since videos differ in size, the character height is normalised as the text box character height divided by the number of pixels of the video height, which keeps the feature scale the same across different videos.
And calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all the text boxes, and taking the absolute value as a second characteristic.
And calculating a thermodynamic diagram of a text region of the whole video according to the text box positions and the durations, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all the text boxes in the second text box set. As shown in fig. 7, the thermodynamic diagram clearly shows that the frequency difference of characters in the video occurs at each position, wherein the legends with the numbers (1) and (2) are the case of no subtitle in the video, and the legend with the number (3) is the case of subtitle in the video.
Specifically, the text region thermodynamic diagram is calculated as follows: a 100 x 100 two-dimensional matrix is initialized with all values set to 0; the text boxes in the second text box set are then traversed in turn, and a heat value is added at the position where each text box appears, the heat value being the duration of that text box divided by the total duration of all text boxes.
Specifically, the thermodynamic average of the region where a text box is located is the mean value of the corresponding region of the thermodynamic diagram, taken at the position where the text box appears.
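The two steps above can be sketched as follows; the `rect` key (box position as normalized `(x0, y0, x1, y1)` coordinates in `[0, 1]`) and `duration` key are assumptions for illustration:

```python
import numpy as np

GRID = 100  # the description initializes a 100 x 100 heat-map matrix

def build_heatmap(boxes, total_duration):
    # Each box adds its duration share as heat over the grid cells
    # covered by its normalized bounding box.
    heat = np.zeros((GRID, GRID))
    for b in boxes:
        x0, y0, x1, y1 = (int(v * GRID) for v in b["rect"])
        heat[y0:y1, x0:x1] += b["duration"] / total_duration
    return heat

def region_mean(heat, rect):
    # Thermodynamic average of one box: mean heat over its grid region.
    x0, y0, x1, y1 = (int(v * GRID) for v in rect)
    return float(heat[y0:y1, x0:x1].mean())
```

A fixed subtitle region accumulates heat from many boxes over the whole video, so subtitle boxes end up with thermodynamic averages close to the video-wide median while incidental text does not.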
The absolute value of the difference between the thermodynamic average of each text box and the median thermodynamic average of the whole video is calculated as feature three.
Features one, two and three are combined into a vector and used as the classification feature of each text box.
Step 5: the feature vectors of the second text box set are input into a machine-learning-based subtitle classification model, for training and for judging whether each text box is a subtitle.
As shown in fig. 8, the flow of constructing the machine learning subtitle classification model mainly comprises:
A plurality of videos are collected for training, and classification features are extracted from the text boxes in their second text box sets through the processing of steps 1-4; the three-dimensional feature vector of each text box serves as one training sample.
Specifically, videos without subtitles also contain a large amount of non-subtitle text, and therefore also need to participate in training.
Labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
Specifically, several machine learning models can be tried to compare training effects; here a model based on the AdaBoost algorithm is selected, and practice showed its effect to be optimal and clearly superior to classification models based on other algorithms.
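The patent specifies an AdaBoost-based classifier but not a particular implementation. To make the training loop concrete, below is a minimal from-scratch sketch of AdaBoost with decision stumps over the three-dimensional feature vectors (in practice a library implementation such as scikit-learn's `AdaBoostClassifier` would typically be used); labels are assumed to be +1 for subtitle and -1 for non-subtitle:

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    # Minimal AdaBoost: each weak learner is a decision stump that
    # thresholds one feature. y must be in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)          # sample weights
    model = []                        # list of (feature, threshold, sign, alpha)
    for _ in range(rounds):
        best = None
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                for s in (1, -1):
                    pred = np.where(X[:, f] < t, s, -s)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, s, pred)
        err, f, t, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        model.append((f, t, s, alpha))
        w *= np.exp(-alpha * y * pred)          # reweight samples
        w /= w.sum()
    return model

def predict(model, X):
    # Weighted vote of all stumps.
    score = sum(alpha * np.where(X[:, f] < t, s, -s)
                for f, t, s, alpha in model)
    return np.sign(score)
```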
Step 6: the text box set judged to be subtitles is set as a third text box set, and the third text box set is taken as the subtitle information of the video.
In particular, each entry in the third text box set should include the text content, vertex coordinates and duration list of the text box, which facilitates subsequent translation and video understanding work.
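One possible shape for an entry of the third text box set, reflecting the three fields named above (field names are illustrative, not specified by the patent):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SubtitleBox:
    # One entry of the third text box set: text content, vertex
    # coordinates, and a list of (start, end) appearance intervals.
    text: str
    vertices: List[Tuple[float, float]]                     # four (x, y) vertices
    durations: List[Tuple[float, float]] = field(default_factory=list)
```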
The beneficial effects of the scheme are: by detecting, recognizing, merging, filtering and classifying the text information in video frame images, and in particular through the methods of text merging and filtering, classification feature extraction and subtitle classification model construction, the scheme automatically detects and extracts the subtitle information in a video.
The above embodiments do not limit the present invention in any way. Based on the above description, those skilled in the art can make various changes and modifications without departing from the scope of the technical idea of the present invention, and all other improvements and applications made to the above embodiments by equivalent transformation fall within the protection scope of the present invention. The technical scope of the present invention is not limited to the content of the description and must be determined according to the scope of the claims.
Claims (7)
1. A subtitle extraction method based on video text merging, filtering and classifying is characterized by comprising the following steps:
step 1, extracting frames from a video at preset time intervals, performing text detection and recognition on frame images, including detecting texts in all frame images by using an optical character recognition technology to form text boxes, and counting multi-dimensional features according to text box information and video time axis information, wherein the multi-dimensional features comprise text content, vertex coordinates and appearance time of each text box to obtain a first text box set;
step 2, for the text boxes in the first text box set, performing text merging by using text box information of each text box in a single frame image and among a plurality of continuous frame images to obtain a second text box set, updating vertex coordinates of the merged text boxes, and generating a duration list according to occurrence time of the merged text boxes;
step 3, for the text boxes in the second text box set, filtering the text boxes according to preset conditions by using the multidimensional features of the text boxes;
step 4, for the multidimensional features of the text boxes in the filtered second text box set, extracting the feature vector of each text box based on the time-domain and space-domain information of the text box, for training a machine learning algorithm;
step 5, inputting the feature vector of the second text box set into a subtitle classification model based on machine learning for training and judging whether the text box set is a subtitle;
step 6, setting the text box set judged as subtitles as a third text box set, and taking the third text box set as the subtitle information of the video.
2. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 1, wherein in the step 1, the vertex coordinates include x-axis coordinates and y-axis coordinates; in the step 2, obtaining the second text box set includes:
step 2-1, aiming at more than one text box with the same appearance time, sequentially performing text box operation from top to bottom according to the y-axis coordinates of each text box, calculating to obtain the midpoint coordinates of the corresponding height and width of each text box by using vertex coordinates, presetting a judging rule based on the midpoint coordinates of the height and width of the text box, and merging different text boxes conforming to the judging rule; updating the text content and the vertex coordinates of the combined new text box, and keeping the height of the original text box as the character height of the combined text;
and 2-2, calculating the text similarity and the text box intersection ratio of adjacent frames aiming at the text boxes of all frames in a single video, and merging the text boxes meeting a preset intersection ratio threshold.
3. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 2, wherein in the step 3, the method for filtering the second text box set comprises:
step 3-1, deleting the text boxes with the duration exceeding a preset duration threshold from the second text box set according to the duration list;
deleting the text boxes exceeding a preset offset threshold from the second text box set according to the maximum offset information of the vertex coordinates;
calculating the inclination angle of the text box according to the vertex coordinates of the text box, and deleting the text box with the inclination angle larger than a preset inclination threshold value from the second text box set;
presetting a character quantity threshold, and deleting the text boxes which do not meet the character quantity threshold from the second text box set.
4. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 3, wherein in the step 4, determining the classification characteristic of each text box comprises:
calculating a median of durations of individual characters in all text boxes in the second set of text boxes;
calculating the absolute value of the difference between the single-character duration of each text box and the median of the single-character durations, as feature one;
calculating a median by using the character heights of all text boxes in the second text box set, and normalizing the median according to the video pixel heights;
calculating the absolute value of the difference between the character height of each text box and the median of the character heights of all text boxes, as feature two;
calculating a thermodynamic diagram of a text region according to the coordinate position and duration of the text boxes, obtaining a thermodynamic average value of the region where each text box is located according to the thermodynamic diagram, and calculating the median of the thermodynamic average values of all text boxes in the second text box set;
calculating the absolute value of the difference between the thermodynamic average value of each text box and the median thermodynamic average value of the whole video, as feature three;
and combining features one, two and three into the feature vector of each text box.
5. The method for extracting subtitles based on video text merging, filtering and classifying as claimed in claim 4, wherein in the step 5, the subtitle classifying model is based on an adaboost algorithm, and the method for constructing the subtitle classifying model comprises the following steps:
collecting a plurality of videos for training, and extracting classification features from text boxes in a second text box set through the processing of the steps 1-4, wherein the three-dimensional feature of each text box is used as a training sample;
labeling whether each training sample is a subtitle or not;
and inputting training samples, taking the labels as expected outputs, and training a subtitle classification model for judging the subtitles in the video.
6. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 5, wherein in the step 2-1, the decision rule comprises:
calculating a first difference absolute value of the heights of the current text box and the text box to be combined, wherein the first difference absolute value is smaller than a preset first pixel value;
calculating a second difference absolute value of a maximum value of the y-axis coordinates of the current text box and a minimum coordinate value of the y-axis of the text box to be combined, wherein the second difference absolute value is smaller than a preset second pixel value;
and calculating a third difference absolute value of the midpoint coordinates of the widths of the current text box and the text box to be combined, wherein the third difference absolute value is smaller than a preset third pixel value.
7. The method for extracting subtitle based on video text merging, filtering and classifying as defined in claim 6, wherein in the step 2-1, the first pixel value is 6 pixels; the second pixel value is 20 pixels; the third pixel value is 80 pixels.
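The decision rule of claims 6-7, with the concrete thresholds of 6, 20 and 80 pixels, can be sketched as follows; the box fields (`height`, `y_min`, `y_max`, and `x_mid`, the midpoint of the box width, all in pixels) are hypothetical names for the quantities the claims describe:

```python
H_THRESH = 6    # first pixel value: max character-height difference
Y_THRESH = 20   # second pixel value: max vertical gap between boxes
X_THRESH = 80   # third pixel value: max width-midpoint offset

def should_merge(cur, cand):
    # All three conditions of the decision rule must hold for the
    # current box and the candidate box to be merged.
    return (abs(cur["height"] - cand["height"]) < H_THRESH
            and abs(cur["y_max"] - cand["y_min"]) < Y_THRESH
            and abs(cur["x_mid"] - cand["x_mid"]) < X_THRESH)
```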
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310579487.2A CN116634223A (en) | 2023-05-22 | 2023-05-22 | Subtitle extraction method based on video text merging, filtering and classifying |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116634223A true CN116634223A (en) | 2023-08-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||