CN111860262A - Video subtitle extraction method and device - Google Patents

Video subtitle extraction method and device

Info

Publication number
CN111860262A
CN111860262A
Authority
CN
China
Prior art keywords
image
frame
sum
caption
subtitle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010665068.7A
Other languages
Chinese (zh)
Other versions
CN111860262B (en)
Inventor
田广军
郎梦园
张立国
金梅
张勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202010665068.7A priority Critical patent/CN111860262B/en
Publication of CN111860262A publication Critical patent/CN111860262A/en
Application granted granted Critical
Publication of CN111860262B publication Critical patent/CN111860262B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Input (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a video subtitle extraction method and device in the field of digital image processing. The method mainly comprises: reading a video in which subtitles are to be detected and detecting subtitle frames; locating the subtitle region within each subtitle frame; and extracting the subtitles and performing OCR recognition. Subtitle frames are detected from the inter-frame difference in corner counts, and a partial pixel accumulation method is proposed to locate the subtitle region within a subtitle frame. The invention realizes detection and extraction of subtitles embedded in video and provides a basis for subsequent functions such as retrieval and translation.

Description

Video subtitle extraction method and device
Technical Field
The invention relates to the field of digital image processing, in particular to a video subtitle extraction method and device.
Background
At present, video retrieval relies mainly on matching title keywords and video tags. This retrieval mode is limited, and title keywords and tags reflect video content only incompletely, so retrieval based on video subtitle content can serve as a complementary retrieval mode. Moreover, with globalization, language barriers arise frequently, so automatic translation of video subtitles can improve the user experience.
Video subtitle extraction has two key steps: extracting subtitle frames and locating the subtitle position. Common subtitle frame extraction techniques include histogram-based, pixel-difference-based, and contour-based algorithms; common subtitle localization techniques include edge-based methods, connected-region-based methods, and machine-learning-based methods. These prior techniques suffer from problems such as poor detection results and a large computational burden.
Disclosure of Invention
Aiming at the problems of poor detection results and heavy computation in existing video subtitle detection and extraction methods, the invention provides a video subtitle extraction method and device to extract subtitles embedded in a video.
The invention provides the following technical scheme:
In one aspect, the invention provides a video subtitle extraction method, including:
reading a video in which subtitles are to be detected;
detecting subtitle frames in the video based on the number of corner points;
locating the subtitle region in each subtitle frame;
and extracting subtitles from the located subtitle regions and performing optical character recognition to obtain subtitle text.
Preferably, detecting subtitle frames in the video based on the number of corner points includes:
converting each frame image in the video into a grayscale image;
performing corner detection on each grayscale frame image and recording the number of corner points of each frame;
taking frames whose corner counts meet preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than an average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame.
Preferably, locating the subtitle region in each subtitle frame includes:
for each subtitle frame, converting the subtitle frame into a grayscale image;
cropping the bottom quarter of the grayscale image;
performing edge detection on the cropped image using the Laplacian operator;
binarizing the edge-detected image using Otsu's method;
applying a morphological closing operation to the binarized image;
and locating the subtitle position in the closed image using a partial pixel accumulation method.
Preferably, extracting subtitles from the located subtitle regions and performing optical character recognition includes:
reading the coordinates of the subtitle position;
cropping the subtitle portion from the source image according to the coordinates and converting it into a grayscale image;
performing median filtering on the grayscale image;
performing edge detection on the median-filtered image using the Laplacian operator;
binarizing the image using Otsu's method;
and performing optical character recognition on the characters in the binarized image to obtain the subtitle text.
Preferably, locating the subtitle position in the closed image using the partial pixel accumulation method includes:
selecting, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], and denoting the pixel values O_i;
starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_up of it and the 20 pixel values below it, where O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20};
when O_SUM_up > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle;
starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_down of it and the 20 pixel values above it, where O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}; W is the pixel width of the image and L is the pixel length of the image;
when O_SUM_down > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle;
selecting, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], and denoting the pixel values O_j;
starting from j = 1 and incrementing j by 1, calculating the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right, where O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20};
when O_SUM_left > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle;
starting from j = L and decrementing j by 1, calculating the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left, where O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20};
when O_SUM_right > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle;
and saving the position coordinates of the subtitle, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5).
In another aspect, the invention further provides a video subtitle extraction device, including:
a reading unit, configured to read a video in which subtitles are to be detected;
a detection unit, configured to detect subtitle frames, based on the number of corner points, in the video read by the reading unit;
a positioning unit, configured to locate the subtitle region in each subtitle frame detected by the detection unit;
and an extraction unit, configured to extract subtitles from the subtitle regions located by the positioning unit and perform optical character recognition to obtain subtitle text.
Preferably, the detection unit is specifically configured to:
convert each frame image in the video into a grayscale image; perform corner detection on each frame image and record the number of corner points of each frame; take frames whose corner counts meet preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than an average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame.
Preferably, the positioning unit is specifically configured to:
for each subtitle frame, convert the subtitle frame into a grayscale image; crop the bottom quarter of the grayscale image; perform edge detection on the cropped image using the Laplacian operator; binarize the edge-detected image using Otsu's method; apply a morphological closing operation to the binarized image; and locate the subtitle position in the closed image using a partial pixel accumulation method.
Preferably, the extraction unit is specifically configured to:
read the coordinates of the subtitle position; crop the subtitle portion from the source image according to the coordinates and convert it into a grayscale image; perform median filtering on the grayscale image; perform edge detection on the median-filtered image using the Laplacian operator; binarize the image using Otsu's method; and perform optical character recognition on the characters in the binarized image to obtain the subtitle text.
Preferably, the positioning unit is configured to locate the subtitle position in the closed image using the partial pixel accumulation method, including:
selecting, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], and denoting the pixel values O_i;
starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_up of it and the 20 pixel values below it, where O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20};
when O_SUM_up > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle;
starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_down of it and the 20 pixel values above it, where O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}; W is the pixel width of the image and L is the pixel length of the image;
when O_SUM_down > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle;
selecting, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], and denoting the pixel values O_j;
starting from j = 1 and incrementing j by 1, calculating the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right, where O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20};
when O_SUM_left > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle;
starting from j = L and decrementing j by 1, calculating the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left, where O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20};
when O_SUM_right > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle;
and saving the position coordinates of the subtitle, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5).
The video subtitle extracting method and device provided by the invention have the following beneficial effects:
(1) In the video subtitle extraction method provided by the invention, subtitle frames are identified from the corner count, and the judgment conditions use the difference in corner counts between adjacent frames. This detects subtitle frames accurately and effectively avoids the missed detections other methods may suffer when a subtitle contains too few characters; it also reduces the cases in which a transition frame in the video is mistaken for a subtitle frame. Because the corner count changes sharply when a subtitle appears, this approach further avoids obtaining many subtitle frames containing the same subtitle, as happens with pixel-difference computation, thereby reducing subsequent computation.
(2) In the video subtitle extraction method provided by the invention, only the bottom quarter of the image is used when detecting the subtitle region, which effectively reduces the amount of computation.
(3) In the video subtitle extraction method provided by the invention, the subtitle position is located by a partial pixel accumulation method. The method relies on conventional algorithms, requiring less computation than machine-learning-based methods while effectively improving detection results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a video subtitle extracting method according to an embodiment of the present invention;
fig. 2 is a flow chart of subtitle frame detection according to an embodiment of the present invention;
fig. 3 is a flowchart of subtitle region positioning according to an embodiment of the present invention;
FIG. 4 is a flow chart of subtitle extraction and optical character recognition according to an embodiment of the present invention;
fig. 5 is a flowchart of a partial pixel value accumulation method used in subtitle region positioning according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a partial pixel value accumulation area used in positioning a subtitle area according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention provides a video subtitle extraction method and device. The method can be applied to extract subtitles embedded in a video, supporting subsequent functions such as retrieval and translation.
Referring to fig. 1, a flowchart of a video subtitle detection and extraction method according to an embodiment of the present invention is shown. The method comprises the following steps:
S101, reading a video in which subtitles are to be detected, and detecting subtitle frames in the video based on the number of corner points.
Because the appearance and disappearance of subtitles cause a large change in the number of corner points in an image, a large change in the corner count directly reflects the appearance or disappearance of a subtitle. Meanwhile, once a subtitle appears the corner count rarely stays below the threshold of 15, and the corner count increases when a subtitle appears. A frame is judged to be a subtitle frame when all three conditions are met.
One possible implementation of detecting subtitle frames in the video based on the number of corner points is as follows:
referring to fig. 2, a flow chart of subtitle frame detection in an embodiment of the present invention is shown. The method comprises the following steps:
S201, reading the video and converting each frame image into a grayscale image;
S202, performing Harris corner detection on each frame image and recording the number of corner points of each frame;
S203, taking frames whose corner counts meet the preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than the average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame;
S204, saving all subtitle frames.
Let N_i denote the number of corner points of the i-th frame. The absolute difference M_{i+1} between the corner counts of adjacent frames is calculated as shown in formula (1):
M_{i+1} = |N_{i+1} - N_i|    (1)
where i ranges over [1, n] and n is the total number of frames in the video.
Let t be the time node of the image frame corresponding to N_{i+1}. M values are calculated for that frame and all frames within the following 3 s, and their average M_AVG_{i+1} is taken, as shown in formula (2):
M_AVG_{i+1} = (M_{i+1} + M_{i+2} + ... + M_{i+z}) / z    (2)
where z is the number of frames within 3 s after the time node t (z = 3 × fps), and fps is the number of frames per second of the video.
A frame corresponding to N_{i+1} is judged to be a subtitle frame when it meets the judgment condition given in formula (3):
N_{i+1} > N_i,   N_{i+1} > 15,   M_{i+1} > M_AVG_{i+1}    (3)
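As an illustrative sketch only (not the patented implementation itself), the frame-detection rule of formulas (1) to (3) could be realized in Python with OpenCV as follows; the Harris parameters and the 1% response threshold used to count corners are assumptions of the sketch, since the embodiment specifies only Harris corner detection:

```python
import cv2
import numpy as np

def count_corners(gray):
    # Harris response map; count pixels whose response exceeds 1% of the
    # maximum (assumed threshold; the text only says "Harris corner detection").
    resp = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    return int((resp > 0.01 * resp.max()).sum())

def detect_caption_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
    counts = []  # N_i for every frame
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        counts.append(count_corners(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)))
    cap.release()

    # M_{i+1} = |N_{i+1} - N_i|, formula (1)
    diffs = [abs(counts[i + 1] - counts[i]) for i in range(len(counts) - 1)]
    z = 3 * fps  # frames within 3 s after the candidate frame
    caption_frames = []
    for i in range(len(diffs)):
        window = diffs[i:i + z]
        avg = sum(window) / len(window)  # M_AVG_{i+1}, formula (2)
        # Judgment condition (3): the count rises, exceeds 15, and the jump
        # exceeds the local 3 s average of inter-frame jumps.
        if counts[i + 1] > counts[i] and counts[i + 1] > 15 and diffs[i] > avg:
            caption_frames.append(i + 1)  # 0-based index of the subtitle frame
    return caption_frames
```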
and S102, positioning the subtitle area in each subtitle frame.
S103, extracting subtitles from the positioned subtitle areas and carrying out optical character recognition to obtain subtitle characters.
And S104, storing the obtained caption characters.
In the embodiment of the invention, subtitle frames are identified from the corner count, and the judgment conditions use the difference in corner counts between adjacent frames. This detects subtitle frames accurately and effectively avoids the missed detections other methods may suffer when a subtitle contains too few characters; it also reduces the cases in which a transition frame in the video is mistaken for a subtitle frame. Because the corner count changes sharply when a subtitle appears, this approach further avoids obtaining many subtitle frames containing the same subtitle, as happens with pixel-difference computation, thereby reducing subsequent computation.
In the above embodiment, one possible implementation manner of step S102 is as follows:
referring to fig. 3, a flowchart of subtitle area positioning according to an embodiment of the present invention is shown. The method comprises the following steps:
S301, converting the read subtitle frame into a grayscale image;
S302, cropping the subtitle frame image, keeping only the bottom 1/4 of the image; since subtitles are generally located in the bottom 1/4 of the image, only this part of the image needs to be retained;
S303, performing edge detection on the image using the Laplacian operator.
Using the Laplacian operator for edge detection not only effectively extracts the subtitles in the image but also better suppresses unnecessary noise.
S304, binarizing the edge-detected image using Otsu's method.
S305, performing a morphological closing operation on the binarized image.
S306, locating the subtitle position using the partial pixel accumulation method.
In the embodiment of the invention, only the bottom quarter of the image is used when detecting the subtitle region, which effectively reduces the amount of computation.
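By way of illustration, steps S301 to S305 might be sketched with OpenCV as follows; the closing-kernel size (15, 3) is an assumed value, since the embodiment does not specify one:

```python
import cv2

def preprocess_caption_frame(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)      # S301: grayscale
    h = gray.shape[0]
    bottom = gray[3 * h // 4:, :]                           # S302: keep bottom 1/4 only
    edges = cv2.Laplacian(bottom, cv2.CV_8U, ksize=3)       # S303: Laplacian edges
    _, binary = cv2.threshold(edges, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # S304: Otsu
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))     # assumed size
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)      # S305: closing
    return closed
```

The closed image is then passed to the partial pixel accumulation method of step S306.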
The specific implementation manner of step S306 is:
referring to fig. 5, a flowchart of a partial pixel value accumulation method used in subtitle region positioning according to an embodiment of the present invention is shown, including the following steps:
Step 1: select, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], i.e., detect the dashed-line region 2 in fig. 6.
Let the pixel values be O_i. Starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculate the sum O_SUM_up of it and the 20 pixel values below it:
O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20}
where i ranges over [1, 2, 3, ..., W-20]; W and L are respectively the pixel width and pixel length of the subtitle frame image 1 in fig. 6.
When O_SUM_up > 10, record the position coordinates (x_i, y_i) of O_i, and select the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle.
Step 2: again select, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], i.e., detect the dashed-line region 2 in fig. 6.
Let the pixel values be O_i. Starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculate the sum O_SUM_down of it and the 20 pixel values above it:
O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}
where i ranges over [W, W-1, W-2, ..., 21].
When O_SUM_down > 10, record the position coordinates (x_i, y_i) of O_i, and select the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle.
Step 3: select, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], i.e., detect the dashed-line region 4 in fig. 6.
Let the pixel values be O_j. Starting from j = 1 and incrementing j by 1, calculate the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right:
O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20}
where j ranges over [1, 2, 3, ..., L-20].
When O_SUM_left > 10, record the position coordinates (x_j, y_j) of O_j, and select the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle.
Step 4: again select, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], i.e., detect the dashed-line region 4 in fig. 6.
Let the pixel values be O_j. Starting from j = L and decrementing j by 1, calculate the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left:
O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20}
where j ranges over [L, L-1, L-2, ..., 21].
When O_SUM_right > 10, record the position coordinates (x_j, y_j) of O_j, and select the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle.
Step 5: obtain and save the subtitle position coordinates from the calculation results, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5); 5 is added to each coordinate value to avoid losing the edges of character strokes during processing. The resulting subtitle range corresponds to subtitle region 3 in fig. 6.
In the embodiment of the invention, the subtitle position is located using a partial pixel accumulation method. The method relies on conventional algorithms, requiring less computation than machine-learning-based methods while effectively improving detection results.
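A minimal sketch of steps 1 to 5 follows, assuming the closed image is a 0/1 binary array and following the text's naming, with W rows scanned by i and L columns scanned by j:

```python
import numpy as np

def locate_caption(closed):
    img = (closed > 0).astype(np.uint8)   # normalize to 0/1 pixel values
    W, L = img.shape                      # W: pixel width (rows), L: pixel length (columns)
    band = range(L // 2 - L // 40, L // 2 + L // 40 + 1)  # central column band

    ys = []
    for x in band:
        col = img[:, x]
        for i in range(W - 20):           # Step 1: top-down scan, O_SUM_up
            if col[i] == 1 and col[i:i + 21].sum() > 10:
                ys.append(i)
                break
        for i in range(W - 1, 19, -1):    # Step 2: bottom-up scan, O_SUM_down
            if col[i] == 1 and col[i - 20:i + 1].sum() > 10:
                ys.append(i)
                break
    if not ys:
        return None                       # no subtitle-like run in the band
    y_min, y_max = min(ys), max(ys)

    xs = []
    for y in range(y_min, y_max + 1):
        row = img[y, :]
        for j in range(L - 20):           # Step 3: left-to-right scan, O_SUM_left
            if row[j:j + 21].sum() > 10:
                xs.append(j)
                break
        for j in range(L - 1, 19, -1):    # Step 4: right-to-left scan, O_SUM_right
            if row[j - 20:j + 1].sum() > 10:
                xs.append(j)
                break
    if not xs:
        return None
    x_min, x_max = min(xs), max(xs)
    # Step 5: offset by 5 pixels to avoid clipping character stroke edges.
    return (x_min + 5, y_min + 5), (x_max + 5, y_max + 5)
```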
In the above embodiment, one possible implementation manner of step S103 is as follows:
referring to fig. 4, a flowchart of subtitle extraction and optical character recognition according to an embodiment of the present invention is shown, including the following steps:
S401, reading the coordinates of the subtitle position;
S402, cropping the subtitle portion from the source image according to the coordinates and converting it into a grayscale image;
S403, performing median filtering on the grayscale image to remove possible interference;
S404, performing edge detection on the filtered image using the Laplacian operator;
S405, binarizing the image using Otsu's method;
S406, recognizing the characters in the image using OCR technology, and saving the text information after recognition.
Corresponding to the above video subtitle extraction method, the invention further provides a video subtitle extraction device, comprising:
a reading unit, configured to read a video in which subtitles are to be detected;
a detection unit, configured to detect subtitle frames, based on the number of corner points, in the video read by the reading unit;
a positioning unit, configured to locate the subtitle region in each subtitle frame detected by the detection unit;
and an extraction unit, configured to extract subtitles from the subtitle regions located by the positioning unit and perform optical character recognition to obtain subtitle text.
In a possible embodiment, the detection unit is specifically configured to:
convert each frame image in the video into a grayscale image; perform corner detection on each frame image and record the number of corner points of each frame; take frames whose corner counts meet preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than an average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame.
In a possible implementation, the positioning unit is specifically configured to:
for each subtitle frame, convert the subtitle frame into a grayscale image; crop the bottom quarter of the grayscale image; perform edge detection on the cropped image using the Laplacian operator; binarize the edge-detected image using Otsu's method; apply a morphological closing operation to the binarized image; and locate the subtitle position in the closed image using a partial pixel accumulation method.
In a possible implementation, the extraction unit is specifically configured to:
read the coordinates of the subtitle position; crop the subtitle portion from the source image according to the coordinates and convert it into a grayscale image; perform median filtering on the grayscale image; perform edge detection on the median-filtered image using the Laplacian operator; binarize the image using Otsu's method; and perform optical character recognition on the characters in the binarized image to obtain the subtitle text.
In a possible implementation, the positioning unit is configured to locate the subtitle position in the closed image using the partial pixel accumulation method, including:
selecting, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], and denoting the pixel values O_i;
starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_up of it and the 20 pixel values below it, where O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20};
when O_SUM_up > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle;
starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_down of it and the 20 pixel values above it, where O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}; W is the pixel width of the image and L is the pixel length of the image;
when O_SUM_down > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle;
selecting, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], and denoting the pixel values O_j;
starting from j = 1 and incrementing j by 1, calculating the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right, where O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20};
when O_SUM_left > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle;
starting from j = L and decrementing j by 1, calculating the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left, where O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20};
when O_SUM_right > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle;
and saving the position coordinates of the subtitle, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5).
In the embodiment of the invention, subtitle frames are identified from the corner count, and the judgment conditions use the difference in corner counts between adjacent frames. This detects subtitle frames accurately and effectively avoids the missed detections other methods may suffer when a subtitle contains too few characters; it also reduces the cases in which a transition frame in the video is mistaken for a subtitle frame. Because the corner count changes sharply when a subtitle appears, this approach further avoids obtaining many subtitle frames containing the same subtitle, as happens with pixel-difference computation, thereby reducing subsequent computation.
In the embodiment of the invention, only the bottom quarter of the image is used when detecting the subtitle region, which effectively reduces the amount of computation.
In the embodiment of the invention, the subtitle position is located using a partial pixel accumulation method. The method relies on conventional algorithms, requiring less computation than machine-learning-based methods while effectively improving detection results.
In the embodiments provided in the present invention, it should be understood that the disclosed technical contents can be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit may be a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for extracting video subtitles, the method comprising:
reading a video in which subtitles are to be detected;
detecting subtitle frames in the video based on the number of corner points;
locating the subtitle region in each subtitle frame;
and extracting subtitles from the located subtitle regions and performing optical character recognition to obtain subtitle text.
2. The method of claim 1, wherein detecting subtitle frames in the video based on the number of corner points comprises:
converting each frame image in the video into a grayscale image;
performing corner detection on each grayscale frame image and recording the number of corner points of each frame;
taking frames whose corner counts meet preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than an average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame.
3. The method of claim 1, wherein locating the subtitle region in each subtitle frame comprises:
for each subtitle frame, converting the subtitle frame into a grayscale image;
cropping the bottom quarter of the grayscale image;
performing edge detection on the cropped image using the Laplacian operator;
binarizing the edge-detected image using Otsu's method;
applying a morphological closing operation to the binarized image;
and locating the subtitle position in the closed image using a partial pixel accumulation method.
4. The method of claim 1, wherein extracting subtitles from the located subtitle regions and performing optical character recognition comprises:
reading the coordinates of the subtitle position;
cropping the subtitle portion from the source image according to the coordinates and converting it into a grayscale image;
performing median filtering on the grayscale image;
performing edge detection on the median-filtered image using the Laplacian operator;
binarizing the image using Otsu's method;
and performing optical character recognition on the characters in the binarized image to obtain the subtitle text.
5. The method of claim 3, wherein locating the subtitle position in the closed image using the partial pixel accumulation method comprises:
selecting, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], and denoting the pixel values O_i;
starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_up of it and the 20 pixel values below it, where O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20};
when O_SUM_up > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle;
starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_down of it and the 20 pixel values above it, where O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}; W is the pixel width of the image and L is the pixel length of the image;
when O_SUM_down > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle;
selecting, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], and denoting the pixel values O_j;
starting from j = 1 and incrementing j by 1, calculating the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right, where O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20};
when O_SUM_left > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle;
starting from j = L and decrementing j by 1, calculating the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left, where O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20};
when O_SUM_right > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle;
and saving the position coordinates of the subtitle, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5).
6. A video subtitle extracting apparatus, the apparatus comprising:
a reading unit, configured to read a video in which subtitles are to be detected;
a detection unit, configured to detect subtitle frames, based on the number of corner points, in the video read by the reading unit;
a positioning unit, configured to locate the subtitle region in each subtitle frame detected by the detection unit;
and an extraction unit, configured to extract subtitles from the subtitle regions located by the positioning unit and perform optical character recognition to obtain subtitle text.
7. The apparatus according to claim 6, wherein the detection unit is specifically configured to:
convert each frame image in the video into a grayscale image; perform corner detection on each frame image and record the number of corner points of each frame; take frames whose corner counts meet preset conditions as subtitle frames, the preset conditions being: the corner count of the frame is greater than that of the previous frame; the corner count of the frame is greater than 15; and the absolute difference in corner counts between the frame and the previous frame is greater than an average value, the average value being the mean of the absolute corner-count differences between adjacent frames within 3 seconds after the frame.
8. The apparatus according to claim 6, wherein the positioning unit is specifically configured to:
for each subtitle frame, convert the subtitle frame into a grayscale image; crop the bottom quarter of the grayscale image; perform edge detection on the cropped image using the Laplacian operator; binarize the edge-detected image using Otsu's method; apply a morphological closing operation to the binarized image; and locate the subtitle position in the closed image using a partial pixel accumulation method.
9. The apparatus according to claim 6, wherein the extraction unit is specifically configured to:
read the coordinates of the subtitle position; crop the subtitle portion from the source image according to the coordinates and convert it into a grayscale image; perform median filtering on the grayscale image; perform edge detection on the median-filtered image using the Laplacian operator; binarize the image using Otsu's method; and perform optical character recognition on the characters in the binarized image to obtain the subtitle text.
10. The apparatus of claim 8, wherein the positioning unit is configured to locate the subtitle position in the closed image using the partial pixel accumulation method, comprising:
selecting, from the closed image along its horizontal direction, all pixel columns centered and continuous within the length range [L/2 - L/40, L/2 + L/40], and denoting the pixel values O_i;
starting from i = 1 and incrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_up of it and the 20 pixel values below it, where O_SUM_up = O_i + O_{i+1} + O_{i+2} + ... + O_{i+20};
when O_SUM_up > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the smallest y_i among all recorded coordinates as the top coordinate y_min of the subtitle;
starting from i = W and decrementing i by 1, when the pixel value O_i is 1, calculating the sum O_SUM_down of it and the 20 pixel values above it, where O_SUM_down = O_i + O_{i-1} + O_{i-2} + ... + O_{i-20}; W is the pixel width of the image and L is the pixel length of the image;
when O_SUM_down > 10, recording the position coordinates (x_i, y_i) of O_i, and selecting the largest y_i among all recorded coordinates as the bottom coordinate y_max of the subtitle;
selecting, from the closed image along its vertical direction, all pixel rows within the range [y_min, y_max], and denoting the pixel values O_j;
starting from j = 1 and incrementing j by 1, calculating the sum O_SUM_left of O_j and the 20 consecutive pixel values to its right, where O_SUM_left = O_j + O_{j+1} + O_{j+2} + ... + O_{j+20};
when O_SUM_left > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the smallest x_j among all recorded coordinates as the left coordinate x_min of the subtitle;
starting from j = L and decrementing j by 1, calculating the sum O_SUM_right of O_j and the 20 consecutive pixel values to its left, where O_SUM_right = O_j + O_{j-1} + O_{j-2} + ... + O_{j-20};
when O_SUM_right > 10, recording the position coordinates (x_j, y_j) of O_j, and selecting the largest x_j among all recorded coordinates as the right coordinate x_max of the subtitle;
and saving the position coordinates of the subtitle, the top-left corner being (x_min + 5, y_min + 5) and the bottom-right corner being (x_max + 5, y_max + 5).
CN202010665068.7A 2020-07-10 2020-07-10 Video subtitle extraction method and device Active CN111860262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010665068.7A CN111860262B (en) 2020-07-10 2020-07-10 Video subtitle extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010665068.7A CN111860262B (en) 2020-07-10 2020-07-10 Video subtitle extraction method and device

Publications (2)

Publication Number Publication Date
CN111860262A true CN111860262A (en) 2020-10-30
CN111860262B CN111860262B (en) 2022-10-25

Family

ID=72984256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010665068.7A Active CN111860262B (en) 2020-07-10 2020-07-10 Video subtitle extraction method and device

Country Status (1)

Country Link
CN (1) CN111860262B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580446A (en) * 2020-12-04 2021-03-30 北京中科凡语科技有限公司 Video subtitle translation method, system, electronic device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853381A (en) * 2009-03-31 2010-10-06 华为技术有限公司 Method and device for acquiring video subtitle information
CN107302718A (en) * 2017-08-17 2017-10-27 河南科技大学 A kind of video caption area positioning method based on Corner Detection
CN108769776A (en) * 2018-05-31 2018-11-06 北京奇艺世纪科技有限公司 Main title detection method, device and electronic equipment
CN109918987A (en) * 2018-12-29 2019-06-21 中国电子科技集团公司信息科学研究院 A kind of video caption keyword recognition method and device
CN110944237A (en) * 2019-12-12 2020-03-31 成都极米科技股份有限公司 Subtitle area positioning method and device and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853381A (en) * 2009-03-31 2010-10-06 华为技术有限公司 Method and device for acquiring video subtitle information
CN107302718A (en) * 2017-08-17 2017-10-27 河南科技大学 A kind of video caption area positioning method based on Corner Detection
CN108769776A (en) * 2018-05-31 2018-11-06 北京奇艺世纪科技有限公司 Main title detection method, device and electronic equipment
CN109918987A (en) * 2018-12-29 2019-06-21 中国电子科技集团公司信息科学研究院 A kind of video caption keyword recognition method and device
CN110944237A (en) * 2019-12-12 2020-03-31 成都极米科技股份有限公司 Subtitle area positioning method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Changjian: "Research on Video Subtitle Extraction Algorithms Based on Multiple-Instance Learning", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580446A (en) * 2020-12-04 2021-03-30 北京中科凡语科技有限公司 Video subtitle translation method, system, electronic device and readable storage medium
CN112580446B (en) * 2020-12-04 2022-06-24 北京中科凡语科技有限公司 Video subtitle translation method, system, electronic device and readable storage medium

Also Published As

Publication number Publication date
CN111860262B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
US5784500A (en) Image binarization apparatus and method of it
US8761582B2 (en) Video editing device and video editing system
EP2357614B1 (en) Method and terminal for detecting and tracking moving object using real-time camera motion estimation
JP4643829B2 (en) System and method for analyzing video content using detected text in a video frame
US6366699B1 (en) Scheme for extractions and recognitions of telop characters from video data
Utsumi et al. An object detection method for describing soccer games from video
US7738734B2 (en) Image processing method
US9311533B2 (en) Device and method for detecting the presence of a logo in a picture
US20080044102A1 (en) Method and Electronic Device for Detecting a Graphical Object
CN101853381B (en) Method and device for acquiring video subtitle information
US7440608B2 (en) Method and system for detecting image defects
CN106570510A (en) Supermarket commodity identification method
US7720281B2 (en) Visual characteristics-based news anchorperson segment detection method
CN103606220A (en) Check printed number recognition system and check printed number recognition method based on white light image and infrared image
CN111860262B (en) Video subtitle extraction method and device
CN108235115A (en) The method and terminal of voice zone location in a kind of song-video
Phan et al. Recognition of video text through temporal integration
CN110807457A (en) OSD character recognition method, device and storage device
CN106101485B (en) A kind of prospect track determination method and device based on feedback
Wang et al. Automatic TV logo detection, tracking and removal in broadcast video
CN108363981B (en) Title detection method and device
JP5624702B2 (en) Image feature amount calculation apparatus and image feature amount calculation program
CN114025089A (en) Video image acquisition jitter processing method and system
JP2000048118A (en) Information reading system
Yen et al. Precise news video text detection/localization based on multiple frames integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant