CN111539427B - Video subtitle extraction method and system - Google Patents

Video subtitle extraction method and system

Info

Publication number
CN111539427B
CN111539427B (application number CN202010356689.7A)
Authority
CN
China
Prior art keywords
gray
area
caption
frame picture
pixels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010356689.7A
Other languages
Chinese (zh)
Other versions
CN111539427A (en)
Inventor
李钦
王正航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Yimantianxia Technology Co ltd
Original Assignee
Shenzhen Youyou Brand Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youyou Brand Communication Co ltd filed Critical Shenzhen Youyou Brand Communication Co ltd
Priority to CN202010356689.7A priority Critical patent/CN111539427B/en
Publication of CN111539427A publication Critical patent/CN111539427A/en
Application granted granted Critical
Publication of CN111539427B publication Critical patent/CN111539427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 - Overlay text, e.g. embedded captions in a TV program
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 - End-user applications
    • H04N21/488 - Data services, e.g. news ticker
    • H04N21/4884 - Data services, e.g. news ticker for displaying subtitles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Abstract

The invention discloses a method and system for extracting video captions, relating to the field of image processing. The method comprises: selecting a specific area in the video picture as a caption identification area and selecting the caption color in the video picture; based on the determined caption identification area, cropping each frame picture of the video and identifying the caption identification area of each frame picture with an image identification algorithm, so as to judge whether the caption identification area of each frame picture contains a caption and whether the caption identification areas of two adjacent frame pictures are similar; based on the judgment results, grouping adjacent frames which contain the same caption, and recording the time stamps of the first and last frames in each group; and performing OCR on the caption identification area of the first frame picture in each group to obtain the caption, taking the time stamps of the first and last frames of the current group as the start and end time stamps of the currently obtained caption, and generating a caption file. The invention can effectively reduce the time required to extract video captions.

Description

Video subtitle extraction method and system
Technical Field
The invention relates to the field of image processing, in particular to a method and a system for extracting video subtitles.
Background
Subtitles refer to non-image content, such as dialogue, presented as text in television, film and stage works, and also to text added in post-production to such video works. Explanatory text and other characters displayed at the bottom of a film or television screen, such as the film title, cast and crew list, lyrics, dialogue and explanatory notes, are all referred to as subtitles.
In practical applications, subtitles often need to be extracted from video for some particular use. However, existing subtitle extraction methods have the following disadvantages: 1. they are slow, with subtitle extraction taking 5-10 times the duration of the original video; 2. the generated subtitle timeline does not enter and exit on the same frames as the subtitles in the original video; 3. considerable manual work is required to handle repeated frame pictures.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a video subtitle extraction method and system that can effectively reduce the time required to extract video subtitles.
In order to achieve the above object, the present invention provides a method for extracting video subtitles, comprising the following steps:
selecting a specific area in a video picture as a caption identification area, and selecting caption colors in the video picture;
based on the determined caption identification area, cutting each frame of the video, and identifying the caption identification area of each frame of the video based on an image identification algorithm to judge whether the caption identification area of each frame of the video contains captions or not and judge whether caption identification areas of two adjacent frames of the video are similar or not;
based on the judgment results, grouping frames which contain the same caption and are adjacent to each other, and recording the time stamps of the first and last frames in each group;
performing OCR on the caption identification area of the first frame picture in each group to obtain the caption, taking the time stamps of the first and last frames of the current group as the start and end time stamps of the currently obtained caption, and generating a caption file.
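Purely as an illustration of this four-step flow (and not as the claimed implementation), the steps could be orchestrated as in the following Python sketch; the helper names crop_frames, group_frames, ocr_first_frames and write_srt are assumptions, and possible implementations of the individual steps are sketched further below in the detailed description.

def extract_video_subtitles(video_path, region, subtitle_gray, srt_path):
    # Step 1 is manual: 'region' is the user-selected caption identification
    # area (x, y, width, height) and 'subtitle_gray' is the selected caption
    # color converted to a gray value.
    frames = crop_frames(video_path, region)        # step 2: crop every frame picture
    groups = group_frames(frames, subtitle_gray)    # step 3: group adjacent same-caption frames
    subtitles = ocr_first_frames(groups)            # step 4: OCR the first frame of each group
    write_srt(subtitles, srt_path)                  # generate the caption file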
On the basis of the above technical solution,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
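For illustration only, the two judgment modes above might be implemented roughly as follows. This is a minimal sketch assuming OpenCV and NumPy; the ±15 gray-level tolerance and the thresholds 3*h and [cw, cw*ch/2] are taken from the description, while the function names and the BGR input format are assumptions.

import cv2
import numpy as np

def valid_pixel_count(bgr_img, gray):
    # Number of pixels whose gray value lies in [gray-15, gray+15].
    g = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
    return int(np.count_nonzero((g >= gray - 15) & (g <= gray + 15)))

def contains_subtitle_global(region_img, gray):
    # Global judgment: a caption is present if the count exceeds 3*h,
    # where h is the height of the grayscale image.
    h = region_img.shape[0]
    return valid_pixel_count(region_img, gray) > 3 * h

def contains_subtitle_local(region_img, crop_box, gray):
    # Local judgment: count valid pixels only inside the preset cropping area.
    x, y, cw, ch = crop_box
    clip = region_img[y:y + ch, x:x + cw]
    n = valid_pixel_count(clip, gray)
    return cw <= n <= cw * ch / 2

The local judgment would only be usable once the preset cropping area has been determined as described next.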
On the basis of the above technical solution, the step of determining the preset clipping region includes:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
On the basis of the above technical solution, judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
On the basis of the above technical solution, performing OCR on the caption identification area of the first frame picture in each group to obtain the caption specifically comprises:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
The invention further provides a video subtitle extraction system, which comprises:
the selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture;
the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not;
the classifying module is used for classifying frames which contain the same subtitle and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group;
and the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain the caption, wherein the time stamp of the first and the last frames of the current group is the start time stamp and the end time stamp of the current obtained caption, and a caption file is generated.
On the basis of the above technical solution,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
On the basis of the above technical solution, the caption identification area of the current frame picture is cropped using a preset cropping area, and the determination of the preset cropping area includes:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
On the basis of the above technical solution, judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
On the basis of the above technical solution, performing OCR on the caption identification area of the first frame picture in each group to obtain the caption specifically comprises:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
Compared with the prior art, the invention has the following advantages: a specific area of the video picture is selected as the caption identification area, which narrows the identification area and thus effectively reduces the time needed to extract the video captions; little manual intervention is required, since only the caption identification area and the caption color need to be selected manually; and the time stamps of the frames in which each caption appears are recorded during extraction, ensuring that the generated caption timeline enters and exits on the same frames as the captions in the original video.
Drawings
Fig. 1 is a flowchart of a method for extracting video subtitles according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a video subtitle extraction method, which reduces the picture identification range by selecting a specific area in a video picture as a subtitle identification area, thereby effectively improving the subtitle extraction speed of the video picture. The embodiment of the invention correspondingly provides a video subtitle extraction system. The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, the method for extracting video subtitles provided by the embodiment of the invention includes the following steps:
s1: and selecting a specific area in the video picture as a caption identification area, and selecting caption colors in the video picture. In the embodiment of the invention, the selection of the specific area and the selection of the caption color can be manually performed, the area where the caption appears in the video picture is generally a fixed area, the caption always appears in the fixed area along with the progress of video playing, and the picture identification range can be effectively reduced by selecting the caption identification area.
S2: based on the determined caption identification area, each frame of the video is cut, and based on an image identification algorithm, the caption identification area of each frame of the video is identified, so as to judge whether the caption identification area of each frame of the video contains captions or not, and judge whether the caption identification areas of two adjacent frames of the video are similar or not.
In the embodiment of the invention, whether the caption identification area of each frame picture contains captions is judged, wherein the judging mode comprises a global judging mode and a local judging mode.
The global judgment method comprises the following steps:
s201: converting the caption identification area of the current frame picture into a gray image;
s202: reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
s203: based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains a caption, otherwise it does not contain a caption, where h is the height of the grayscale image and * denotes multiplication.
The local judging mode comprises the following steps:
s211: cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
s212: converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
s213: based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
In the embodiment of the invention, a subtitle identification area of a current frame picture is cut by using a preset cutting area, wherein the step of determining the preset cutting area comprises the following steps:
a: transversely segmenting the caption identification area of the first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array. The side length of the clipping unit area is the same as the height of the subtitle recognition area.
B: judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area.
For example, after a single subtitle recognition area is transversely segmented, 4 unit areas are obtained in sequence, namely unit area a, unit area b, unit area c and unit area d. If the number of effective pixels in unit area a satisfies [h1, h1*h/2], the weight value of unit area a is 1; if the number of effective pixels in unit area b satisfies [h1, h1*h/2], the weight value of unit area b is 2; if the number of effective pixels in unit area c does not satisfy [h1, h1*h/2], the weight value of unit area c remains 2; and if the number of effective pixels in unit area d satisfies [h1, h1*h/2], the weight value of unit area d is 3.
C: Dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part.
For example, a caption identification area of a certain frame includes 4 unit areas, namely, a unit area a, a unit area b, a unit area c and a unit area d, and after the left and right portions are divided, the left portion includes the unit area a and the unit area b, the right portion includes the unit area c and the unit area d, left represents the sum of the weight value of the unit area a and the weight value of the unit area b, and right represents the sum of the weight value of the unit area c and the weight value of the unit area d.
D: for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
For example, the caption identification area of a certain frame picture includes 4 unit areas, namely a unit area a, a unit area b, a unit area c and a unit area d in sequence, wherein the weight value of the unit area c is the largest, if the current frame picture is a left aligned caption, the preset clipping area is an area obtained by combining the unit area c and the unit area d, and if the current frame picture is a centered aligned caption, the preset clipping area is an area obtained by combining the unit area b, the unit area c and the unit area d.
By determining the preset cropping area in this way, whether a frame picture contains text can be judged within that area: whether the caption consists of a single character or of many characters, the characters fall inside the preset cropping area, so valid-pixel sampling only needs to be performed inside the preset cropping area rather than over the whole caption identification area, which effectively reduces the influence of background noise points on the pixel sampling.
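A possible sketch of steps A to D follows. It assumes the caption identification area is an OpenCV BGR image, that the unit-area side length h1 equals the height of the area, that valid_pixel_count is the helper from the earlier sketch, and that a guard against division by zero is added; none of this is prescribed by the patent.

def preset_crop_region(region_img, gray):
    # Split the caption identification area into square unit areas, compute
    # cumulative weights, decide left-aligned vs. centered, and merge the
    # relevant unit areas into the preset cropping area (x, y, width, height).
    h = region_img.shape[0]
    h1 = h                                   # side length of each square unit area
    starts = range(0, region_img.shape[1] - h1 + 1, h1)
    units = [region_img[:, i:i + h1] for i in starts]
    weights, w = [], 0
    for u in units:
        n = valid_pixel_count(u, gray)
        if h1 <= n <= h1 * h / 2:            # enough valid pixels in this unit area
            w += 1                           # weight increases by 1 over the previous one
        weights.append(w)
    half = len(weights) // 2
    left, right = sum(weights[:half]), sum(weights[half:])
    left_aligned = abs(left - right) / max(min(left, right), 1) > 0.1
    k = weights.index(max(weights))          # unit area with the maximum weight
    if left_aligned:
        lo, hi = k, min(k + 1, len(units) - 1)              # merge with the next unit area
    else:
        lo, hi = max(k - 1, 0), min(k + 1, len(units) - 1)  # merge with previous and next
    return (lo * h1, 0, (hi - lo + 1) * h1, h)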
In the embodiment of the invention, whether the caption identification areas of the front and rear adjacent two frames of pictures are similar is judged, and the specific judging process comprises the following steps:
s231: the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
s232: reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
s233: based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
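A minimal sketch of this similarity test is given below, under the assumptions that the two caption identification areas are OpenCV BGR images, that the same ±15 tolerance is used, and that the zero-valid-pixel case is treated as dissimilar as described above.

import cv2
import numpy as np

def regions_similar(img1, img2, gray):
    # Compare the caption identification areas of two adjacent frame pictures.
    m1 = np.abs(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY).astype(int) - gray) <= 15
    m2 = np.abs(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY).astype(int) - gray) <= 15
    valid1, valid2 = int(m1.sum()), int(m2.sum())
    if valid1 + valid2 == 0:
        return False                          # no valid pixels in either image: dissimilar
    diff = int(np.count_nonzero(m1 != m2))    # positions where only one image has a valid pixel
    return diff / (valid1 + valid2) < 0.3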
S3: based on the judging result, classifying the frames which contain the same caption and are adjacent to each other into a group, and recording the time stamp of the head and tail frames in each group.
In the embodiment of the invention, frames of the video which contain the same caption and are adjacent to each other are grouped together, and the time stamps of the first and last frames in each group are recorded. The specific steps comprise: judging the caption identification area of each frame picture in sequence; if the current caption identification area contains a caption, recording the caption identification area of the current frame picture and the time stamp of the current frame picture, and then judging the caption identification area of the next frame picture to determine whether it contains text and whether it is similar to the caption identification area of the previous frame picture:
if it contains text and is similar, continuing to judge the caption identification area of the next frame picture; if it contains text but is dissimilar, recording the caption identification area of the current frame picture and the time stamp of the current frame picture; if it contains no text, recording the time stamp of the current frame picture; and so on, so that frames of the video which contain the same caption and are adjacent to each other are grouped together. Containing the same caption means that the frame pictures contain text and the text is identical.
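The grouping just described might look roughly as follows; this is only a sketch, where frames is assumed to be a list of (timestamp, cropped caption identification area) pairs and contains_subtitle_global and regions_similar refer to the earlier sketches.

def group_frames(frames, gray):
    # Group adjacent frame pictures that contain the same caption; each group
    # keeps the time stamps of its first and last frame and the first crop.
    groups, current = [], None                # current = [start_ts, end_ts, first_crop]
    for ts, crop in frames:
        has_text = contains_subtitle_global(crop, gray)
        if has_text and current is not None and regions_similar(current[2], crop, gray):
            current[1] = ts                   # same caption continues: extend the group
        elif has_text:
            if current is not None:
                groups.append(current)        # the previous caption has ended
            current = [ts, ts, crop]          # a new caption starts on this frame
        else:
            if current is not None:
                current[1] = ts               # record where the caption disappears
                groups.append(current)
                current = None
    if current is not None:
        groups.append(current)
    return groups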
S4: and performing OCR (Optical Character Recognition ) on the caption recognition area of the first frame picture in each group to obtain the caption, wherein the time stamp of the first and last frames of the current group is the start time stamp and the end time stamp of the current obtained caption, and generating a caption file.
OCR is carried out on the caption identification area of the first frame picture in each group to obtain the captions; the specific steps comprise:
s401: longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
s402: OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
s403: parsing the text to obtain all the captions and the start and end time stamps of each caption, and outputting the caption file in the SRT format.
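Steps s401 to s403 could be sketched as follows. The sketch assumes that Pillow and pytesseract are available, that the crops are OpenCV BGR arrays, that the time stamps are drawn as a text banner above each strip, and that the parsing of the OCR output and the SRT writer are simplified; it is an illustration, not the patented implementation.

from PIL import Image, ImageDraw
import pytesseract

def ocr_first_frames(groups, banner_h=20):
    # Vertically stitch the first-frame caption areas of all groups, draw each
    # group's first/last time stamps above its strip, OCR the stitched picture
    # once, then parse the text into (start_ms, end_ms, caption) triples.
    strips = []
    for start, end, crop in groups:
        img = Image.fromarray(crop[:, :, ::-1])                  # BGR -> RGB
        banner = Image.new("RGB", (img.width, banner_h), "black")
        ImageDraw.Draw(banner).text((2, 2), "%d|%d" % (start, end), fill="white")
        strips.append((banner, img))
    width = max(i.width for _, i in strips)
    height = sum(b.height + i.height for b, i in strips)
    stitched = Image.new("RGB", (width, height), "black")
    y = 0
    for banner, img in strips:
        stitched.paste(banner, (0, y)); y += banner.height
        stitched.paste(img, (0, y)); y += img.height
    text = pytesseract.image_to_string(stitched, lang="chi_sim")
    lines = [l for l in text.splitlines() if l.strip()]
    subtitles = []
    for i in range(0, len(lines) - 1, 2):                        # banner line, then caption line
        start_ms, end_ms = lines[i].split("|")[:2]
        subtitles.append((float(start_ms), float(end_ms), lines[i + 1]))
    return subtitles

def write_srt(subtitles, path):
    # Write (start_ms, end_ms, text) triples as a .srt subtitle file.
    def fmt(ms):
        s, ms = divmod(int(ms), 1000)
        h, s = divmod(s, 3600)
        m, s = divmod(s, 60)
        return "%02d:%02d:%02d,%03d" % (h, m, s, ms)
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(subtitles, 1):
            f.write("%d\n%s --> %s\n%s\n\n" % (i, fmt(start), fmt(end), text))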
According to the video subtitle extraction method described above, a specific area of the video picture is selected as the subtitle identification area, which narrows the identification range and thus effectively reduces the video subtitle extraction time; little manual intervention is required, since only the subtitle identification area and the subtitle color need to be selected manually; and the time stamp of each frame in which a subtitle appears is recorded during extraction, ensuring that the generated subtitle timeline enters and exits on the same frames as the subtitles in the original video.
The embodiment of the invention provides a video subtitle extraction system which comprises a selection module, a judgment module, a classification module and an identification module.
The selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture; the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not; the classifying module is used for classifying the frames which contain the same caption and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group; the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain the caption, and the time stamp of the first and the last frame of the current group is the starting time stamp and the ending time stamp of the caption obtained currently, and a caption file is generated.
In the embodiment of the invention, whether the caption identification area of each frame of picture contains captions is judged, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judgment mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
In the embodiment of the invention, a subtitle identification area of a current frame picture is cut by using a preset cutting area, wherein the determining process of the preset cutting area comprises the following steps:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
In the embodiment of the invention, whether the caption identification areas of the front and rear adjacent two frames of pictures are similar is judged, and the specific judging process comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
In the embodiment of the invention, OCR is performed on the caption identification area of the first frame picture in each group to obtain the caption, and the specific process comprises the following steps:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
The invention is not limited to the embodiments described above, but a number of modifications and adaptations can be made by a person skilled in the art without departing from the principle of the invention, which modifications and adaptations are also considered to be within the scope of the invention. What is not described in detail in this specification is prior art known to those skilled in the art.

Claims (8)

1. A method for extracting video subtitles, characterized by comprising the following steps:
selecting a specific area in a video picture as a caption identification area, and selecting caption colors in the video picture;
based on the determined caption identification area, cutting each frame of the video, and identifying the caption identification area of each frame of the video based on an image identification algorithm to judge whether the caption identification area of each frame of the video contains captions or not and judge whether caption identification areas of two adjacent frames of the video are similar or not;
based on the judging result, classifying the frames which contain the same caption and are adjacent to each other into a group, and recording the time stamp of the head and tail frames in each group;
performing OCR on a caption identification area of a first frame picture in each group to obtain captions, wherein the time stamp of the first and the last frames of the current group is the start time stamp and the end time stamp of the current obtained captions, and generating caption files;
wherein,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
2. The method for extracting video subtitles of claim 1, wherein the subtitle identification area of the current frame picture is cropped using a preset cropping area, and wherein the determining of the preset cropping area comprises:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
3. The method for extracting video subtitles of claim 1, wherein judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
4. The method for extracting video subtitles of claim 1, wherein said OCR is performed on a subtitle recognition area of a first frame picture in each group to obtain subtitles, and the specific steps include:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
5. A video subtitle extraction system, characterized by comprising:
the selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture;
the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not;
the classifying module is used for classifying frames which contain the same subtitle and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group;
the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain captions, and the time stamp of the first and the last frame of the current group is the start time stamp and the end time stamp of the caption currently obtained, and a caption file is generated;
wherein,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
6. The video subtitle extraction system of claim 5, wherein the cropping the subtitle identification area of the current frame picture using a preset cropping area, wherein the determining of the preset cropping area comprises:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
7. The video subtitle extraction system of claim 5, wherein judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
8. The system for extracting video subtitles of claim 5 wherein said OCR is performed on a subtitle recognition area of a first frame of each group to obtain subtitles, the specific process comprising:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
performing OCR on the stitched picture and combining the recognized text contents in chronological order to form a text;
and parsing the text to obtain all the subtitles together with the start and end timestamps of each subtitle (a code sketch of the stitching and OCR step follows).
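A rough sketch of the stitching and recognition step, assuming Pillow for image handling and pytesseract as the OCR engine (the claims do not name a specific engine); the timestamp line format, the 20-pixel padding and the language setting are illustrative choices, and the returned text would still need to be parsed into individual subtitles with their timestamps.

from PIL import Image, ImageDraw
import pytesseract

def stitch_and_ocr(groups):
    # groups: list of (start_ts, end_ts, PIL.Image) tuples in chronological order,
    # each holding the subtitle recognition area of a group's first frame picture
    pad = 20                                                     # room above each strip for its timestamp line
    width = max(img.width for _, _, img in groups)
    height = sum(img.height + pad for _, _, img in groups)
    canvas = Image.new("RGB", (width, height), "black")
    draw = ImageDraw.Draw(canvas)
    y = 0
    for start, end, img in groups:
        draw.text((0, y), f"{start} --> {end}", fill="white")    # timestamps drawn above the strip
        canvas.paste(img, (0, y + pad))
        y += img.height + pad
    return pytesseract.image_to_string(canvas, lang="chi_sim")   # one OCR pass over the stitched picture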
CN202010356689.7A 2020-04-29 2020-04-29 Video subtitle extraction method and system Active CN111539427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010356689.7A CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010356689.7A CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Publications (2)

Publication Number Publication Date
CN111539427A CN111539427A (en) 2020-08-14
CN111539427B true CN111539427B (en) 2023-07-21

Family

ID=71967604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010356689.7A Active CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Country Status (1)

Country Link
CN (1) CN111539427B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112218142A (en) * 2020-08-27 2021-01-12 厦门快商通科技股份有限公司 Method and device for separating voice from video with subtitles, storage medium and electronic equipment
CN113435438B (en) * 2021-06-28 2023-05-05 中国兵器装备集团自动化研究所有限公司 Image and subtitle fused video screen plate extraction and video segmentation method
CN113343986B (en) * 2021-06-29 2023-08-25 北京奇艺世纪科技有限公司 Subtitle time interval determining method and device, electronic equipment and readable storage medium
CN116886996B (en) * 2023-09-06 2023-12-01 浙江富控创联技术有限公司 Digital village multimedia display screen broadcasting system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
CN109729420A (en) * 2017-10-27 2019-05-07 腾讯科技(深圳)有限公司 Image processing method and device, mobile terminal and computer readable storage medium
CN110210299A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8737810B2 (en) * 2002-11-15 2014-05-27 Thomson Licensing Method and apparatus for cropping of subtitle elements
CN106254933B (en) * 2016-08-08 2020-02-18 腾讯科技(深圳)有限公司 Subtitle extraction method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
CN109729420A (en) * 2017-10-27 2019-05-07 腾讯科技(深圳)有限公司 Image processing method and device, mobile terminal and computer readable storage medium
CN110210299A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Rainer Lienhart et al., "Automatic Text Segmentation and Text Recognition for Video Indexing", Multimedia Systems, 2000, Vol. 8, No. 1, pp. 11-20. *
Wang Zhihui et al., "A Two-Stage Video Subtitle Detection and Extraction Algorithm" (两阶段的视频字幕检测和提取算法), Computer Science (计算机科学), 2018, Vol. 45, No. 8, pp. 50-53, 62. *
Zhao Yiwu, "Video Subtitle Localization and Tracking Based on Edge Features" (基于边缘特征的视频字幕定位及字幕追踪方法), Modern Computer (现代计算机), 2018, No. 35, pp. 45-48. *

Also Published As

Publication number Publication date
CN111539427A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN111539427B (en) Video subtitle extraction method and system
US6101274A (en) Method and apparatus for detecting and interpreting textual captions in digital video signals
EP3016025B1 (en) Image processing device, image processing method, poi information creation system, warning system, and guidance system
US8401303B2 (en) Method and apparatus for identifying character areas in a document image
CN110287949B (en) Video clip extraction method, device, equipment and storage medium
KR100523898B1 (en) Identification, separation and compression of multiple forms with mutants
EP3096264A1 (en) Object detection system, object detection method, poi information creation system, warning system, and guidance system
US20020136458A1 (en) Method and apparatus for character string search in image
US8629918B2 (en) Image processing apparatus, image processing method and program
KR20070050752A (en) Image code based on moving picture, apparatus for generating/decoding image code based on moving picture and method therefor
US6532302B2 (en) Multiple size reductions for image segmentation
JP2004364234A (en) Broadcast program content menu creation apparatus and method
CN105657514A (en) Method and apparatus for playing video key information on mobile device browser
EP1119202A3 (en) Logo insertion on an HDTV encoder
JP2002027377A (en) Device and method for outputting picture and computer readable storage medium
Ghorpade et al. Extracting text from video
JP3655110B2 (en) Video processing method and apparatus, and recording medium recording video processing procedure
Jain et al. A hybrid approach for detection and recognition of traffic text sign using MSER and OCR
CN112735476A (en) Audio data labeling method and device
JP3534592B2 (en) Representative image generation device
JP3435334B2 (en) Apparatus and method for extracting character area in video and recording medium
CN113095239A (en) Key frame extraction method, terminal and computer readable storage medium
US8542931B2 Ruled line extraction technique based on comparison results and identifying noise based on line thickness
JP4974367B2 (en) Region dividing method and apparatus, and program
CN114359780A (en) Video image extraction method, system, readable storage medium and audition device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230614

Address after: 518000, 1603, Zone A, Huayi Building, No. 9 Pingji Avenue, Xialilang Community, Nanwan Street, Longgang District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Youyou Brand Communication Co.,Ltd.

Address before: 430000 2007, building B, Optics Valley New World t+ office building, No. 355, Guanshan Avenue, East Lake New Technology Development Zone, Wuhan, Hubei Province

Applicant before: Wuhan yimantianxia Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231115

Address after: 430000 office 7, 20 / F, building B, office building, block a, Optics Valley New World Center, Donghu New Technology Development Zone, Wuhan, Hubei Province

Patentee after: Wuhan yimantianxia Technology Co.,Ltd.

Address before: 518000, 1603, Zone A, Huayi Building, No. 9 Pingji Avenue, Xialilang Community, Nanwan Street, Longgang District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Youyou Brand Communication Co.,Ltd.

TR01 Transfer of patent right