CN111539427B - Video subtitle extraction method and system - Google Patents

Video subtitle extraction method and system

Info

Publication number
CN111539427B
CN111539427B (application number CN202010356689.7A)
Authority
CN
China
Prior art keywords
gray
area
caption
frame picture
pixels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010356689.7A
Other languages
Chinese (zh)
Other versions
CN111539427A (en)
Inventor
李钦
王正航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Yimantianxia Technology Co ltd
Original Assignee
Shenzhen Youyou Brand Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youyou Brand Communication Co ltd filed Critical Shenzhen Youyou Brand Communication Co ltd
Priority to CN202010356689.7A priority Critical patent/CN111539427B/en
Publication of CN111539427A publication Critical patent/CN111539427A/en
Application granted granted Critical
Publication of CN111539427B publication Critical patent/CN111539427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 - Overlay text, e.g. embedded captions in a TV program
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 - End-user applications
    • H04N21/488 - Data services, e.g. news ticker
    • H04N21/4884 - Data services, e.g. news ticker for displaying subtitles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Abstract

The invention discloses a method and system for extracting video captions, relating to the field of image processing. The method comprises: selecting a specific area in the video picture as a caption identification area and selecting the caption color in the video picture; based on the determined caption identification area, cropping each frame picture of the video and identifying the caption identification area of each frame picture with an image identification algorithm, so as to judge whether the caption identification area of each frame picture contains a caption and whether the caption identification areas of two adjacent frame pictures are similar; based on the judgment results, grouping adjacent frames which contain the same caption, and recording the time stamps of the first and last frames in each group; and performing OCR on the caption identification area of the first frame picture in each group to obtain the caption, taking the time stamps of the first and last frames of the current group as the start and end time stamps of the currently obtained caption, and generating a caption file. The invention can effectively reduce the time required to extract video captions.

Description

Video subtitle extraction method and system
Technical Field
The invention relates to the field of image processing, in particular to a method and a system for extracting video subtitles.
Background
Subtitles refer to non-image content, such as dialogue, presented as text in television, film and stage works, and also to text added in post-production to such video works. Explanatory text and other characters displayed at the bottom of a film or television screen, such as the film title, cast and crew list, lyrics, dialogue and explanatory notes, are all referred to as subtitles.
In practical applications, subtitles often need to be extracted from video for some particular use. However, existing subtitle extraction methods have the following disadvantages: 1. they are slow, with subtitle extraction taking 5-10 times the duration of the original video; 2. the generated subtitle timeline does not enter and exit on the same frames as the subtitles in the original video; 3. considerable manual work is required to handle repeated frame pictures.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a video subtitle extraction method and system that can effectively reduce the time required to extract video subtitles.
In order to achieve the above object, the present invention provides a method for extracting video subtitles, comprising the following steps:
selecting a specific area in a video picture as a caption identification area, and selecting caption colors in the video picture;
based on the determined caption identification area, cutting each frame of the video, and identifying the caption identification area of each frame of the video based on an image identification algorithm to judge whether the caption identification area of each frame of the video contains captions or not and judge whether caption identification areas of two adjacent frames of the video are similar or not;
based on the judgment results, grouping frames which contain the same caption and are adjacent to each other, and recording the time stamps of the first and last frames in each group;
performing OCR on the caption identification area of the first frame picture in each group to obtain the caption, taking the time stamps of the first and last frames of the current group as the start and end time stamps of the currently obtained caption, and generating a caption file.
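Purely as an illustration of this four-step flow (and not as the claimed implementation), the steps could be orchestrated as in the following Python sketch; the helper names crop_frames, group_frames, ocr_first_frames and write_srt are assumptions, and possible implementations of the individual steps are sketched further below in the detailed description.

def extract_video_subtitles(video_path, region, subtitle_gray, srt_path):
    # Step 1 is manual: 'region' is the user-selected caption identification
    # area (x, y, width, height) and 'subtitle_gray' is the selected caption
    # color converted to a gray value.
    frames = crop_frames(video_path, region)        # step 2: crop every frame picture
    groups = group_frames(frames, subtitle_gray)    # step 3: group adjacent same-caption frames
    subtitles = ocr_first_frames(groups)            # step 4: OCR the first frame of each group
    write_srt(subtitles, srt_path)                  # generate the caption file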
On the basis of the above technical solution,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
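For illustration only, the two judgment modes above might be implemented roughly as follows. This is a minimal sketch assuming OpenCV and NumPy; the ±15 gray-level tolerance and the thresholds 3*h and [cw, cw*ch/2] are taken from the description, while the function names and the BGR input format are assumptions.

import cv2
import numpy as np

def valid_pixel_count(bgr_img, gray):
    # Number of pixels whose gray value lies in [gray-15, gray+15].
    g = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
    return int(np.count_nonzero((g >= gray - 15) & (g <= gray + 15)))

def contains_subtitle_global(region_img, gray):
    # Global judgment: a caption is present if the count exceeds 3*h,
    # where h is the height of the grayscale image.
    h = region_img.shape[0]
    return valid_pixel_count(region_img, gray) > 3 * h

def contains_subtitle_local(region_img, crop_box, gray):
    # Local judgment: count valid pixels only inside the preset cropping area.
    x, y, cw, ch = crop_box
    clip = region_img[y:y + ch, x:x + cw]
    n = valid_pixel_count(clip, gray)
    return cw <= n <= cw * ch / 2

The local judgment would only be usable once the preset cropping area has been determined as described next.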
On the basis of the above technical solution, the step of determining the preset clipping region includes:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
On the basis of the above technical solution, judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
On the basis of the above technical solution, performing OCR on the caption identification area of the first frame picture in each group to obtain the caption specifically comprises:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
The invention further provides a video subtitle extraction system, which comprises:
the selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture;
the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not;
the classifying module is used for classifying frames which contain the same subtitle and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group;
and the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain the caption, wherein the time stamp of the first and the last frames of the current group is the start time stamp and the end time stamp of the current obtained caption, and a caption file is generated.
On the basis of the above technical solution,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
On the basis of the above technical solution, the caption identification area of the current frame picture is cropped using a preset cropping area, and the determination of the preset cropping area includes:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
On the basis of the above technical solution, judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
On the basis of the above technical solution, performing OCR on the caption identification area of the first frame picture in each group to obtain the caption specifically comprises:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
Compared with the prior art, the invention has the following advantages: a specific area of the video picture is selected as the caption identification area, which narrows the identification area and thus effectively reduces the time needed to extract the video captions; little manual intervention is required, since only the caption identification area and the caption color need to be selected manually; and the time stamps of the frames in which each caption appears are recorded during extraction, ensuring that the generated caption timeline enters and exits on the same frames as the captions in the original video.
Drawings
Fig. 1 is a flowchart of a method for extracting video subtitles according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a video subtitle extraction method, which reduces the picture identification range by selecting a specific area in a video picture as a subtitle identification area, thereby effectively improving the subtitle extraction speed of the video picture. The embodiment of the invention correspondingly provides a video subtitle extraction system. The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, the method for extracting video subtitles provided by the embodiment of the invention includes the following steps:
s1: and selecting a specific area in the video picture as a caption identification area, and selecting caption colors in the video picture. In the embodiment of the invention, the selection of the specific area and the selection of the caption color can be manually performed, the area where the caption appears in the video picture is generally a fixed area, the caption always appears in the fixed area along with the progress of video playing, and the picture identification range can be effectively reduced by selecting the caption identification area.
S2: based on the determined caption identification area, each frame of the video is cut, and based on an image identification algorithm, the caption identification area of each frame of the video is identified, so as to judge whether the caption identification area of each frame of the video contains captions or not, and judge whether the caption identification areas of two adjacent frames of the video are similar or not.
In the embodiment of the invention, whether the caption identification area of each frame picture contains captions is judged, wherein the judging mode comprises a global judging mode and a local judging mode.
The global judgment method comprises the following steps:
s201: converting the caption identification area of the current frame picture into a gray image;
s202: reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
s203: based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains a caption, otherwise it does not contain a caption, where h is the height of the grayscale image and * denotes multiplication.
The local judging mode comprises the following steps:
s211: cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
s212: converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
s213: based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
In the embodiment of the invention, a subtitle identification area of a current frame picture is cut by using a preset cutting area, wherein the step of determining the preset cutting area comprises the following steps:
a: transversely segmenting the caption identification area of the first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array. The side length of the clipping unit area is the same as the height of the subtitle recognition area.
B: judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area.
For example, after a single subtitle recognition area is transversely segmented, 4 unit areas are obtained in sequence, namely unit area a, unit area b, unit area c and unit area d. If the number of effective pixels in unit area a satisfies [h1, h1*h/2], the weight value of unit area a is 1; if the number of effective pixels in unit area b satisfies [h1, h1*h/2], the weight value of unit area b is 2; if the number of effective pixels in unit area c does not satisfy [h1, h1*h/2], the weight value of unit area c remains 2; and if the number of effective pixels in unit area d satisfies [h1, h1*h/2], the weight value of unit area d is 3.
C: Dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part.
For example, a caption identification area of a certain frame includes 4 unit areas, namely, a unit area a, a unit area b, a unit area c and a unit area d, and after the left and right portions are divided, the left portion includes the unit area a and the unit area b, the right portion includes the unit area c and the unit area d, left represents the sum of the weight value of the unit area a and the weight value of the unit area b, and right represents the sum of the weight value of the unit area c and the weight value of the unit area d.
D: for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
For example, the caption identification area of a certain frame picture includes 4 unit areas, namely a unit area a, a unit area b, a unit area c and a unit area d in sequence, wherein the weight value of the unit area c is the largest, if the current frame picture is a left aligned caption, the preset clipping area is an area obtained by combining the unit area c and the unit area d, and if the current frame picture is a centered aligned caption, the preset clipping area is an area obtained by combining the unit area b, the unit area c and the unit area d.
By determining the preset cropping area in this way, whether a frame picture contains text can be judged within that area: whether the caption consists of a single character or of many characters, the characters fall inside the preset cropping area, so valid-pixel sampling only needs to be performed inside the preset cropping area rather than over the whole caption identification area, which effectively reduces the influence of background noise points on the pixel sampling.
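A possible sketch of steps A to D follows. It assumes the caption identification area is an OpenCV BGR image, that the unit-area side length h1 equals the height of the area, that valid_pixel_count is the helper from the earlier sketch, and that a guard against division by zero is added; none of this is prescribed by the patent.

def preset_crop_region(region_img, gray):
    # Split the caption identification area into square unit areas, compute
    # cumulative weights, decide left-aligned vs. centered, and merge the
    # relevant unit areas into the preset cropping area (x, y, width, height).
    h = region_img.shape[0]
    h1 = h                                   # side length of each square unit area
    starts = range(0, region_img.shape[1] - h1 + 1, h1)
    units = [region_img[:, i:i + h1] for i in starts]
    weights, w = [], 0
    for u in units:
        n = valid_pixel_count(u, gray)
        if h1 <= n <= h1 * h / 2:            # enough valid pixels in this unit area
            w += 1                           # weight increases by 1 over the previous one
        weights.append(w)
    half = len(weights) // 2
    left, right = sum(weights[:half]), sum(weights[half:])
    left_aligned = abs(left - right) / max(min(left, right), 1) > 0.1
    k = weights.index(max(weights))          # unit area with the maximum weight
    if left_aligned:
        lo, hi = k, min(k + 1, len(units) - 1)              # merge with the next unit area
    else:
        lo, hi = max(k - 1, 0), min(k + 1, len(units) - 1)  # merge with previous and next
    return (lo * h1, 0, (hi - lo + 1) * h1, h)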
In the embodiment of the invention, whether the caption identification areas of the front and rear adjacent two frames of pictures are similar is judged, and the specific judging process comprises the following steps:
s231: the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
s232: reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
s233: based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
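A minimal sketch of this similarity test is given below, under the assumptions that the two caption identification areas are OpenCV BGR images, that the same ±15 tolerance is used, and that the zero-valid-pixel case is treated as dissimilar as described above.

import cv2
import numpy as np

def regions_similar(img1, img2, gray):
    # Compare the caption identification areas of two adjacent frame pictures.
    m1 = np.abs(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY).astype(int) - gray) <= 15
    m2 = np.abs(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY).astype(int) - gray) <= 15
    valid1, valid2 = int(m1.sum()), int(m2.sum())
    if valid1 + valid2 == 0:
        return False                          # no valid pixels in either image: dissimilar
    diff = int(np.count_nonzero(m1 != m2))    # positions where only one image has a valid pixel
    return diff / (valid1 + valid2) < 0.3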
S3: based on the judging result, classifying the frames which contain the same caption and are adjacent to each other into a group, and recording the time stamp of the head and tail frames in each group.
In the embodiment of the invention, frames of the video which contain the same caption and are adjacent to each other are grouped together, and the time stamps of the first and last frames in each group are recorded. The specific steps comprise: judging the caption identification area of each frame picture in sequence; if the current caption identification area contains a caption, recording the caption identification area of the current frame picture and the time stamp of the current frame picture, and then judging the caption identification area of the next frame picture to determine whether it contains text and whether it is similar to the caption identification area of the previous frame picture:
if it contains text and is similar, continuing to judge the caption identification area of the next frame picture; if it contains text but is dissimilar, recording the caption identification area of the current frame picture and the time stamp of the current frame picture; if it contains no text, recording the time stamp of the current frame picture; and so on, so that frames of the video which contain the same caption and are adjacent to each other are grouped together. Containing the same caption means that the frame pictures contain text and the text is identical.
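The grouping just described might look roughly as follows; this is only a sketch, where frames is assumed to be a list of (timestamp, cropped caption identification area) pairs and contains_subtitle_global and regions_similar refer to the earlier sketches.

def group_frames(frames, gray):
    # Group adjacent frame pictures that contain the same caption; each group
    # keeps the time stamps of its first and last frame and the first crop.
    groups, current = [], None                # current = [start_ts, end_ts, first_crop]
    for ts, crop in frames:
        has_text = contains_subtitle_global(crop, gray)
        if has_text and current is not None and regions_similar(current[2], crop, gray):
            current[1] = ts                   # same caption continues: extend the group
        elif has_text:
            if current is not None:
                groups.append(current)        # the previous caption has ended
            current = [ts, ts, crop]          # a new caption starts on this frame
        else:
            if current is not None:
                current[1] = ts               # record where the caption disappears
                groups.append(current)
                current = None
    if current is not None:
        groups.append(current)
    return groups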
S4: and performing OCR (Optical Character Recognition ) on the caption recognition area of the first frame picture in each group to obtain the caption, wherein the time stamp of the first and last frames of the current group is the start time stamp and the end time stamp of the current obtained caption, and generating a caption file.
OCR is carried out on the caption identification area of the first frame picture in each group to obtain the captions; the specific steps comprise:
s401: longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
s402: OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
s403: parsing the text to obtain all the captions and the start and end time stamps of each caption, and outputting the caption file in the SRT format.
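Steps s401 to s403 could be sketched as follows. The sketch assumes that Pillow and pytesseract are available, that the crops are OpenCV BGR arrays, that the time stamps are drawn as a text banner above each strip, and that the parsing of the OCR output and the SRT writer are simplified; it is an illustration, not the patented implementation.

from PIL import Image, ImageDraw
import pytesseract

def ocr_first_frames(groups, banner_h=20):
    # Vertically stitch the first-frame caption areas of all groups, draw each
    # group's first/last time stamps above its strip, OCR the stitched picture
    # once, then parse the text into (start_ms, end_ms, caption) triples.
    strips = []
    for start, end, crop in groups:
        img = Image.fromarray(crop[:, :, ::-1])                  # BGR -> RGB
        banner = Image.new("RGB", (img.width, banner_h), "black")
        ImageDraw.Draw(banner).text((2, 2), "%d|%d" % (start, end), fill="white")
        strips.append((banner, img))
    width = max(i.width for _, i in strips)
    height = sum(b.height + i.height for b, i in strips)
    stitched = Image.new("RGB", (width, height), "black")
    y = 0
    for banner, img in strips:
        stitched.paste(banner, (0, y)); y += banner.height
        stitched.paste(img, (0, y)); y += img.height
    text = pytesseract.image_to_string(stitched, lang="chi_sim")
    lines = [l for l in text.splitlines() if l.strip()]
    subtitles = []
    for i in range(0, len(lines) - 1, 2):                        # banner line, then caption line
        start_ms, end_ms = lines[i].split("|")[:2]
        subtitles.append((float(start_ms), float(end_ms), lines[i + 1]))
    return subtitles

def write_srt(subtitles, path):
    # Write (start_ms, end_ms, text) triples as a .srt subtitle file.
    def fmt(ms):
        s, ms = divmod(int(ms), 1000)
        h, s = divmod(s, 3600)
        m, s = divmod(s, 60)
        return "%02d:%02d:%02d,%03d" % (h, m, s, ms)
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(subtitles, 1):
            f.write("%d\n%s --> %s\n%s\n\n" % (i, fmt(start), fmt(end), text))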
According to the video subtitle extraction method described above, a specific area of the video picture is selected as the subtitle identification area, which narrows the identification range and thus effectively reduces the video subtitle extraction time; little manual intervention is required, since only the subtitle identification area and the subtitle color need to be selected manually; and the time stamp of each frame in which a subtitle appears is recorded during extraction, ensuring that the generated subtitle timeline enters and exits on the same frames as the subtitles in the original video.
The embodiment of the invention provides a video subtitle extraction system which comprises a selection module, a judgment module, a classification module and an identification module.
The selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture; the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not; the classifying module is used for classifying the frames which contain the same caption and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group; the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain the caption, and the time stamp of the first and the last frame of the current group is the starting time stamp and the ending time stamp of the caption obtained currently, and a caption file is generated.
In the embodiment of the invention, whether the caption identification area of each frame of picture contains captions is judged, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judgment mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
In the embodiment of the invention, a subtitle identification area of a current frame picture is cut by using a preset cutting area, wherein the determining process of the preset cutting area comprises the following steps:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
In the embodiment of the invention, whether the caption identification areas of the front and rear adjacent two frames of pictures are similar is judged, and the specific judging process comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
In the embodiment of the invention, OCR is performed on the caption identification area of the first frame picture in each group to obtain the caption, and the specific process comprises the following steps:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
The invention is not limited to the embodiments described above, but a number of modifications and adaptations can be made by a person skilled in the art without departing from the principle of the invention, which modifications and adaptations are also considered to be within the scope of the invention. What is not described in detail in this specification is prior art known to those skilled in the art.

Claims (8)

1. A method for extracting video subtitles, characterized by comprising the following steps:
selecting a specific area in a video picture as a caption identification area, and selecting caption colors in the video picture;
based on the determined caption identification area, cutting each frame of the video, and identifying the caption identification area of each frame of the video based on an image identification algorithm to judge whether the caption identification area of each frame of the video contains captions or not and judge whether caption identification areas of two adjacent frames of the video are similar or not;
based on the judging result, classifying the frames which contain the same caption and are adjacent to each other into a group, and recording the time stamp of the head and tail frames in each group;
performing OCR on a caption identification area of a first frame picture in each group to obtain captions, wherein the time stamp of the first and the last frames of the current group is the start time stamp and the end time stamp of the current obtained captions, and generating caption files;
wherein,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
2. The method for extracting video subtitles of claim 1, wherein the subtitle identification area of the current frame picture is cropped using a preset cropping area, and wherein the determining of the preset cropping area comprises:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
3. The method for extracting video subtitles of claim 1, wherein judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
4. The method for extracting video subtitles of claim 1, wherein said OCR is performed on a subtitle recognition area of a first frame picture in each group to obtain subtitles, and the specific steps include:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
OCR is carried out on the spliced pictures, and the obtained text contents are combined according to the time sequence to form texts;
and analyzing the text to obtain all the subtitles, and a starting time stamp of each subtitle.
5. A video subtitle extraction system, characterized by comprising:
the selecting module is used for selecting a specific area in the video picture as a caption identification area and selecting caption colors in the video picture;
the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains captions or not and judge whether caption identification areas of two adjacent frames of pictures are similar or not;
the classifying module is used for classifying frames which contain the same subtitle and are adjacent to each other into a group based on the judging result, and recording the time stamp of the head and tail frames in each group;
the recognition module is used for performing OCR on the caption recognition area of the first frame picture in each group to obtain captions, and the time stamp of the first and the last frame of the current group is the start time stamp and the end time stamp of the caption currently obtained, and a caption file is generated;
wherein,
judging whether the caption identification area of each frame picture contains captions, wherein the judging mode comprises a global judging mode and a local judging mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading a gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image, wherein gray is a preset gray value, and the value range is 0-255;
based on the obtained number, if the obtained number is larger than 3*h, the caption identification area of the current frame picture contains captions, otherwise, the caption identification area of the current frame picture does not contain captions, wherein h is the height of the gray level image;
the local judging mode comprises the following steps:
cutting the subtitle identification area of the current frame picture by using a preset cutting area to obtain a cutting image;
converting the clipping image into a gray image, and then reading the gray image pixel by pixel to obtain the number of pixels with gray values belonging to [ gray-15, gray+15] in the gray image;
based on the obtained number, if the obtained number belongs to [ cw, cw×ch/2], it indicates that the caption identification area of the current frame picture contains captions, otherwise, it indicates that the caption identification area of the current frame picture does not contain captions, where cw represents the width of the clip image and ch represents the height of the clip image.
6. The video subtitle extraction system of claim 5, wherein the cropping the subtitle identification area of the current frame picture using a preset cropping area, wherein the determining of the preset cropping area comprises:
transversely segmenting a caption identification area of a first frame picture in each group to obtain a plurality of unit areas which are identical in shape and square, storing by using an array, and storing the number of effective pixel points in the unit area of the caption identification area of one frame picture in each array;
judging the number of effective pixels of each unit area in a single subtitle identification area, if the number of the effective pixels of the current unit area meets [ h1, h1 x h/2], adding 1 to the weight value of the current unit area compared with the weight value of the last unit area, and if the number of the effective pixels of the current unit area does not meet [ h1, h1 x h/2], keeping the weight value of the current unit area consistent with the weight value of the last unit area, wherein the effective pixels refer to pixel points with gray values belonging to [ gray-15, gray+15], and h1 is the side length of the unit area;
dividing all unit areas of the caption identification area of the current frame picture into a left part and a right part, calculating the sum of the weight values of the unit areas in each part, and judging whether |left - right| / min{left, right} is larger than 0.1; if so, the current frame picture has a left-aligned caption, otherwise it has a center-aligned caption, where left represents the sum of the weight values of the unit areas in the left part and right represents the sum of the weight values of the unit areas in the right part;
for the frame picture of the left aligned caption, finding out the unit area with the maximum weight value in the single caption identification area and the next unit area adjacent to the unit area, and merging the two found unit areas to obtain an area which is a preset cutting area; and for the frame picture with centered and aligned subtitles, finding out a unit area with the maximum weight value in the single subtitle identification area and a front unit area and a rear unit area adjacent to the unit area, and merging the found three unit areas to obtain an area which is a preset cutting area.
7. The video subtitle extraction system of claim 5, wherein judging whether the caption identification areas of two adjacent frame pictures are similar specifically comprises the following steps:
the caption identification areas of two adjacent frames of pictures are converted into gray images, so that two gray images are obtained;
reading two gray images from pixel points to obtain the number of pixel points with gray values belonging to [ gray-15, gray+15] in the two gray images;
based on the obtained numbers,
if the number of pixels with gray values belonging to [gray-15, gray+15] in the two gray images is 0, the caption identification areas of the two adjacent frame pictures are dissimilar;
if diff/(valid1 + valid2) < 0.3, the caption identification areas of the two adjacent frame pictures are similar, where valid1 represents the number of pixels with gray values belonging to [gray-15, gray+15] in one gray image, valid2 represents the number of pixels with gray values belonging to [gray-15, gray+15] in the other gray image, and diff represents the number of pixel positions at which the pixels of the two gray images are not both valid pixels or both invalid pixels; a valid pixel is a pixel whose gray value belongs to [gray-15, gray+15], and an invalid pixel is a pixel whose gray value does not belong to [gray-15, gray+15];
if diff/(valid1 + valid2) is not less than 0.3, the caption identification areas of the two adjacent frame pictures are dissimilar.
8. The system for extracting video subtitles of claim 5 wherein said OCR is performed on a subtitle recognition area of a first frame of each group to obtain subtitles, the specific process comprising:
longitudinally splicing the caption identification areas of the first frame picture of each group in time order to form a stitched picture, wherein, in the stitched picture, the time stamps of the first and last frames of the group to which each caption identification area belongs are drawn above that area;
performing OCR on the stitched picture and combining the recognized text contents in chronological order to form a text;
and parsing the text to obtain all the subtitles together with the start and end timestamps of each subtitle (a code sketch of the stitching and OCR step follows).
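A rough sketch of the stitching and recognition step, assuming Pillow for image handling and pytesseract as the OCR engine (the claims do not name a specific engine); the timestamp line format, the 20-pixel padding and the language setting are illustrative choices, and the returned text would still need to be parsed into individual subtitles with their timestamps.

from PIL import Image, ImageDraw
import pytesseract

def stitch_and_ocr(groups):
    # groups: list of (start_ts, end_ts, PIL.Image) tuples in chronological order,
    # each holding the subtitle recognition area of a group's first frame picture
    pad = 20                                                     # room above each strip for its timestamp line
    width = max(img.width for _, _, img in groups)
    height = sum(img.height + pad for _, _, img in groups)
    canvas = Image.new("RGB", (width, height), "black")
    draw = ImageDraw.Draw(canvas)
    y = 0
    for start, end, img in groups:
        draw.text((0, y), f"{start} --> {end}", fill="white")    # timestamps drawn above the strip
        canvas.paste(img, (0, y + pad))
        y += img.height + pad
    return pytesseract.image_to_string(canvas, lang="chi_sim")   # one OCR pass over the stitched picture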
CN202010356689.7A 2020-04-29 2020-04-29 Video subtitle extraction method and system Active CN111539427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010356689.7A CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010356689.7A CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Publications (2)

Publication Number Publication Date
CN111539427A CN111539427A (en) 2020-08-14
CN111539427B true CN111539427B (en) 2023-07-21

Family

ID=71967604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010356689.7A Active CN111539427B (en) 2020-04-29 2020-04-29 Video subtitle extraction method and system

Country Status (1)

Country Link
CN (1) CN111539427B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112218142A (en) * 2020-08-27 2021-01-12 厦门快商通科技股份有限公司 Method and device for separating voice from video with subtitles, storage medium and electronic equipment
CN113435438B (en) * 2021-06-28 2023-05-05 中国兵器装备集团自动化研究所有限公司 Image and subtitle fused video screen plate extraction and video segmentation method
CN113343986B (en) * 2021-06-29 2023-08-25 北京奇艺世纪科技有限公司 Subtitle time interval determining method and device, electronic equipment and readable storage medium
CN116886996B (en) * 2023-09-06 2023-12-01 浙江富控创联技术有限公司 Digital village multimedia display screen broadcasting system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
CN109729420A (en) * 2017-10-27 2019-05-07 腾讯科技(深圳)有限公司 Image processing method and device, mobile terminal and computer readable storage medium
CN110210299A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8737810B2 (en) * 2002-11-15 2014-05-27 Thomson Licensing Method and apparatus for cropping of subtitle elements
CN106254933B (en) * 2016-08-08 2020-02-18 腾讯科技(深圳)有限公司 Subtitle extraction method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
CN109729420A (en) * 2017-10-27 2019-05-07 腾讯科技(深圳)有限公司 Image processing method and device, mobile terminal and computer readable storage medium
CN110210299A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Rainer Lienhart et al., "Automatic Text Segmentation and Text Recognition for Video Indexing", Multimedia Systems, 2000, Vol. 8, No. 1, pp. 11-20. *
Wang Zhihui et al., "A Two-Stage Video Subtitle Detection and Extraction Algorithm" (两阶段的视频字幕检测和提取算法), Computer Science (计算机科学), 2018, Vol. 45, No. 8, pp. 50-53, 62. *
Zhao Yiwu, "Video Subtitle Localization and Tracking Based on Edge Features" (基于边缘特征的视频字幕定位及字幕追踪方法), Modern Computer (现代计算机), 2018, No. 35, pp. 45-48. *

Also Published As

Publication number Publication date
CN111539427A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN111539427B (en) Video subtitle extraction method and system
US6101274A (en) Method and apparatus for detecting and interpreting textual captions in digital video signals
EP3016025B1 (en) Image processing device, image processing method, poi information creation system, warning system, and guidance system
US8401303B2 (en) Method and apparatus for identifying character areas in a document image
CN110287949B (en) Video clip extraction method, device, equipment and storage medium
KR100523898B1 (en) Identification, separation and compression of multiple forms with mutants
EP3096264A1 (en) Object detection system, object detection method, poi information creation system, warning system, and guidance system
US20020136458A1 (en) Method and apparatus for character string search in image
US8629918B2 (en) Image processing apparatus, image processing method and program
KR20070050752A (en) Image code based on moving picture, apparatus for generating/decoding image code based on moving picture and method therefor
US6532302B2 (en) Multiple size reductions for image segmentation
JP2004364234A (en) Broadcast program content menu creation apparatus and method
CN105657514A (en) Method and apparatus for playing video key information on mobile device browser
EP1119202A3 (en) Logo insertion on an HDTV encoder
JP2002027377A (en) Device and method for outputting picture and computer readable storage medium
Ghorpade et al. Extracting text from video
JP3655110B2 (en) Video processing method and apparatus, and recording medium recording video processing procedure
Jain et al. A hybrid approach for detection and recognition of traffic text sign using MSER and OCR
CN112735476A (en) Audio data labeling method and device
JP3534592B2 (en) Representative image generation device
JP3435334B2 (en) Apparatus and method for extracting character area in video and recording medium
CN113095239A (en) Key frame extraction method, terminal and computer readable storage medium
US8542931B2 Ruled line extraction technique based on comparison results and identifying noise based on line thickness
JP4974367B2 (en) Region dividing method and apparatus, and program
CN114359780A (en) Video image extraction method, system, readable storage medium and audition device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230614

Address after: 518000, 1603, Zone A, Huayi Building, No. 9 Pingji Avenue, Xialilang Community, Nanwan Street, Longgang District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Youyou Brand Communication Co.,Ltd.

Address before: 430000 2007, building B, Optics Valley New World t+ office building, No. 355, Guanshan Avenue, East Lake New Technology Development Zone, Wuhan, Hubei Province

Applicant before: Wuhan yimantianxia Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231115

Address after: 430000 office 7, 20 / F, building B, office building, block a, Optics Valley New World Center, Donghu New Technology Development Zone, Wuhan, Hubei Province

Patentee after: Wuhan yimantianxia Technology Co.,Ltd.

Address before: 518000, 1603, Zone A, Huayi Building, No. 9 Pingji Avenue, Xialilang Community, Nanwan Street, Longgang District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Youyou Brand Communication Co.,Ltd.

TR01 Transfer of patent right