Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method and a system for extracting video subtitles, which can effectively reduce the time needed to extract the subtitles.
In order to achieve the above object, the present invention provides a method for extracting video subtitles, comprising the following steps:
selecting a specific area in a video picture as a subtitle identification area, and selecting the color of a subtitle in the video picture;
based on the determined caption identification area, cutting each frame of picture of the video, and based on an image identification algorithm, identifying the caption identification area of each frame of picture to judge whether the caption identification area of each frame of picture contains a caption or not and judge whether the caption identification areas of two adjacent frames of pictures are similar or not;
based on the judgment result, grouping adjacent frames containing the same caption into a group, and recording the time stamps of the head and tail frames in each group;
and performing OCR on the subtitle identification area of the first frame picture in each group to obtain the subtitle, taking the timestamps of the first and last frames of the current group as the start and end timestamps of the subtitle thus obtained, and generating a subtitle file.
On the basis of the above technical solution,
whether the subtitle identification area of each frame of picture contains a subtitle is judged in either a global judgment mode or a local judgment mode;
the global judgment mode comprises the following steps:
converting the caption identification area of the current frame picture into a gray image;
reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15], wherein gray is a preset gray value in the range 0-255;
based on the obtained number, if the number is greater than 3 × h, the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein h is the height of the gray image;
the local judgment mode comprises the following steps:
cropping the subtitle identification area of the current frame picture with a preset cropping area to obtain a cropped image;
converting the cropped image into a gray image, and reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15];
and based on the obtained number, if the number belongs to [cw, cw × ch/2], the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein cw is the width of the cropped image and ch is its height.
On the basis of the above technical solution, the subtitle identification area of the current frame picture is cropped with a preset cropping area, wherein the step of determining the preset cropping area comprises:
transversely segmenting the subtitle identification area of the first frame of picture in each group into a plurality of identical square unit areas, and storing, in an array, the number of valid pixels in each unit area of that subtitle identification area;
judging the number of valid pixels of each unit area in a single subtitle identification area: if the number of valid pixels of the current unit area falls in [h1, h1 × h/2], the weight value of the current unit area is the weight value of the previous unit area plus 1; otherwise the weight value of the current unit area equals that of the previous unit area, wherein a valid pixel is a pixel whose gray value belongs to [gray - 15, gray + 15] and h1 is the side length of a unit area;
dividing all unit areas of the subtitle identification area of the current frame picture into a left part and a right part, calculating the weight sum of each part, and judging whether |left - right| / min{left, right} is greater than 0.1: if so, the current frame carries a left-aligned subtitle, otherwise a center-aligned subtitle, wherein left and right denote the weight sums of the left and right parts respectively;
for a frame with a left-aligned subtitle, finding the unit area with the maximum weight value in the single subtitle identification area and the next adjacent unit area, and merging these two unit areas to obtain the preset cropping area; for a frame with a center-aligned subtitle, finding the unit area with the maximum weight value and its two adjacent unit areas (the preceding and the following one), and merging these three unit areas to obtain the preset cropping area.
On the basis of the above technical solution, the specific process of judging whether the subtitle identification areas of two adjacent frames of pictures are similar is as follows:
converting the caption identification areas of two adjacent frames of pictures into gray level images to obtain two gray level images;
reading the two gray images pixel by pixel to obtain, for each image, the number of pixels whose gray values belong to [gray - 15, gray + 15];
based on the obtained numbers:
if the number of such pixels in both gray images is 0, the subtitle identification areas of the two adjacent frames are not similar;
if diff/(valid1 + valid2) < 0.3, the subtitle identification areas of the two adjacent frames are similar, wherein valid1 and valid2 denote the numbers of pixels whose gray values belong to [gray - 15, gray + 15] in the two gray images respectively, and diff denotes the number of positions at which one image has a valid pixel while the other has an invalid pixel; a valid pixel is a pixel whose gray value belongs to [gray - 15, gray + 15], and an invalid pixel is one whose gray value does not;
if diff/(valid1 + valid2) ≥ 0.3, the subtitle identification areas of the two adjacent frames are not similar.
On the basis of the technical scheme, the OCR is performed on the subtitle recognition area of the first frame picture in each group to obtain the subtitle, and the specific steps include:
longitudinally splicing the subtitle identification areas of the first frame of picture in each group in time order to form a spliced picture, and drawing, above each subtitle identification area in the spliced picture, the timestamps of the first and last frames of the group to which that area belongs;
OCR is carried out on the spliced pictures, and the obtained character contents are combined according to the time sequence to form a text;
and analyzing the text to obtain all the subtitles and the start time stamp of each subtitle.
The invention further provides a video subtitle extraction system, which comprises:
the selection module is used for selecting a specific area in a video picture as a subtitle identification area and selecting the color of a subtitle in the video picture;
the judging module is used for cutting each frame of picture of the video based on the determined caption identification area, identifying the caption identification area of each frame of picture based on an image identification algorithm so as to judge whether the caption identification area of each frame of picture contains a caption or not and judge whether the caption identification areas of two adjacent frames of pictures are similar or not;
the classification module is used for classifying adjacent frames containing the same subtitles in the video into a group based on the judgment result and recording timestamps of head and tail frames in each group;
and the recognition module is used for performing OCR on the subtitle identification area of the first frame picture in each group to obtain the subtitle, taking the timestamps of the first and last frames of the current group as the start and end timestamps of the subtitle thus obtained, and generating the subtitle file.
On the basis of the above technical solution,
whether the subtitle identification area of each frame of picture contains a subtitle is judged in either a global judgment mode or a local judgment mode;
the global judgment mode comprises the following processes:
converting the caption identification area of the current frame picture into a gray image;
reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15], wherein gray is a preset gray value in the range 0-255;
based on the obtained number, if the number is greater than 3 × h, the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein h is the height of the gray image;
the local judgment mode comprises the following processes:
cropping the subtitle identification area of the current frame picture with a preset cropping area to obtain a cropped image;
converting the cropped image into a gray image, and reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15];
and based on the obtained number, if the number belongs to [cw, cw × ch/2], the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein cw is the width of the cropped image and ch is its height.
On the basis of the above technical solution, the subtitle identification area of the current frame picture is cropped with a preset cropping area, wherein the process of determining the preset cropping area comprises:
transversely segmenting the subtitle identification area of the first frame of picture in each group into a plurality of identical square unit areas, and storing, in an array, the number of valid pixels in each unit area of that subtitle identification area;
judging the number of valid pixels of each unit area in a single subtitle identification area: if the number of valid pixels of the current unit area falls in [h1, h1 × h/2], the weight value of the current unit area is the weight value of the previous unit area plus 1; otherwise the weight value of the current unit area equals that of the previous unit area, wherein a valid pixel is a pixel whose gray value belongs to [gray - 15, gray + 15] and h1 is the side length of a unit area;
dividing all unit areas of the subtitle identification area of the current frame picture into a left part and a right part, calculating the weight sum of each part, and judging whether |left - right| / min{left, right} is greater than 0.1: if so, the current frame carries a left-aligned subtitle, otherwise a center-aligned subtitle, wherein left and right denote the weight sums of the left and right parts respectively;
for a frame with a left-aligned subtitle, finding the unit area with the maximum weight value in the single subtitle identification area and the next adjacent unit area, and merging these two unit areas to obtain the preset cropping area; for a frame with a center-aligned subtitle, finding the unit area with the maximum weight value and its two adjacent unit areas (the preceding and the following one), and merging these three unit areas to obtain the preset cropping area.
On the basis of the above technical solution, the specific process of judging whether the subtitle identification areas of two adjacent frames of pictures are similar is as follows:
converting the caption identification areas of two adjacent frames of pictures into gray level images to obtain two gray level images;
reading the two gray images pixel by pixel to obtain, for each image, the number of pixels whose gray values belong to [gray - 15, gray + 15];
based on the obtained numbers:
if the number of such pixels in both gray images is 0, the subtitle identification areas of the two adjacent frames are not similar;
if diff/(valid1 + valid2) < 0.3, the subtitle identification areas of the two adjacent frames are similar, wherein valid1 and valid2 denote the numbers of pixels whose gray values belong to [gray - 15, gray + 15] in the two gray images respectively, and diff denotes the number of positions at which one image has a valid pixel while the other has an invalid pixel; a valid pixel is a pixel whose gray value belongs to [gray - 15, gray + 15], and an invalid pixel is one whose gray value does not;
if diff/(valid1 + valid2) ≥ 0.3, the subtitle identification areas of the two adjacent frames are not similar.
On the basis of the technical scheme, the OCR is performed on the subtitle recognition area of the first frame picture in each group to obtain the subtitle, and the specific process comprises the following steps:
longitudinally splicing the subtitle identification areas of the first frame of picture in each group in time order to form a spliced picture, and drawing, above each subtitle identification area in the spliced picture, the timestamps of the first and last frames of the group to which that area belongs;
OCR is carried out on the spliced pictures, and the obtained character contents are combined according to the time sequence to form a text;
and analyzing the text to obtain all the subtitles and the start time stamp of each subtitle.
Compared with the prior art, the invention has the following advantages: a specific area of the video picture is selected as the subtitle identification area, which reduces the area to be recognized and thus effectively saves video subtitle extraction time; manual intervention is minimal, as only the subtitle identification area and the subtitle color need to be selected manually; and because the timestamp of each subtitled picture is recorded during extraction, the generated subtitle timeline is guaranteed to enter and exit on the same frames as the subtitles of the original video.
Detailed Description
The embodiment of the invention provides a method for extracting video subtitles, which is used for reducing the picture identification range by selecting a specific area in a video picture as a subtitle identification area, thereby effectively improving the subtitle extraction speed of the video picture. The embodiment of the invention correspondingly provides a video subtitle extraction system. The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, a method for extracting a video subtitle according to an embodiment of the present invention includes the following steps:
s1: and selecting a specific area in the video picture as a subtitle identification area, and selecting the color of the subtitle in the video picture. In the embodiment of the invention, the selection of the specific area and the selection of the caption color can be manually selected in a manual mode, the area where the caption appears in the video picture is generally a fixed area, the caption always appears in the fixed area along with the progress of video playing, and the picture identification range can be effectively reduced by selecting the caption identification area.
S2: and based on the determined caption identification area, cutting each frame of picture of the video, and based on an image identification algorithm, identifying the caption identification area of each frame of picture to judge whether the caption identification area of each frame of picture contains a caption or not and judge whether the caption identification areas of two adjacent frames of pictures are similar or not.
In the embodiment of the invention, whether the subtitle identification area of each frame of picture contains a subtitle is judged in either a global judgment mode or a local judgment mode.
The global judgment mode comprises the following steps:
S201: converting the subtitle identification area of the current frame picture into a gray image;
S202: reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15], wherein gray is a preset gray value in the range 0-255;
S203: based on the obtained number, if the number is greater than 3 × h, the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein h is the height of the gray image and × denotes multiplication.
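Steps S201 to S203 can be sketched as follows, assuming the subtitle identification area is already available as a grayscale NumPy array; the function name and array-based interface are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def global_contains_subtitle(gray_img: np.ndarray, gray: int) -> bool:
    """Global judgment mode: count pixels whose gray value lies in
    [gray - 15, gray + 15]; the area is judged to contain a subtitle
    when the count exceeds 3 * h, where h is the image height."""
    h = gray_img.shape[0]
    count = np.count_nonzero((gray_img >= gray - 15) & (gray_img <= gray + 15))
    return count > 3 * h
```

For a 10 × 50 area filled entirely with the subtitle color (count = 500 > 30), the function returns True; for an all-background area it returns False.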
The local judgment method comprises the following steps:
S211: cropping the subtitle identification area of the current frame picture with the preset cropping area to obtain a cropped image;
S212: converting the cropped image into a gray image, and reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15];
S213: based on the obtained number, if the number belongs to [cw, cw × ch/2], the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein cw is the width of the cropped image and ch is its height.
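A minimal sketch of the local judgment mode under the same assumptions (grayscale NumPy array for the cropped image; the function name is illustrative):

```python
import numpy as np

def local_contains_subtitle(crop: np.ndarray, gray: int) -> bool:
    """Local judgment mode: within the preset cropping area, the area
    contains a subtitle when the number of valid pixels falls in
    [cw, cw * ch / 2], where cw and ch are the crop width and height."""
    ch, cw = crop.shape
    count = np.count_nonzero((crop >= gray - 15) & (crop <= gray + 15))
    return cw <= count <= cw * ch / 2
```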
In the embodiment of the invention, a preset cutting area is used for cutting a caption identification area of a current frame picture, wherein the step of determining the preset cutting area comprises the following steps:
a: and transversely segmenting the caption identification area of the first frame of picture in each group to obtain a plurality of unit areas which are identical in shape and are square, storing by using arrays, and storing the number of effective pixel points in the unit area of the caption identification area of one frame of picture by each array. And the side length of the unit area obtained by cutting is the same as the height of the subtitle identification area.
B: judging the number of effective pixels of each unit region in a single subtitle identification region, if the number of effective pixels of the current unit region meets [ h1, h1 h/2], adding 1 to the weight value of the current unit region compared with the weight value of the previous unit region, if the number of effective pixels of the current unit region does not meet [ h1, h1 h/2], keeping the weight value of the current unit region consistent with the weight value of the previous unit region, wherein the effective pixels refer to pixels of which the gray values belong to [ gray-15, gray +15], and h1 is the side length of the unit region.
For example, after a single subtitle identification area is transversely segmented, four unit areas a, b, c and d are obtained in order. If the number of valid pixels in unit area a satisfies [h1, h1 × h/2], the weight value of unit area a is 1; if unit area b also satisfies [h1, h1 × h/2], its weight value is 2; if unit area c does not satisfy [h1, h1 × h/2], its weight value remains 2; and if unit area d satisfies [h1, h1 × h/2], its weight value is 3.
C: dividing all unit areas of a caption identification area of a current frame picture into a left part and a right part, calculating the weight sum of each part of unit area, and then judging whether | left-right |/min { left, right } is greater than 0.1, if so, the current frame picture is a left aligned caption, otherwise, the current frame picture is a middle aligned caption, wherein left represents the sum of weight values of the left part of unit area, and right represents the sum of weight values of the right part of unit area.
For example, a subtitle recognition area of a frame includes 4 unit areas, which are, in turn, a unit area a, a unit area b, a unit area c, and a unit area d, and after left and right division, the left part includes the unit area a and the unit area b, the right part includes the unit area c and the unit area d, left indicates a sum of a weight value of the unit area a and a weight value of the unit area b, and right indicates a sum of a weight value of the unit area c and a weight value of the unit area d.
D: for the frame picture of the left-aligned caption, finding out a unit area with the maximum weight value in the single caption identification area and a next unit area adjacent to the unit area, and combining the two found unit areas to obtain an area which is a preset clipping area; and for the frame picture of the centered aligned caption, finding out a unit area with the maximum weight value in the single caption identification area and a front unit area and a rear unit area which are adjacent to the unit area, and combining the three found unit areas to obtain an area which is a preset clipping area.
For example, a subtitle identification area of a frame includes four unit areas a, b, c and d in order, of which unit area c has the largest weight value. If the current frame carries a left-aligned subtitle, the preset cropping area is the area obtained by merging unit areas c and d; if it carries a center-aligned subtitle, the preset cropping area is the area obtained by merging unit areas b, c and d.
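Steps A to D can be sketched as follows, again assuming grayscale NumPy arrays; the function name, the return convention (indices of the merged unit areas), and the guard against a zero minimum weight sum are illustrative assumptions, not details of the disclosure.

```python
import numpy as np

def preset_crop_region(region: np.ndarray, gray: int):
    """Determine the preset cropping area of a subtitle identification
    area: split it into square unit areas of side h1 (= region height h),
    increment a running weight whenever a unit's valid-pixel count falls
    in [h1, h1 * h / 2], then pick units around the maximum weight."""
    h1 = h = region.shape[0]
    n = region.shape[1] // h1
    weights, w = [], 0
    for i in range(n):
        unit = region[:, i * h1:(i + 1) * h1]
        valid = np.count_nonzero((unit >= gray - 15) & (unit <= gray + 15))
        if h1 <= valid <= h1 * h / 2:
            w += 1
        weights.append(w)
    left, right = sum(weights[:n // 2]), sum(weights[n // 2:])
    # |left - right| / min{left, right} > 0.1, rewritten to avoid dividing by 0
    left_aligned = abs(left - right) > 0.1 * min(left, right)
    k = int(np.argmax(weights))  # unit area with the maximum weight value
    units = [k, k + 1] if left_aligned else [k - 1, k, k + 1]
    return [u for u in units if 0 <= u < n]  # indices forming the crop area
```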
The preset cropping area is determined so that whether a frame contains text can be judged more accurately: whether the subtitle consists of a single character or many, the characters fall within the preset cropping area, so valid-pixel sampling is needed only inside that area rather than across the whole subtitle identification area, which effectively reduces the influence of background noise on pixel sampling.
In the embodiment of the invention, whether the caption identification areas of two adjacent frames of pictures are similar or not is judged, and the specific judgment process comprises the following steps:
S231: converting the subtitle identification areas of two adjacent frames of pictures into gray images to obtain two gray images;
S232: reading the two gray images pixel by pixel to obtain, for each image, the number of pixels whose gray values belong to [gray - 15, gray + 15];
S233: based on the obtained numbers:
if the number of such pixels in both gray images is 0, the subtitle identification areas of the two adjacent frames are not similar;
if diff/(valid1 + valid2) < 0.3, the subtitle identification areas of the two adjacent frames are similar, wherein valid1 and valid2 denote the numbers of pixels whose gray values belong to [gray - 15, gray + 15] in the two gray images respectively, and diff denotes the number of positions at which one image has a valid pixel while the other has an invalid pixel; a valid pixel is a pixel whose gray value belongs to [gray - 15, gray + 15], and an invalid pixel is one whose gray value does not;
if diff/(valid1 + valid2) ≥ 0.3, the subtitle identification areas of the two adjacent frames are not similar.
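Steps S231 to S233 can be sketched as follows (grayscale NumPy arrays assumed; treating the zero-valid-pixel case as "not similar" follows the rule above):

```python
import numpy as np

def regions_similar(g1: np.ndarray, g2: np.ndarray, gray: int) -> bool:
    """Adjacent-frame similarity: valid pixels lie in [gray - 15, gray + 15];
    diff counts positions where exactly one of the two images is valid.
    Similar when diff / (valid1 + valid2) < 0.3."""
    m1 = (g1 >= gray - 15) & (g1 <= gray + 15)
    m2 = (g2 >= gray - 15) & (g2 <= gray + 15)
    valid1, valid2 = np.count_nonzero(m1), np.count_nonzero(m2)
    if valid1 + valid2 == 0:
        return False  # no valid pixels in either image: not similar
    diff = np.count_nonzero(m1 != m2)  # positions where validity disagrees
    return diff / (valid1 + valid2) < 0.3
```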
S3: and based on the judgment result, grouping adjacent frames containing the same caption in the video into a group, and recording the time stamp of the head frame and the tail frame in each group.
In the embodiment of the invention, adjacent frames containing the same subtitle are grouped together and the timestamps of the first and last frames of each group are recorded, as follows: the subtitle identification area of each frame is judged in turn; if the current subtitle identification area contains a subtitle, that area and the timestamp of the current frame are recorded, and the subtitle identification area of the next frame is then judged for whether it contains text and whether it is similar to the subtitle identification area of the previous frame:
if it contains text and is similar, the subtitle identification area of the following frame is judged next; if it contains text but is not similar, the subtitle identification area and timestamp of the current frame are recorded (a new group begins); if it contains no text, the timestamp of the current frame is recorded (the current group ends). Proceeding in this way, adjacent frames containing the same subtitle are grouped together; the subtitle identification areas of the frames in one group contain the same text.
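The grouping logic above can be sketched as follows; the callback-based interface (`contains`, `similar`) and the tuple return format are illustrative assumptions, not part of the disclosure.

```python
def group_frames(frames, contains, similar):
    """Group consecutive frames that show the same subtitle.

    frames   -- iterable of (timestamp, region) pairs in time order
    contains -- callback: does this region contain a subtitle?
    similar  -- callback: are two adjacent regions similar?
    Returns a list of (first_region, start_ts, end_ts), one per group."""
    groups, current, prev_region = [], None, None
    for ts, region in frames:
        if contains(region):
            if current is not None and similar(prev_region, region):
                current = (current[0], current[1], ts)  # extend current group
            else:
                if current is not None:
                    groups.append(current)
                current = (region, ts, ts)  # start a new group
        elif current is not None:
            groups.append(current)  # subtitle disappeared: close the group
            current = None
        prev_region = region
    if current is not None:
        groups.append(current)
    return groups
```

With toy regions (integers standing in for images), three frames of subtitle 1, a blank frame, then two frames of subtitle 2 yield two groups with the expected first/last timestamps.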
S4: and performing OCR (Optical character recognition) on the caption recognition area of the first frame picture in each group to obtain the caption, wherein the time stamp of the first frame and the time stamp of the last frame of the current group are the start time stamp and the end time stamp of the currently obtained caption, and generating a caption file.
Performing OCR on a subtitle recognition area of a first frame of picture in each group to obtain subtitles, and specifically comprising the following steps of:
s401: longitudinally splicing the subtitle identification areas of the first frame of picture in each group according to the time sequence to form a spliced picture, and drawing a timestamp of the first frame and the last frame of the group where the subtitle identification areas are located above each subtitle identification area in the spliced picture;
s402: OCR is carried out on the spliced pictures, and the obtained character contents are combined according to the time sequence to form a text;
s403: and analyzing the text to obtain all subtitles and the start time stamp of each subtitle, and outputting a subtitle file according to srt format.
According to the method for extracting video subtitles provided by the embodiment of the invention, a specific area of the video picture is selected as the subtitle identification area, which reduces the area to be recognized and thus effectively saves video subtitle extraction time; manual intervention is minimal, as only the subtitle identification area and the subtitle color need to be selected manually; and because the timestamp of each subtitled picture is recorded during extraction, the generated subtitle timeline is guaranteed to enter and exit on the same frames as the subtitles of the original video.
The video subtitle extraction system provided by the embodiment of the invention comprises a selection module, a judgment module, a classification module and an identification module.
The selection module is used for selecting a specific area in a video picture as the subtitle identification area and selecting the color of the subtitle in the video picture. The judgment module is used for cropping each frame of picture of the video based on the determined subtitle identification area and recognizing the subtitle identification area of each frame based on an image recognition algorithm, so as to judge whether the subtitle identification area of each frame contains a subtitle and whether the subtitle identification areas of two adjacent frames are similar. The classification module is used for grouping adjacent frames containing the same subtitle based on the judgment results and recording the timestamps of the first and last frames of each group. The recognition module is used for performing OCR on the subtitle identification area of the first frame of picture in each group to obtain the subtitle, taking the timestamps of the first and last frames of the current group as the start and end timestamps of the subtitle thus obtained, and generating the subtitle file.
In the embodiment of the invention, whether the subtitle identification area of each frame of picture contains a subtitle is judged in either a global judgment mode or a local judgment mode;
the global judgment mode comprises the following processes:
converting the caption identification area of the current frame picture into a gray image;
reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15], wherein gray is a preset gray value in the range 0-255;
based on the obtained number, if the number is greater than 3 × h, the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein h is the height of the gray image;
the local judgment mode comprises the following processes:
cropping the subtitle identification area of the current frame picture with a preset cropping area to obtain a cropped image;
converting the cropped image into a gray image, and reading the gray image pixel by pixel to obtain the number of pixels whose gray values belong to [gray - 15, gray + 15];
and based on the obtained number, if the number belongs to [cw, cw × ch/2], the subtitle identification area of the current frame picture contains a subtitle, otherwise it does not, wherein cw is the width of the cropped image and ch is its height.
In the embodiment of the invention, the subtitle recognition area of the current frame is cropped with a preset cropping area, and the preset cropping area is determined as follows:
transversely segmenting the subtitle recognition area of the first frame in each group into a plurality of square unit areas of identical size, and storing the unit areas in arrays, where each array stores the valid-pixel counts of the unit areas of one frame's subtitle recognition area;
judging the number of valid pixels of each unit area within a single subtitle recognition area: if the valid-pixel count of the current unit area falls within [h1, h1 × h/2], the weight value of the current unit area is the weight value of the previous unit area plus 1; otherwise, the weight value of the current unit area remains equal to that of the previous unit area, where a valid pixel is a pixel whose gray value falls within [gray-15, gray+15] and h1 is the side length of a unit area;
dividing all the unit areas of the subtitle recognition area of the current frame into a left half and a right half, calculating the sum of the weight values of each half, and then judging whether |left − right| / min{left, right} is greater than 0.1: if so, the current frame carries a left-aligned subtitle, otherwise a center-aligned subtitle, where left is the sum of the weight values of the left-half unit areas and right is the sum of the weight values of the right-half unit areas;
for a frame with a left-aligned subtitle, finding the unit area with the maximum weight value in the subtitle recognition area together with the unit area immediately following it, and merging the two to obtain the preset cropping area; for a frame with a center-aligned subtitle, finding the unit area with the maximum weight value together with the unit areas immediately before and after it, and merging the three to obtain the preset cropping area.
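Assuming the valid-pixel count of each unit area has already been gathered into one list per frame, the weight computation, alignment decision, and unit selection above might be sketched as follows. The function and variable names are illustrative, and the handling of a zero-weight half (not specified in the text) is an assumption.

```python
def preset_crop_units(unit_counts, h1, h):
    """unit_counts: valid-pixel count of each square unit area, left to right;
    h1: side length of a unit area; h: height of the grayscale image.
    Returns the alignment and the indices of the unit areas to merge."""
    # Cumulative weights: a unit's weight is the previous unit's weight,
    # plus 1 when its valid-pixel count falls within [h1, h1*h/2].
    weights, w = [], 0
    for c in unit_counts:
        if h1 <= c <= h1 * h / 2:
            w += 1
        weights.append(w)
    mid = len(weights) // 2
    left, right = sum(weights[:mid]), sum(weights[mid:])
    # |left - right| / min{left, right} > 0.1 -> left-aligned subtitle.
    # (Treating a zero half as left-aligned is an assumption.)
    left_aligned = (min(left, right) == 0
                    or abs(left - right) / min(left, right) > 0.1)
    k = weights.index(max(weights))  # unit area with the maximum weight
    if left_aligned:
        return "left", [k, min(k + 1, len(weights) - 1)]
    return "center", [max(k - 1, 0), k, min(k + 1, len(weights) - 1)]
```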
In the embodiment of the invention, whether the subtitle recognition areas of two adjacent frames are similar is judged as follows:
converting the subtitle recognition areas of the two adjacent frames into grayscale images to obtain two grayscale images;
reading the two grayscale images pixel by pixel to obtain the number of pixels in each whose gray values fall within [gray-15, gray+15];
based on the obtained numbers:
if the number of pixels whose gray values fall within [gray-15, gray+15] is 0 in either of the two grayscale images, the subtitle recognition areas of the two adjacent frames are dissimilar;
if diff/(valid1 + valid2) < 0.3, the subtitle recognition areas of the two adjacent frames are similar, where valid1 is the number of pixels whose gray values fall within [gray-15, gray+15] in one grayscale image, valid2 is the corresponding number in the other, and diff is the number of positions at which the pixels of the two grayscale images are not both valid or both invalid, a valid pixel being one whose gray value falls within [gray-15, gray+15] and an invalid pixel being one whose gray value does not;
if diff/(valid1 + valid2) ≥ 0.3, the subtitle recognition areas of the two adjacent frames are dissimilar.
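The similarity judgment can be sketched as below (same 2-D-list representation of equal-sized grayscale images; the function name is an assumption):

```python
def regions_similar(g1, g2, gray):
    """Judge whether two subtitle recognition areas are similar,
    per the diff/(valid1+valid2) < 0.3 rule."""
    def is_valid(px):
        return gray - 15 <= px <= gray + 15
    flat1 = [px for row in g1 for px in row]
    flat2 = [px for row in g2 for px in row]
    valid1 = sum(map(is_valid, flat1))
    valid2 = sum(map(is_valid, flat2))
    if valid1 == 0 or valid2 == 0:
        # No subtitle-colored pixels in one of the areas: dissimilar.
        return False
    # diff: positions where the two pixels are not both valid or both invalid.
    diff = sum(1 for a, b in zip(flat1, flat2) if is_valid(a) != is_valid(b))
    return diff / (valid1 + valid2) < 0.3
```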
In the embodiment of the invention, OCR is performed on the subtitle recognition area of the first frame in each group to obtain the subtitle, and the specific process comprises:
longitudinally splicing the subtitle recognition areas of the first frames of the groups in chronological order into one stitched picture, and drawing, above each subtitle recognition area in the stitched picture, the timestamps of the head and tail frames of the group it belongs to;
performing OCR on the stitched picture, and combining the recognized text contents in chronological order into a text;
parsing the text to obtain all the subtitles and the start timestamp of each subtitle.
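Assuming the OCR output of the stitched picture alternates between a drawn timestamp line ("start end") and the subtitle line beneath it, the parsing step might look like the sketch below. The alternating layout and the function name are assumptions; the text only specifies that the timestamps are drawn above each area.

```python
def parse_stitched_text(text):
    """Parse OCR output of the stitched picture into subtitle records.
    Assumes each timestamp line ('start end') is followed by one subtitle line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    subtitles = []
    # Pair every even line (timestamps) with the odd line that follows it.
    for ts_line, sub_line in zip(lines[0::2], lines[1::2]):
        start, end = ts_line.split()
        subtitles.append({"start": start, "end": end, "text": sub_line})
    return subtitles
```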
The present invention is not limited to the above-described embodiments, and it will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention; such modifications and improvements are also considered to be within the scope of the present invention. Details not described in this specification are within the common knowledge of those skilled in the art.