CN107145888A - Video caption real time translating method - Google Patents

Video caption real time translating method

Info

Publication number
CN107145888A
CN107145888A
Authority
CN
China
Prior art keywords
text
mser
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710345936.1A
Other languages
Chinese (zh)
Inventor
代劲
王族
宋娟
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710345936.1A priority Critical patent/CN107145888A/en
Publication of CN107145888A publication Critical patent/CN107145888A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a video caption real-time translation method, including: performing multi-channel extraction on an original image intercepted from a video to obtain a plurality of single-channel images; extracting MSER regions of the original image and the single-channel images respectively, based on the MSER algorithm; calculating the local contrast between each MSER region and its background region and determining, according to each local contrast, whether to filter out the corresponding MSER region; determining the boundary key points of each MSER region; taking the boundary key points as classification and screening features, classifying and screening each MSER region remaining after filtering through a trained SVM to obtain text regions; distinguishing text lines among the text regions according to the distance between every two vertically adjacent text regions, and classifying the text regions on the same text line according to the distance between every two adjacent text regions on that line; and performing real-time translation of the video captions based on the classified text regions.

Description

Real-time translation method for video captions
Technical Field
The invention belongs to the field of image processing, and particularly relates to a real-time translation method for video subtitles.
Background
In recent years, text detection and recognition in images of natural scenes has been a subject of intense research in the fields of computer vision, pattern recognition and even document analysis. Researchers have proposed a number of new ideas and methods for extracting textual information from images of natural scenes. However, when translating video subtitles at present, because the time complexity of extracting text information from an image is high, real-time translation of the video subtitles cannot be realized.
Disclosure of Invention
The invention provides a real-time translation method for video subtitles, which aims to solve the problem that the real-time translation of the video subtitles cannot be realized due to higher time complexity of extracting text information from an image when the video subtitles are translated at present.
According to a first aspect of the embodiments of the present invention, a method for real-time translating video subtitles is provided, which includes:
performing multi-channel extraction on an original image intercepted from a video to obtain a plurality of single-channel images;
extracting MSER regions of the original image and the single-channel images respectively, based on the maximally stable extremal regions (MSER) algorithm;
introducing local contrast text features, calculating the local contrast between each MSER region and the background region thereof, and determining whether to filter the corresponding MSER region according to each local contrast;
introducing text features of the boundary key points, and determining the boundary key points of each MSER region;
classifying and screening all MSER regions left after filtering by using the boundary key points as classification and screening characteristics through a trained Support Vector Machine (SVM) to obtain text regions;
according to the distance between every two adjacent text regions in the vertical direction, distinguishing each text region, and according to the distance between every two adjacent text regions in the same text line, classifying each text region in the same text line;
and performing real-time translation on the video subtitles based on the classified text regions.
In an optional implementation manner, before performing multi-channel extraction on an original image intercepted from a video to obtain a plurality of single-channel images, the method further includes: preprocessing the original image including sharpening and blurring.
In another optional implementation manner, the performing multi-channel extraction on the original image captured from the video to obtain a plurality of single-channel images includes: and respectively carrying out R, G, B, H, S, V six-channel image extraction on the original image and the preprocessed original image so as to obtain a plurality of single-channel images.
In another optional implementation manner, the calculating a local contrast between each MSER region and its background region, and determining whether to filter out the corresponding MSER region according to each local contrast includes:
the local contrast lc between each MSER region and its background region is calculated according to the following formula:

lc = \frac{\left|\sum_{i=1}^{n}(R_i+G_i+B_i)-\sum_{j=1}^{k}(R_j+G_j+B_j)\right|}{\max\left(\sum_{i=1}^{n}(R_i+G_i+B_i),\ \sum_{j=1}^{k}(R_j+G_j+B_j)\right)}

wherein n represents the number of pixel points corresponding to the MSER region, k represents the number of pixel points corresponding to the background region, R_i, G_i and B_i respectively represent the red, green and blue channel values of the image corresponding to the MSER region, i represents the i-th pixel point corresponding to the MSER region, and j represents the j-th pixel point corresponding to the background region;
and for each MSER area, if the local contrast of the MSER area is less than a first preset threshold, filtering the MSER area.
In another optional implementation manner, the determining the boundary key point of each MSER region includes:
aiming at each MSER region, setting the gray value of the pixel point of the MSER detected in the MSER region as 255 and setting the gray values of other pixel points as 0;
successively traversing each pixel point in the MSER area, and if the gray value of the pixel point is 255 and the gray value of at least one of the adjacent pixel points is 0, determining the pixel point as a contour point;
after all contour points of at least one MSER region are obtained, the contour points are compressed by the Douglas-Peucker algorithm and redundant points are removed, obtaining the boundary key points of the corresponding MSER region.
In another optional implementation manner, the aspect ratio, the area perimeter ratio, the convex hull area ratio and the stroke width area ratio of each remaining MSER region after filtering are used as classification screening features, and classification screening is performed on each remaining MSER region after filtering through a trained SVM.
In another alternative implementation manner, in the process of training the SVM, the number ratio of positive samples to negative samples is controlled at 1:3; the positive samples are letters and Arabic numerals corresponding to the translation target language, and the negative samples are non-text regions obtained by manually identifying and marking the extracted MSER regions after the MSER regions of the original image and the plurality of single-channel images are respectively extracted.
In another optional implementation manner, the performing text line distinction on each text region according to a distance between every two adjacent text regions in the vertical direction includes:
the distance d_v between every two vertically adjacent text regions is calculated according to the following formula:

d_v = \frac{b_1 - t_2}{h_2}

wherein b_1 represents the bottom Y-axis coordinate of the upper of the two vertically adjacent text regions, t_2 represents the top Y-axis coordinate of the lower text region, and h_2 represents the height of the lower text region;

for every two vertically adjacent text regions, if the distance d_v between them is greater than a second preset threshold, the two text regions are classified into the same text line; otherwise, they are classified into different text lines.
In another optional implementation manner, the classifying the text regions of the same text line according to the distance between every two adjacent text regions on the same text line includes:
the distance d_h between every two adjacent text regions on the same text line is calculated according to the following formula:

d_h = \frac{\bar{w}}{\Delta d}

wherein w̄ represents the average width of all letters on the text line, and Δd represents the distance difference in the X-axis direction between the adjacent letters of the two adjacent text regions;

for every two adjacent text regions on the same text line, if the distance d_h between them is greater than a third preset threshold, the two text regions are classified into the same class; otherwise, they are classified into different classes.
In another alternative implementation, when an original image is cut out from a video, a video picture is cut out according to frames, and two thirds of the area below the cut-out video image is used as the original image.
The invention has the beneficial effects that:
1. Before the text is recognized, a plurality of single-channel images are introduced, which effectively exploits the colour information of the original image and provides richer basic data for text region extraction. Local contrast is then introduced to threshold-filter the MSER regions extracted from the original image and the single-channel images, which improves the accuracy of text region extraction; since the time complexity of local contrast filtering is linear, the filtering time is short, providing a foundation for real-time translation of video subtitles. Boundary key points are introduced as SVM classification and screening features, so non-text interference in the MSER regions can be eliminated even when the image is rotated or scaled, which improves the robustness of text region extraction to image rotation and scaling. After the MSER regions are threshold-filtered on local contrast, the remaining MSER regions are screened by an SVM classifier, further improving the accuracy of text region extraction. For the text regions obtained after training and screening, the invention adopts a two-layer text classification algorithm of text line classification in the vertical direction and text region classification within the same text line in the horizontal direction, which greatly reduces the time complexity, improves the word recognition rate and provides a basis for real-time translation of video captions. The invention can therefore translate video captions in real time and accurately;
2. according to the method, the original image is subjected to sharpening pretreatment, so that the sharpened original image can enhance the contrast between the text area and the surrounding background, and the text detection is facilitated;
3. according to the method, the quantity ratio of the positive samples to the negative samples is controlled to be 1:3 during SVM training, so that the screening effect can be optimized, and the accuracy of text region acquisition is further improved;
4. according to the invention, when the original image is intercepted from the video, the video image is intercepted according to the frame, and then the partial area of the video image is intercepted to be used as the original image, so that the identification precision can be improved, and the detection time can be reduced.
Drawings
FIG. 1 is a flow chart of an embodiment of a real-time video caption translation method according to the present invention;
FIG. 2 is a diagram of a Laplace operation template according to the present invention;
FIG. 3 is a schematic diagram of boundary keypoints;
FIG. 4 is a text line constraint parameter specification diagram;
FIG. 5 is a letter width to space ratio statistical chart.
Detailed Description
In order to make the technical solutions in the embodiments of the present invention better understood and make the above objects, features and advantages of the embodiments of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
In the description of the present invention, unless otherwise specified and limited, it is to be noted that the term "connected" is to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, or a communication between two elements, or may be a direct connection or an indirect connection through an intermediate medium, and a specific meaning of the term may be understood by those skilled in the art according to specific situations.
Referring to fig. 1, a flowchart of an embodiment of a video subtitle real-time translation method according to the present invention is shown. The video subtitle real-time translation method can comprise the following steps:
and S101, performing multi-channel extraction on the original image intercepted from the video to obtain a plurality of single-channel images.
In this embodiment, the video resources may be divided into two types: one is local video which can be played off-line, and the other is online video which needs to be played on the internet. Aiming at a local video, corresponding software can be provided for a user, the software can comprise an offline translation database, when the user establishes connection between the local video and the software, the software can perform text recognition on subtitles in the local video according to the method in the patent, after the text recognition is completed, the software can automatically translate the recognized text by adopting the offline translation database, and the translation result is returned and transmitted to the local video for display; aiming at the online video, corresponding software can be provided for a user, a Web server can also be set up to provide Web online service for the user, when the user establishes a link between the online video and the Web server, the Web server can perform text recognition on subtitles in the online video according to the method in the patent, and after the text recognition is completed, the Web server can translate the recognized text and return the translation result to the online video for display.
In order to implement a real-time translation function of a video, a video image may be cut out by frames when an original image is cut out from the video, and in order to improve recognition accuracy and reduce detection time, the cut-out video image may be subjected to region cutting, for example, only the bottom two thirds of the region of the image is cut out as the original image.
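A minimal sketch of that cropping step, assuming the intercepted frame is a NumPy array as delivered by OpenCV:

```python
def crop_caption_area(frame):
    """Keep only the lower two-thirds of the frame, where captions normally appear."""
    h = frame.shape[0]
    return frame[h // 3:, :]
```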
After the original image is cut from the video, the original image may first be pre-processed, including sharpening and blurring. When the original image is subjected to sharpening preprocessing, sharpening processing can be performed according to formula (1):
g(x, y) = f(x, y) + c[∇²f(x, y)]    (1)

wherein f(x, y) represents the original image and g(x, y) represents the original image after sharpening preprocessing; the value of c depends on the template used for sharpening: if the template shown in fig. 2(a) or fig. 2(b) is used, c = -1; if the templates shown in fig. 2(c) are used, c = 1. The sharpened original image enhances the contrast between the text region and its surrounding background, which is more favourable for text detection.
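A minimal sketch of this sharpening step using OpenCV is given below; taking cv2.Laplacian's default kernel as the template of fig. 2(a)/(b) and therefore c = -1 is an assumption, not a verbatim reproduction of the patented implementation:

```python
import cv2
import numpy as np

def sharpen(image):
    """Laplacian sharpening g(x, y) = f(x, y) + c * Laplacian(f), equation (1).

    OpenCV's default Laplacian kernel has a negative centre, which is assumed to
    correspond to the templates of fig. 2(a)/(b), so c = -1 is used here."""
    laplacian = cv2.Laplacian(image, cv2.CV_64F)
    sharpened = image.astype(np.float64) - laplacian  # c = -1
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```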
When the original image is subjected to blur preprocessing, the original image may be subjected to gaussian filtering according to equation (2):
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}    (2)

wherein f(x) represents the original image after blur preprocessing, μ represents the mean of the normally distributed random variable x, and σ² represents its variance. By blurring the original image, the invention makes text regions under complex backgrounds more prominent, which is more favourable for text detection.
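A corresponding sketch of the blur preprocessing; the 5x5 kernel size and the sigma value are illustrative assumptions, since the text does not specify them:

```python
import cv2

def blur(image, sigma=1.0):
    """Gaussian low-pass filtering of the original image, equation (2)."""
    return cv2.GaussianBlur(image, (5, 5), sigma)
```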
After the original image is preprocessed, image extraction of six channels of R (red), G (green), B (blue), H (hue), S (saturation), and V (brightness) may be performed on the original image and the preprocessed original image, so as to obtain a plurality of single-channel images. The invention can effectively utilize the color information by extracting the multi-channel image, and provides richer basic data for extracting the text regions, thereby ensuring that the subtitles recognized and translated based on the text regions are more accurate.
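The six-channel extraction can be sketched as follows (assuming the captured frame is a BGR image as delivered by OpenCV); in the method it is applied to both the original image and its preprocessed versions:

```python
import cv2

def extract_channels(bgr_image):
    """Split a frame into the six single-channel images R, G, B, H, S and V."""
    b, g, r = cv2.split(bgr_image)
    h, s, v = cv2.split(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV))
    return [r, g, b, h, s, v]
```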
And S102, respectively extracting MSER areas of the original image and the single-channel images based on the MSER algorithm.
In this embodiment, in order to accelerate the extraction rate, the parameters involved in the MSER algorithm are set as follows: the threshold step is set to 5, the minimum MSER area to 80 and the maximum MSER area to 14400. Since the MSER algorithm is an image extraction method well known in the art, its specific extraction process is not described in detail here.
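With those parameter settings, MSER extraction on one single-channel image can be sketched with OpenCV's MSER implementation, used here as a stand-in for whatever MSER implementation the method actually employs:

```python
import cv2

# Parameters from the text: threshold step (delta) 5, minimum area 80, maximum area 14400.
mser = cv2.MSER_create(5, 80, 14400)

def extract_mser_regions(gray_image):
    """Return the MSER point sets and their bounding boxes for one single-channel image."""
    regions, bboxes = mser.detectRegions(gray_image)
    return regions, bboxes
```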
Step S103, local contrast text features are introduced, the local contrast between each MSER region and the background region of each MSER region is calculated, and whether the corresponding MSER region is filtered or not is determined according to each local contrast.
In this embodiment, not all MSER regions extracted in step S102 are text regions. The applicant has found that text to be recognized must have a certain contrast with its background, and that the contrast between a text region and its background differs from that between a non-text region and its background, the former being higher than the latter. Based on this characteristic, the invention introduces the local contrast feature to filter out non-text regions. First, the local contrast lc between each MSER region and its background region can be calculated using the following equation (3):

lc = \frac{\left|\sum_{i=1}^{n}(R_i+G_i+B_i)-\sum_{j=1}^{k}(R_j+G_j+B_j)\right|}{\max\left(\sum_{i=1}^{n}(R_i+G_i+B_i),\ \sum_{j=1}^{k}(R_j+G_j+B_j)\right)}    (3)

wherein n represents the number of pixel points corresponding to the MSER region, k represents the number of pixel points corresponding to the background region, R_i, G_i and B_i respectively represent the red, green and blue channel values of the image corresponding to the MSER region, i represents the i-th pixel point corresponding to the MSER region, and j represents the j-th pixel point corresponding to the background region.
Then, whether to filter out each MSER region can be determined according to its local contrast: for each MSER region, if its local contrast is smaller than a first preset threshold, the region is filtered out; otherwise it is kept. The applicant has found that the local contrast lc of non-text regions is generally less than 0.35, i.e. the first preset threshold may be 0.35. Although the multi-channel image extraction of step S101 provides a richer data basis and yields more text regions, it also introduces more non-text regions; filtering the MSER regions by local contrast eliminates part of these non-text interference items, so the accuracy of text region extraction can be improved. In addition, the time complexity of filtering non-text regions by local contrast is linear, so the time required for filtering is short, which provides a foundation for real-time translation of the video subtitles.
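A minimal sketch of the local-contrast filter of equation (3); how the background pixels around an MSER are collected is an implementation detail not spelled out here, so the function simply takes both pixel sets as arguments:

```python
import numpy as np

def local_contrast(region_rgb, background_rgb):
    """Local contrast lc between an MSER region and its background, equation (3).

    Both arguments are (N, 3) arrays holding the R, G, B values of the
    region pixels and of the background pixels respectively."""
    region_sum = float(np.sum(region_rgb))
    background_sum = float(np.sum(background_rgb))
    return abs(region_sum - background_sum) / max(region_sum, background_sum)

def keep_mser(region_rgb, background_rgb, threshold=0.35):
    """Keep the MSER only if its local contrast reaches the first preset threshold (0.35)."""
    return local_contrast(region_rgb, background_rgb) >= threshold
```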
And step S104, introducing the text characteristics of the boundary key points, and determining the boundary key points of each MSER area.
In this embodiment, when determining the boundary key points of each MSER region, image binarization is performed first: for each MSER region, the gray values of the pixel points detected as belonging to the MSER are set to 255 and the gray values of the other pixel points are set to 0. Then each pixel point in the MSER region is traversed in turn, from top to bottom and from left to right; if the gray value p(X, Y) of a pixel point is 255 and at least one of its neighbours p(X+1, Y), p(X-1, Y), p(X, Y+1) and p(X, Y-1) is 0, the pixel point is determined to be a contour point, where X represents the X-axis coordinate of the pixel point and Y represents its Y-axis coordinate.
After all contour points of at least one MSER region are obtained, the contour points are compressed with the Douglas-Peucker algorithm and redundant points are removed, yielding the boundary key points of the corresponding MSER region, as shown in fig. 3. The compression may be performed after the contour points of a single MSER region are obtained, after those of a preset number of MSER regions are obtained, or after those of all MSER regions are obtained; in each case the boundary key points are the contour points remaining after the redundant points are removed. The applicant has found that the number of boundary key points k of an English letter generally lies between 5 and 16, i.e. the preset numerical range is 5 to 16; when the number of boundary key points of an MSER region is less than 5 or greater than 16, the region can be determined to be a non-text region when translating English.
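The contour tracing and Douglas-Peucker compression can be sketched as follows; cv2.findContours stands in for the pixel-by-pixel neighbour test described above (OpenCV 4 is assumed), and the compression tolerance epsilon is an assumed value, not one given in the text:

```python
import cv2
import numpy as np

def boundary_keypoints(mser_points, image_shape, epsilon=2.0):
    """Binarize one MSER (its pixels set to 255, all others 0), trace the contour,
    then compress it with the Douglas-Peucker algorithm (cv2.approxPolyDP)."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    mask[mser_points[:, 1], mser_points[:, 0]] = 255  # MSER points are (x, y) pairs
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return np.empty((0, 2), dtype=np.int32)
    contour = max(contours, key=cv2.contourArea)
    keypoints = cv2.approxPolyDP(contour, epsilon, True)
    return keypoints.reshape(-1, 2)
```

For English text the description reports 5 to 16 boundary key points per character, so a region whose key-point count falls outside that range can be rejected outright.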
And S105, taking the boundary key points as classification screening characteristics, and performing classification screening on each MSER region left after filtering through the trained SVM to obtain a text region.
In this embodiment, after the threshold filtering of step S103 is completed, in addition to using the boundary key points as classification and screening features, the invention selects the aspect ratio (w/h), the area-perimeter ratio (a²/p), the convex hull area ratio (a_c/a) and the stroke width area ratio (w_s/a) of each MSER region remaining after filtering as classification and screening features, thereby obtaining the text regions, where w represents the width of the MSER region, h represents its height, a² denotes the square of the area of the MSER region, p denotes its perimeter, a_c denotes the area of its convex hull (the convex hull is a common term in image processing and is not described here), a denotes the area of the MSER region, and w_s denotes the stroke width of the image. In order to optimize the screening effect, the training parameters may be set as follows: the kernel function adopts the radial basis function (RBF), and the number of iterations is set to 100. The invention classifies the MSER regions remaining after filtering with the trained SVM classifier, which improves the accuracy of text region acquisition. In addition, in order to achieve the best SVM classification effect, the number ratio of positive samples to negative samples is controlled at 1:3; the positive samples are letters corresponding to the translation target language (for example, when the target language is English, 'A'-'Z' and 'a'-'z') and Arabic numerals (the digits '0'-'9'), and the negative samples are non-text regions obtained by manually identifying and marking the MSER regions extracted in step S102 from the original image and the single-channel images. In this way the screening effect can be optimized and the accuracy of text region acquisition further improved.
Among the contour pixel points of a region, a subset of points connected in a certain order can restore the region to the greatest extent. Because the boundary key points of an image are unaffected even when the image is rotated or scaled, introducing them as classification and screening features allows non-text interference items in the MSER regions to be eliminated even under rotation and scaling, which improves the robustness of text region extraction to image rotation, size change and the like.
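A sketch of the classification stage under the stated settings (RBF kernel, 100 iterations, 1:3 positive-to-negative ratio); the `Region`-style attribute names used for feature assembly are hypothetical, since the patent does not spell out a data structure, and the a²/p form of the area-perimeter ratio follows the reconstruction above:

```python
import numpy as np
from sklearn.svm import SVC

def region_features(region):
    """Assemble the five classification and screening features for one candidate region.

    `region` is assumed to expose keypoint_count, width, height, area, perimeter,
    hull_area and stroke_width; extracting those measurements is omitted here."""
    return np.array([
        region.keypoint_count,                 # boundary key points
        region.width / region.height,          # aspect ratio w/h
        region.area ** 2 / region.perimeter,   # area-perimeter ratio (as reconstructed)
        region.hull_area / region.area,        # convex hull area ratio a_c/a
        region.stroke_width / region.area,     # stroke width area ratio w_s/a
    ])

def train_text_classifier(positive_features, negative_features):
    """RBF-kernel SVM; the caller keeps positives:negatives at 1:3 as the text specifies."""
    X = np.vstack([positive_features, negative_features])
    y = np.hstack([np.ones(len(positive_features)), np.zeros(len(negative_features))])
    return SVC(kernel="rbf", max_iter=100).fit(X, y)
```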
And S106, distinguishing each text region according to the distance between every two adjacent text regions in the vertical direction, and classifying each text region in the same text line according to the distance between every two adjacent text regions in the same text line.
In this embodiment, referring to fig. 4 and 5, when text lines are distinguished according to the distance between every two vertically adjacent text regions, the distance d_v between every two vertically adjacent text regions may first be calculated according to formula (4):

d_v = \frac{b_1 - t_2}{h_2}    (4)

wherein b_1 represents the bottom Y-axis coordinate of the upper of the two vertically adjacent text regions, t_2 represents the top Y-axis coordinate of the lower text region, and h_2 represents the height of the lower text region, as shown in fig. 4. Then, for every two vertically adjacent text regions, if the distance d_v between them is greater than a second preset threshold, the two regions are classified into the same text line; otherwise they are classified into different text lines. The applicant has found that when the distance d_v between two vertically adjacent text regions is greater than 0.62, the two regions lie on the same text line, so the second preset threshold may be 0.62.
In addition, when the words of the same text line are distinguished according to the distance between every two adjacent text regions on that line, the distance d_h between every two adjacent text regions on the same text line may first be calculated according to formula (5):

d_h = \frac{\bar{w}}{\Delta d}    (5)

wherein w̄ represents the average width of all letters on the text line, and Δd represents the distance difference in the X-axis direction between the adjacent letters of the two adjacent text regions, i.e. the gap between the two adjacent letters. Then, for every two adjacent text regions on the same text line, if the distance d_h between them is greater than a third preset threshold, the adjacent letters in the two regions belong to the same word and the two regions are classified into the same class; otherwise the adjacent letters belong to different words and the two regions are classified into different classes. Because different letters of the same word may be assigned to different text regions when the text regions are obtained, and the spacing between words differs markedly from the spacing between letters within a word, as shown in fig. 5, distinguishing the words on a text line after the text lines have been distinguished improves word recognition accuracy. The invention adopts a two-layer text classification algorithm (text line classification in the vertical direction and text region classification within the same text line in the horizontal direction), which greatly reduces the time complexity (comparable algorithms have time complexity O(n²), whereas the time complexity of the invention is O(n log₂ n)), improves the word recognition rate and provides a foundation for real-time translation of the video subtitles. The applicant has found that when the distance d_h between two adjacent text regions is greater than 2.33, the adjacent letters in the two regions belong to the same word, so the third preset threshold may be 2.33.
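The two grouping tests reduce to two small predicates; the thresholds 0.62 and 2.33 are the values reported above:

```python
def same_text_line(b1, t2, h2, threshold=0.62):
    """Vertical grouping, equation (4): d_v = (b1 - t2) / h2.

    b1: bottom Y of the upper region, t2: top Y of the lower region,
    h2: height of the lower region."""
    return (b1 - t2) / h2 > threshold

def same_word(mean_letter_width, delta_d, threshold=2.33):
    """Horizontal grouping, equation (5): d_h = mean letter width / letter gap delta_d."""
    return mean_letter_width / delta_d > threshold
```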
And S107, performing real-time translation on the video subtitles based on the classified text regions.
In this embodiment, after obtaining each classified text region, text recognition may be performed by using an open-source framework Tesseract, and meanwhile, for unified management of the system, the Tesseract and the OpenCV image processing runtime library need to be integrated. After the text is recognized, the text can be transmitted to an interface provided by Google translation in the form of letter strings, a translation result is obtained and finally displayed to a user, and therefore real-time translation of the video subtitles is achieved.
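Recognition and translation of one classified text region can be sketched as follows; pytesseract is used here as a stand-in binding for the Tesseract engine, and `translate` is a placeholder for whichever translation interface (offline database or the Google translation interface) is wired in:

```python
import pytesseract  # Python binding for the Tesseract OCR engine

def recognize_and_translate(text_region_image, translate):
    """OCR one classified text region and pass the recognized string to a translation backend."""
    recognized = pytesseract.image_to_string(text_region_image).strip()
    return translate(recognized) if recognized else ""
```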
It can be seen from the above embodiments that, before the text is recognized, the invention introduces a plurality of single-channel images, effectively exploiting the colour information of the original image and providing richer basic data for text region extraction. Local contrast text features are then introduced and the MSER regions extracted from the original image and the single-channel images are threshold-filtered, which improves the accuracy of text region extraction; the time complexity of local contrast filtering is linear, so the filtering time is short, providing a basis for real-time translation of video subtitles. Boundary key points are introduced as SVM classification and screening features, so non-text interference in the MSER regions can be eliminated even when the image is rotated or scaled, which improves the robustness of text region extraction to image rotation and scaling. After the MSER regions are threshold-filtered on local contrast, the remaining regions are screened by the trained SVM classifier to obtain the text regions, and for the text regions obtained after screening the invention adopts a two-layer text classification algorithm of text line classification in the vertical direction and text region classification within the same text line in the horizontal direction, which greatly reduces the time complexity, improves the word recognition rate and provides a foundation for real-time translation of the video captions. The invention can therefore translate video captions in real time and accurately.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for translating video subtitles in real time is characterized by comprising the following steps:
performing multi-channel extraction on an original image intercepted from a video to obtain a plurality of single-channel images;
extracting MSER regions of the original image and the single-channel images respectively, based on the maximally stable extremal regions (MSER) algorithm;
introducing local contrast text features, calculating the local contrast between each MSER region and the background region thereof, and determining whether to filter the corresponding MSER region according to each local contrast;
introducing text features of the boundary key points, and determining the boundary key points of each MSER region;
classifying and screening all MSER regions left after filtering by using the boundary key points as classification and screening characteristics through a trained Support Vector Machine (SVM) to obtain text regions;
according to the distance between every two adjacent text regions in the vertical direction, distinguishing each text region, and according to the distance between every two adjacent text regions in the same text line, classifying each text region in the same text line;
and performing real-time translation on the video subtitles based on the classified text regions.
2. The method for real-time translation of video subtitles according to claim 1, wherein before performing multi-channel extraction on an original image intercepted from a video to obtain a plurality of single-channel images, the method further comprises: preprocessing the original image including sharpening and blurring.
3. The method for real-time translation of video subtitles according to claim 2, wherein the step of performing multi-channel extraction on the original image captured from the video to obtain a plurality of single-channel images comprises: and respectively carrying out R, G, B, H, S, V six-channel image extraction on the original image and the preprocessed original image so as to obtain a plurality of single-channel images.
4. The method of claim 1, wherein the calculating the local contrast between each MSER region and its background region, and determining whether to filter out the corresponding MSER region according to each local contrast comprises:
the local contrast lc between each MSER region and its background is calculated according to the following formula:
lc = \frac{\left|\sum_{i=1}^{n}(R_i+G_i+B_i)-\sum_{j=1}^{k}(R_j+G_j+B_j)\right|}{\max\left(\sum_{i=1}^{n}(R_i+G_i+B_i),\ \sum_{j=1}^{k}(R_j+G_j+B_j)\right)}

wherein n represents the number of pixel points corresponding to the MSER region, k represents the number of pixel points corresponding to the background region, R_i, G_i and B_i respectively represent the red, green and blue channel values of the image corresponding to the MSER region, i represents the i-th pixel point corresponding to the MSER region, and j represents the j-th pixel point corresponding to the background region;
for each MSER area, if the local contrast of the MSER area is smaller than a first preset threshold, filtering the MSER area.
5. The method for real-time translation of video subtitles according to claim 1, wherein the determining the key points of the boundary of each MSER region comprises:
aiming at each MSER region, setting the gray value of the pixel point of the MSER detected in the MSER region as 255 and setting the gray values of other pixel points as 0;
successively traversing each pixel point in the MSER area, and if the gray value of the pixel point is 255 and the gray value of at least one of the adjacent pixel points is 0, determining the pixel point as a contour point;
after all contour points of at least one MSER region are obtained, the contour points are compressed by the Douglas-Peucker algorithm and redundant points are removed, obtaining the boundary key points of the corresponding MSER region.
6. The method for real-time translation of video subtitles of claim 1, wherein the aspect ratio, the area-to-perimeter ratio, the convex hull area ratio and the stroke width-to-area ratio of each MSER region left after filtering are used as classification and screening features, and classification and screening are performed on each MSER region left after filtering through a trained SVM.
7. The method for real-time translation of video subtitles according to claim 6, wherein in the process of training the SVM, the number ratio of positive samples to negative samples is controlled at 1:3; the positive samples are letters and Arabic numerals corresponding to the translation target language, and the negative samples are non-text regions obtained by manually identifying and marking the extracted MSER regions after the MSER regions of the original image and the plurality of single-channel images are respectively extracted.
8. The method for translating video subtitles in real time according to claim 1, wherein the text line distinguishing of each text region according to the distance between every two adjacent text regions in the vertical direction comprises:
the distance d_v between every two vertically adjacent text regions is calculated according to the following formula:

d_v = \frac{b_1 - t_2}{h_2}

wherein b_1 represents the bottom Y-axis coordinate of the upper of the two vertically adjacent text regions, t_2 represents the top Y-axis coordinate of the lower text region, and h_2 represents the height of the lower text region;

for every two vertically adjacent text regions, if the distance d_v between them is greater than a second preset threshold, the two text regions are classified into the same text line; otherwise, they are classified into different text lines.
9. The method for real-time translation of video subtitles according to claim 1, wherein the classifying the text regions of the same text line according to the distance between every two adjacent text regions on the same text line comprises:
the distance d_h between every two adjacent text regions on the same text line is calculated according to the following formula:

d_h = \frac{\bar{w}}{\Delta d}

wherein w̄ represents the average width of all letters on the text line, and Δd represents the distance difference in the X-axis direction between the adjacent letters of the two adjacent text regions;

for every two adjacent text regions on the same text line, if the distance d_h between them is greater than a third preset threshold, the two text regions are classified into the same class; otherwise, they are classified into different classes.
10. The video subtitle real-time translation method according to claim 1, wherein when an original image is cut out from the video, a video picture is cut out by frames, and the lower two-thirds of the cut-out video image is used as the original image.
CN201710345936.1A 2017-05-17 2017-05-17 Video caption real time translating method Pending CN107145888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710345936.1A CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710345936.1A CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Publications (1)

Publication Number Publication Date
CN107145888A true CN107145888A (en) 2017-09-08

Family

ID=59778166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710345936.1A Pending CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Country Status (1)

Country Link
CN (1) CN107145888A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520254A (en) * 2018-03-01 2018-09-11 腾讯科技(深圳)有限公司 A kind of Method for text detection, device and relevant device based on formatted image
CN109284751A (en) * 2018-10-31 2019-01-29 河南科技大学 The non-textual filtering method of text location based on spectrum analysis and SVM
CN111797632A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN113287319A (en) * 2019-01-09 2021-08-20 奈飞公司 Optimizing encoding operations in generating buffer-constrained versions of media subtitles

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054271A (en) * 2009-11-02 2011-05-11 富士通株式会社 Text line detection method and device
CN102542268A (en) * 2011-12-29 2012-07-04 中国科学院自动化研究所 Method for detecting and positioning text area in video
CN102750540A (en) * 2012-06-12 2012-10-24 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN103310439A (en) * 2013-05-09 2013-09-18 浙江大学 Method for detecting maximally stable extremal region of image based on scale space
WO2015105755A1 (en) * 2014-01-08 2015-07-16 Qualcomm Incorporated Processing text images with shadows
CN105825216A (en) * 2016-03-17 2016-08-03 中国科学院信息工程研究所 Method of locating text in complex background image
CN105868758A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for detecting text area in image and electronic device
US9576196B1 (en) * 2014-08-20 2017-02-21 Amazon Technologies, Inc. Leveraging image context for improved glyph classification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054271A (en) * 2009-11-02 2011-05-11 富士通株式会社 Text line detection method and device
CN102542268A (en) * 2011-12-29 2012-07-04 中国科学院自动化研究所 Method for detecting and positioning text area in video
CN102750540A (en) * 2012-06-12 2012-10-24 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN103310439A (en) * 2013-05-09 2013-09-18 浙江大学 Method for detecting maximally stable extremal region of image based on scale space
WO2015105755A1 (en) * 2014-01-08 2015-07-16 Qualcomm Incorporated Processing text images with shadows
US9576196B1 (en) * 2014-08-20 2017-02-21 Amazon Technologies, Inc. Leveraging image context for improved glyph classification
CN105868758A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for detecting text area in image and electronic device
CN105825216A (en) * 2016-03-17 2016-08-03 中国科学院信息工程研究所 Method of locating text in complex background image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
He Dongjian: "Digital Image Processing" (《数字图像处理》), 28 February 2015 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520254A (en) * 2018-03-01 2018-09-11 腾讯科技(深圳)有限公司 A kind of Method for text detection, device and relevant device based on formatted image
CN109284751A (en) * 2018-10-31 2019-01-29 河南科技大学 The non-textual filtering method of text location based on spectrum analysis and SVM
CN113287319A (en) * 2019-01-09 2021-08-20 奈飞公司 Optimizing encoding operations in generating buffer-constrained versions of media subtitles
CN113287319B (en) * 2019-01-09 2024-05-14 奈飞公司 Method and apparatus for optimizing encoding operations
CN111797632A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN111797632B (en) * 2019-04-04 2023-10-27 北京猎户星空科技有限公司 Information processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN109961049B (en) Cigarette brand identification method under complex scene
CN108562589B (en) Method for detecting surface defects of magnetic circuit material
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN113724231B (en) Industrial defect detection method based on semantic segmentation and target detection fusion model
CN104408449B (en) Intelligent mobile terminal scene literal processing method
CN108305253B (en) Pathological image classification method based on multiple-time rate deep learning
CN106875546A (en) A kind of recognition methods of VAT invoice
CN107145888A (en) Video caption real time translating method
CN106570510A (en) Supermarket commodity identification method
CN111046872A (en) Optical character recognition method
CN110598566A (en) Image processing method, device, terminal and computer readable storage medium
CN113298809B (en) Composite material ultrasonic image defect detection method based on deep learning and superpixel segmentation
CN110987886A (en) Full-automatic microscopic image fluorescence scanning system
CN115082776A (en) Electric energy meter automatic detection system and method based on image recognition
CN109741273A (en) A kind of mobile phone photograph low-quality images automatically process and methods of marking
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN116416624A (en) Document electronization method and device based on layout correction and storage medium
CN113392819B (en) Batch academic image automatic segmentation and labeling device and method
CN106295627A (en) For identifying the method and device of word psoriasis picture
CN113065404B (en) Method and system for detecting train ticket content based on equal-width character segments
CN114419008A (en) Image quality evaluation method and system
CN111738310B (en) Material classification method, device, electronic equipment and storage medium
CN110276260B (en) Commodity detection method based on depth camera
CN112348026A (en) Magnetic hard disk sequence code identification method based on machine vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170908