CN108038458B - Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram - Google Patents

Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Info

Publication number
CN108038458B
Authority
CN
China
Prior art keywords
video frame
convolution
calculating
map
degree direction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711381971.5A
Other languages
Chinese (zh)
Other versions
CN108038458A (en)
Inventor
黄晓冬 (Huang Xiaodong)
王勤 (Wang Qin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN201711381971.5A priority Critical patent/CN108038458B/en
Publication of CN108038458A publication Critical patent/CN108038458A/en
Application granted granted Critical
Publication of CN108038458B publication Critical patent/CN108038458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour

Abstract

A method for automatically acquiring outdoor scene text in video based on a feature summary map comprises the following steps. First, a video frame image containing scene text is acquired and a video frame feature summary map is generated from the RGB color space of the video frame image: four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions are extracted from the RGB color space to obtain four direction feature vectors representing the color space; ten saliency maps representing the video frame in different directions are then obtained and fused to produce the video frame feature summary map. Next, K-means color clustering is performed based on the video frame feature summary map and the RGB color space to obtain four classes of results representing four regions: background, foreground characters, character outlines and noise. Connected-domain analysis is then carried out on the four classes of results, the background and noise regions are deleted, and the final outdoor scene text is obtained automatically. The method has simple operation steps, is easy to compute, supports real-time recognition and acquisition of outdoor scene text, and has good prospects for popularization and application.

Description

Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram
Technical Field
The invention relates to a digital image processing method, in particular to a method for automatically acquiring outdoor scene text in video based on a feature summary map, and belongs to the technical field of computer vision processing.
Background
In recent years, content-based image understanding techniques have gained increasing attention as digital image capture devices, smart phones, and practical vision systems have become popular. Because scene text in images/videos carries rich and direct semantic cues, it is considered an important object that must be detected and recognized. Text detection, localization, extraction and recognition are the main steps of acquiring text information; the operations of detection, localization and extraction are collectively referred to as text acquisition. Text acquisition is a crucial prerequisite for text recognition, because it removes the complex background and eliminates lighting effects, making recognition comparatively simple. However, owing to uneven indoor and outdoor illumination, image/video blur, background complexity, perspective distortion, color diversity, font complexity and varying stroke widths, acquiring text from video scenes remains very challenging.
At present, researchers at home and abroad have developed various methods for acquiring video scene text. The acquisition of scene text is generally divided into two steps: (1) scene text detection and localization, and (2) scene text extraction.
Prior-art scene text detection and localization methods can be divided into four categories: color-based, edge/gradient-based, texture-based, and stroke-based scene text detection. Wherein:
Color-based scene text detection: this is a traditional method that has been in use for more than twenty years; it is simple and efficient. A scene text detection algorithm based on a local threshold is usually adopted, and some researchers use the local threshold estimation of an improved Niblack algorithm, so that scene text with relatively simple backgrounds can be detected quickly. Researchers have also proposed generating color layers with the mean-shift algorithm to significantly improve the robustness of text detection against complex backgrounds. However, color-feature-based text detection encounters many difficulties when characters of multiple colors and non-uniform illumination are present in the video/image.
Edge/gradient-based scene text detection: assuming that text regions exhibit strong and symmetric changes against the background region, pixels with large, symmetric gradient values can be regarded as text pixels, so edge and gradient features can be used in scene text detection. Researchers have proposed scene text detection algorithms based on edge enhancement; such studies include clustering horizontally arranged "gradient vector streams" into text candidate regions according to spatial constraints of size, location and color distance. Researchers have further proposed scene text detection algorithms that combine gradient/edge features with various classifiers (e.g., artificial neural networks or the AdaBoost algorithm), and a neural-network-based text locator has been added on top of the AdaBoost classifier. However, such algorithms have difficulty detecting scene text in complex backgrounds with strong gradients.
Texture-based scene text detection: when character regions are dense, scene text can be regarded as a kind of texture. Many current methods use texture feature extraction to detect scene text, including the Fourier transform, the Discrete Cosine Transform (DCT), wavelets, the Local Binary Pattern (LBP), the Histogram of Oriented Gradients (HOG), and so on. Although texture features can detect dense characters efficiently, such approaches may fail on sparse characters. Researchers have therefore proposed scene text detection methods based on Fourier frequency-domain features and on frequency-domain DCT coefficients, respectively. Recently, a scene text detection algorithm based on the Local Haar Binary Pattern feature has been proposed. However, when the background is complicated, many background noises exhibit text-like textures, which reduces the detection accuracy of these methods.
Stroke-based scene text detection: the Stroke Width Transform (SWT) is used to compute the most likely stroke pixel width. Stroke-based features have proven very effective for detecting high-resolution scene text, especially when combined with appropriate learning methods, or when stroke features are fused with other features such as the Edge Orientation Variance (EOV), Opposite Edge Pairs (OEPs), or spatio-temporal analysis. Recently, bandlet-based edge detectors have been introduced to improve the SWT, enhancing the edge differences of scene text and eliminating noisy edges, so that the SWT can also be used to detect low-resolution text. However, when scene text contains characters of various sizes and fonts, the detection accuracy of this method drops significantly.
Prior-art scene text extraction methods comprise at least three categories of algorithms: threshold-based, color-based, and character-stroke-based. Wherein:
Threshold-based text extraction: such methods fall into two subclasses: one uses a global threshold, such as Otsu's algorithm; the other uses a local threshold. A multi-threshold algorithm has also been proposed, in which the second-stage threshold depends on the threshold of the first stage, significantly enhancing the extraction. However, because threshold-based methods do not consider the characteristics of scene text, their performance and applicability are unsatisfactory.
Color-based text extraction: several candidate binary images are first generated with k-means or another clustering algorithm, and a binary image is then selected on the basis of image analysis. These methods assume that the text color is consistent and introduce color clustering into scene text extraction. Their disadvantages are that, being global computations, they are sensitive to non-uniform illumination, and both the computational cost of analyzing multiple candidate images and the selection of the parameter k are cumbersome.
Character-stroke-based text extraction: two sets of asymmetric Gabor filters are used to extract texture direction and scale in the image, and these features represent the character edges most likely to enhance contrast. However, this algorithm is sensitive to the size of the extracted characters and is unsuitable for extracting scene text in video.
In summary, the prior-art techniques for scene text detection/localization and scene text extraction all have unsatisfactory aspects, so developing a better-performing method for acquiring scene text in video has become a new subject of great interest to technical personnel in the industry.
Disclosure of Invention
In view of this, an object of the present invention is to provide a method for automatically acquiring outdoor scene text in video based on a feature summary map, which overcomes the defects of the prior art and can correctly and completely acquire scene text under uneven illumination, blur or complex backgrounds, and under conditions such as perspective distortion, varied colors, complicated fonts, and unequal stroke widths.
In order to achieve the above object, the present invention provides a method for automatically acquiring outdoor scene text in video based on a feature summary map, characterized in that the method comprises the following operation steps:
step 1, acquire a video frame image containing scene text and generate a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extract four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors representing the RGB color space; then, compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; finally, fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which serves as the visual representation of the scene text for subsequent acquisition from the video, with background and noise interference removed to improve recognition accuracy;
step 2, automatically acquire the scene text: first, perform K-means color clustering based on the video frame feature summary map and the RGB color space, subdividing the video frame summary map into four classes of results representing the four regions of background, foreground characters, character outlines and noise; then perform connected-domain analysis on the four classes of results, delete the background and noise regions, and obtain the final scene text.
At present, acquiring outdoor video scene text under complex backgrounds and changing illumination is very difficult. The method is an innovative method for automatically acquiring outdoor scene text in video; its technical key is the construction of a brand-new video frame feature summary map, used as the visual representation and basis for automatically acquiring scene text in the video. Meanwhile, K-means color clustering is performed based on the video frame feature summary map and the hue, saturation and value (HSV) color space; connected-domain analysis based on character stroke width and on geometric shape is then performed respectively, and after the background and noise regions are deleted, the final video scene text can be obtained quickly and automatically.
Multiple simulation tests prove that the method overcomes the defects of the prior art: it can still quickly and accurately detect and extract scene text when the outdoor video exhibits complex backgrounds, perspective deformation, varied colors, uneven or strong illumination, complicated fonts and differing stroke widths. Its operation steps are simple, its computational complexity is low and it is easy to implement, so it can meet the requirements of real-time recognition and acquisition of outdoor scene text and has good prospects for popularization and application.
Drawings
Fig. 1 is a flowchart of the operation steps of the method of the present invention for automatically acquiring outdoor scene text in video based on a feature summary map.
Fig. 2 is a flowchart of the operations of step 1 of the method of the present invention.
Fig. 3 is a flowchart of the operations of step 2 of the method of the present invention.
Fig. 4(A), (B) and (C) are schematic diagrams showing, respectively, the original image, the video frame feature summary map, and the finally acquired scene text in an embodiment of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples.
Referring to fig. 1, the specific operation steps of the method of the present invention for automatically acquiring outdoor scene text in video based on a feature summary map are described as follows:
Step 1, acquire a video frame image containing scene text and generate a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extract four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors representing the RGB color space; then, compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; finally, fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which serves as the visual representation of the scene text for subsequent acquisition from the video, with background and noise interference removed to improve recognition accuracy.
Among the four convolution maps extracted from the RGB color space in the horizontal, vertical, 45-degree and 135-degree directions, the horizontal-direction convolution kernel used for the horizontal-direction convolution map is the computation template for calculating the horizontal-direction differential in the Sobel operator:
-1  0  +1
-2  0  +2
-1  0  +1
the vertical convolution kernel adopted by the vertical convolution graph is a calculation template for calculating vertical differential in a Sobel operator:
-1  -2  -1
 0   0   0
+1  +2  +1
the 45-degree direction convolution kernel adopted by the 45-degree convolution graph is a calculation template for calculating 45-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 45-degree-direction differential]
the 135-degree direction convolution kernel adopted by the 135-degree convolution graph is a calculation template for calculating 135-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 135-degree-direction differential]
the convolution kernel-based convolution graph feature extraction method has the characteristics of simple algorithm, high operation speed and contribution to engineering realization, and the extracted convolution features are not easily influenced by illumination change in an outdoor scene.
Step 2, automatically acquire the scene text: first, perform K-means color clustering based on the video frame feature summary map and the RGB color space, subdividing the video frame summary map into four classes of results representing the four regions of background, foreground characters, character outlines and noise; then perform connected-domain analysis on the four classes of results respectively, delete the background and noise regions, and obtain the final scene text.
The color clustering in the K-means clustering algorithm is realized by performing four-dimensional clustering in the four-dimensional space formed by the video frame feature summary map, hue, saturation and value, according to the included-angle cosine distance between each pixel and the four cluster centers. Because illumination changes drastically in outdoor environments, each character may present different colors in the video frame, which seriously affects the integrity of the extracted character strokes; clustering in these four dimensions, with an included-angle cosine distance function that, unlike the common Euclidean distance, does not focus on numerical magnitude differences, can therefore significantly reduce the influence of outdoor illumination changes on character color.
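The following sketch illustrates one way such a four-dimensional K-means with included-angle cosine distance could be run on per-pixel samples of [summary-map value, H, S, V]; the random initialization, iteration count and normalization are implementation assumptions, not taken from the patent.

```python
import numpy as np

def cosine_kmeans(samples: np.ndarray, k: int = 4, iters: int = 20,
                  seed: int = 0) -> np.ndarray:
    """K-means-style clustering with included-angle cosine distance.

    samples: (N, 4) array of [fsg, H, S, V] values per pixel.
    Returns an (N,) array of cluster labels in {0, ..., k-1}.
    """
    rng = np.random.default_rng(seed)
    # Normalize each sample so the cosine distance depends only on direction.
    unit = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-12)
    centers = unit[rng.choice(len(unit), size=k, replace=False)]
    for _ in range(iters):
        # Cosine distance = 1 - cosine similarity, so maximize similarity.
        sim = unit @ centers.T                      # (N, k)
        labels = np.argmax(sim, axis=1)             # nearest center per pixel
        for j in range(k):
            members = unit[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / (np.linalg.norm(c) + 1e-12)
    return labels
```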
Referring to fig. 2, the specific operations of step 1 of the above two operation steps are described in detail:
(11) First, extract on the red channel the horizontal-direction convolution map Rh, the vertical-direction convolution map Rv, the 45-degree-direction convolution map Rl and the 135-degree-direction convolution map Rr; extract on the green channel the horizontal-direction convolution map Gh, the vertical-direction convolution map Gv, the 45-degree-direction convolution map Gl and the 135-degree-direction convolution map Gr; and extract on the blue channel the horizontal-direction convolution map Bh, the vertical-direction convolution map Bv, the 45-degree-direction convolution map Bl and the 135-degree-direction convolution map Br. Then arrange the convolution maps of each direction according to the RGB color space to obtain the four direction feature vectors representing the RGB color space: the horizontal-direction feature vector H = {Rh, Gh, Bh}, the vertical-direction feature vector V = {Rv, Gv, Bv}, the 45-degree-direction feature vector L = {Rl, Gl, Bl}, and the 135-degree-direction feature vector R = {Rr, Gr, Br}.
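As an illustrative sketch of sub-step (11), reusing the hypothetical directional_convolutions helper from the earlier kernel sketch, each RGB channel is filtered in the four directions and the results are stacked into the four three-channel direction feature vectors H, V, L and R.

```python
import numpy as np

def direction_feature_vectors(rgb: np.ndarray) -> dict:
    """rgb: (H, W, 3) image. Returns four (H, W, 3) direction feature maps."""
    red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    r = directional_convolutions(red)    # Rh, Rv, Rl, Rr
    g = directional_convolutions(green)  # Gh, Gv, Gl, Gr
    b = directional_convolutions(blue)   # Bh, Bv, Bl, Br
    stack = lambda d: np.dstack([r[d], g[d], b[d]])
    return {
        "H": stack("h"),   # {Rh, Gh, Bh}
        "V": stack("v"),   # {Rv, Gv, Bv}
        "L": stack("l"),   # {Rl, Gl, Bl}
        "R": stack("r"),   # {Rr, Gr, Br}
    }
```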
(12) Compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; this retains multiple edge features in the chosen directions while removing background and noise interference in other directions, so that stroke features of the scene text in multiple directions are obtained, which facilitates automatic extraction of the scene text. Step (12) is subdivided into the following operations:
(120) According to the formula Shh = {Rh, Gh, Bh} × {Rh, Gh, Bh}, calculate the square product of the horizontal-direction feature vector to obtain the horizontal-direction saliency map Shh, which preserves and strengthens the edge features in the horizontal direction and weakens the edge features in other directions.
(121) According to the formula Svv = {Rv, Gv, Bv} × {Rv, Gv, Bv}, calculate the square product of the vertical-direction feature vector to obtain the vertical-direction saliency map Svv, which preserves and strengthens the edge features in the vertical direction and weakens the edge features in other directions.
(122) According to the formula Sll = {Rl, Gl, Bl} × {Rl, Gl, Bl}, calculate the square product of the 45-degree-direction feature vector to obtain the 45-degree-direction saliency map Sll, which preserves and strengthens the edge features in the 45-degree direction and weakens the edge features in other directions.
(123) According to the formula Srr = {Rr, Gr, Br} × {Rr, Gr, Br}, calculate the square product of the 135-degree-direction feature vector to obtain the 135-degree-direction saliency map Srr, which preserves and strengthens the edge features in the 135-degree direction and weakens the edge features in other directions.
(124) According to the formula Shv = {Rh, Gh, Bh} × {Rv, Gv, Bv}, calculate the product of the horizontal-direction and vertical-direction feature vectors to obtain the horizontal-vertical saliency map Shv, which preserves and strengthens the edge features in the horizontal and vertical directions and weakens the edge features in other directions.
(125) According to the formula Shl = {Rh, Gh, Bh} × {Rl, Gl, Bl}, calculate the product of the horizontal-direction and 45-degree-direction feature vectors to obtain the horizontal/45-degree saliency map Shl, which preserves and strengthens the edge features in the horizontal and 45-degree directions and weakens the edge features in other directions.
(126) According to the formula Shr = {Rh, Gh, Bh} × {Rr, Gr, Br}, calculate the product of the horizontal-direction and 135-degree-direction feature vectors to obtain the horizontal/135-degree saliency map Shr, which preserves and strengthens the edge features in the horizontal and 135-degree directions and weakens the edge features in other directions.
(127) According to the formula Svl = {Rv, Gv, Bv} × {Rl, Gl, Bl}, calculate the product of the vertical-direction and 45-degree-direction feature vectors to obtain the vertical/45-degree saliency map Svl, which preserves and strengthens the edge features in the vertical and 45-degree directions and weakens the edge features in other directions.
(128) According to the formula Svr = {Rv, Gv, Bv} × {Rr, Gr, Br}, calculate the product of the vertical-direction and 135-degree-direction feature vectors to obtain the vertical/135-degree saliency map Svr, which preserves and strengthens the edge features in the vertical and 135-degree directions and weakens the edge features in other directions.
(129) According to the formula Slr = {Rl, Gl, Bl} × {Rr, Gr, Br}, calculate the product of the 45-degree-direction and 135-degree-direction feature vectors to obtain the 45-degree/135-degree saliency map Slr, which preserves and strengthens the edge features in the 45-degree and 135-degree directions and weakens the edge features in other directions.
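The following sketch illustrates sub-step (12). The "product" of two three-channel direction feature vectors is interpreted here as the per-pixel sum of channel-wise products (a per-pixel dot product), which is one plausible reading of the formulas above; the patent does not spell out how the three channels are reduced to a single saliency value.

```python
import itertools
import numpy as np

def saliency_maps(vecs: dict) -> dict:
    """vecs: output of direction_feature_vectors(); keys 'H', 'V', 'L', 'R'.

    Returns the ten single-channel saliency maps Shh, Shv, ..., Slr as the
    per-pixel dot product of each unordered pair of direction vectors,
    pairs with repetition included (4 choose 2 with repetition = 10).
    """
    keys = ["H", "V", "L", "R"]
    maps = {}
    for a, b in itertools.combinations_with_replacement(keys, 2):
        name = "S" + a.lower() + b.lower()          # e.g. "Shv"
        maps[name] = np.sum(vecs[a] * vecs[b], axis=2)
    return maps
```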
(13) Fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which provides the visual representation for subsequently acquiring the scene text in the video, removes background and noise interference, and improves the accuracy and completeness of the automatic scene text acquisition result.
In step (13), the fusion calculation on the ten saliency maps in different directions is carried out as follows: based on the ten saliency maps in different directions extracted in step (12), compute for each pixel coordinate the maximum, minimum and average values across the maps, and combine the results to obtain the final video frame summary map fsg. The operations of step (13) are described below:
(131) Select the minimum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the minimum feature saliency map Smin(x, y) = min(pi(x, y)), where pi(x, y) is the pixel value at coordinate (x, y) of each saliency map, the subscript i denotes the saliency map type, i ∈ {Shh, Svv, Sll, Srr, Shv, Shl, Shr, Svl, Svr, Slr}, and the function min is the operator taking the minimum of the pixels pi(x, y).
(132) Select the maximum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the maximum feature saliency map Smax(x, y) = max(pi(x, y)), where the function max is the operator taking the maximum of the pixels pi(x, y).
(133) Select the average value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the average feature saliency map Smean(x, y) = mean(pi(x, y)), where the function mean is the operator taking the average of the co-located pixels pi(x, y).
(134) In order to preserve, as far as possible, the completeness of the character edge features in all directions in the video frame summary map and to reduce the influence of the illumination changes that easily occur in outdoor video, perform the fusion calculation with the minimum, maximum and average feature saliency maps, which retain the edge features in different directions, according to the formula:
[formula image not reproduced: fusion of the minimum, maximum and average feature saliency maps into fsg]
a final video frame feature summary map fsg is obtained.
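The following sketch illustrates sub-step (13): the per-pixel minimum, maximum and mean over the ten saliency maps are computed as described, and, because the patent's final fusion formula is given only as an image, the three feature saliency maps are simply averaged here as a stand-in combination.

```python
import numpy as np

def feature_summary_map(sal: dict) -> np.ndarray:
    """sal: dict of the ten saliency maps; returns the summary map fsg."""
    stack = np.stack(list(sal.values()), axis=0)    # (10, H, W)
    s_min = stack.min(axis=0)
    s_max = stack.max(axis=0)
    s_mean = stack.mean(axis=0)
    # Stand-in fusion: equal-weight average of the three feature saliency maps
    # (the patent's exact combination formula is not reproduced in the text).
    fsg = (s_min + s_max + s_mean) / 3.0
    # Normalize to [0, 1] for the later clustering step.
    fsg = (fsg - fsg.min()) / (fsg.max() - fsg.min() + 1e-12)
    return fsg
```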
Referring to fig. 3, the specific operations of step 2 of the above two operation steps are described in detail:
(21) Based on the K-means clustering algorithm, perform color clustering on the video frame feature summary map together with the hue, saturation and value (HSV) color space: partition the video frame feature summary map into four classes of K-means color clustering results representing the four regions of background, foreground characters, character outlines and noise.
(22) Connected-domain processing based on stroke width: for the four classes of K-means color clustering results, calculate the stroke width of the edge pixels of each connected domain, analyze each connected domain on the basis of stroke width, and delete the background and noise regions. Step (22) includes the following operations.
(221) Based on the ten saliency maps of step 1, calculate the gradient direction angle θ of each pixel in the video frame summary map:
[formula image not reproduced: definition of the gradient direction angle θ]
(222) Delete the connected domains that touch the upper, lower, left and right boundaries of the image in the video summary map, because character regions do not appear at, or connect with, the boundary of the video image;
(223) Acquire the boundary pixels of each connected domain; for each boundary pixel, search forward along its gradient direction angle θ until another boundary pixel is found, and take the distance between the two boundary pixels as the stroke width of both pixels.
(224) First calculate the stroke widths of all boundary pixels of the same connected domain, then calculate the variance of these stroke widths. If the variance is less than 0.5, the boundary-pixel stroke widths of the connected domain are considered close to the actual value, and the connected domain is retained as a candidate character region (the basis of this operation is that the aspect ratios of Western and Chinese character regions are relatively constant, i.e., the stroke widths within a Western or Chinese character are close to one another); otherwise, regions of the connected domain with a larger aspect ratio are considered not to belong to characters and are deleted.
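The following sketch compresses operations (221) to (224) for one binary cluster image. The gradient direction angle θ is taken as an input map (how it is computed from the ten saliency maps follows the patent's formula, which is not reproduced here), each boundary pixel is traced along θ until another boundary pixel of the same connected domain is reached, and a connected domain is kept as a character candidate only if the variance of its boundary stroke widths is below 0.5 as stated in (224). The ray-length cap is a hypothetical safeguard, not part of the patent.

```python
import numpy as np
from scipy import ndimage

def stroke_width_filter(mask: np.ndarray, theta: np.ndarray,
                        max_len: int = 50, var_thresh: float = 0.5) -> np.ndarray:
    """Keep connected domains whose boundary stroke widths are consistent.

    mask: (H, W) binary image for one cluster; theta: per-pixel gradient angle (radians).
    """
    mask = mask.astype(bool)
    h, w = mask.shape
    labels, n = ndimage.label(mask)
    keep = np.zeros_like(mask)
    for comp in range(1, n + 1):
        comp_mask = labels == comp
        # (222) discard connected domains touching the image border.
        if comp_mask[0, :].any() or comp_mask[-1, :].any() \
           or comp_mask[:, 0].any() or comp_mask[:, -1].any():
            continue
        # Boundary pixels = component minus its erosion.
        boundary = comp_mask & ~ndimage.binary_erosion(comp_mask)
        widths = []
        for y, x in zip(*np.nonzero(boundary)):
            dy, dx = np.sin(theta[y, x]), np.cos(theta[y, x])
            for step in range(1, max_len):          # (223) trace along theta.
                yy, xx = int(round(y + dy * step)), int(round(x + dx * step))
                if not (0 <= yy < h and 0 <= xx < w) or not comp_mask[yy, xx]:
                    break
                if boundary[yy, xx]:                # reached the opposite edge.
                    widths.append(step)
                    break
        # (224) keep the domain if its stroke widths are consistent (variance < 0.5).
        if widths and np.var(widths) < var_thresh:
            keep |= comp_mask
    return keep
```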
(23) For the less noisy regions that remain after the processing of step (22), perform geometry-based connected-domain processing: calculate the area of each connected domain in the character image, i.e., the number of pixels it contains, and delete connected domains whose proportion is too small, which are regarded as noise regions, thereby improving the target image quality. The specific operation of this connected-domain processing is to calculate the major-axis length; if the major-axis length is greater than one third or less than one tenth of the image width of the video frame feature summary map, the connected domain is considered too large or too small, is deemed not to belong to a character region, and is deleted.
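The following sketch illustrates the major-axis test of step (23). The major-axis length is estimated from the eigenvalues of each connected domain's coordinate covariance (the equivalent-ellipse approximation), which is an implementation choice here rather than the patent's prescribed computation.

```python
import numpy as np
from scipy import ndimage

def geometry_filter(mask: np.ndarray, frame_width: int) -> np.ndarray:
    """Drop connected domains whose major axis is > width/3 or < width/10."""
    mask = mask.astype(bool)
    labels, n = ndimage.label(mask)
    keep = np.zeros_like(mask)
    for comp in range(1, n + 1):
        ys, xs = np.nonzero(labels == comp)
        coords = np.stack([ys, xs], axis=1).astype(np.float64)
        if len(coords) < 2:
            continue
        cov = np.cov(coords, rowvar=False)
        # Major-axis length ~ 4 * sqrt(largest eigenvalue): full axis of the
        # equivalent ellipse; an estimate, not the patent's exact formula.
        major = 4.0 * np.sqrt(np.max(np.linalg.eigvalsh(cov)))
        if frame_width / 10.0 <= major <= frame_width / 3.0:
            keep |= labels == comp
    return keep
```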
(24) Acquire the scene text region: analyze all connected domains in the four clustering results, merge the connected domains finally retained by all clusters into one image, and judge similar connected domains to be the same region according to two measures, the distance between connected domains and their stroke widths, thereby obtaining the final video scene text.
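Finally, a sketch of the merging rule in step (24): connected domains retained from the four clusters are grouped into one text region when both their centroid distance and their stroke widths are close. The specific thresholds below are hypothetical; the patent states only that distance and stroke width are the two measures.

```python
import numpy as np

def merge_regions(regions: list, dist_thresh: float = 30.0,
                  width_ratio: float = 1.5) -> list:
    """regions: list of dicts {'centroid': (y, x), 'stroke_width': float}.

    Greedily groups regions whose centroids lie within dist_thresh pixels and
    whose stroke widths differ by less than width_ratio; returns a list of
    index groups, each group forming one text region.
    """
    groups = []
    assigned = [False] * len(regions)
    for i, ri in enumerate(regions):
        if assigned[i]:
            continue
        group = [i]
        assigned[i] = True
        for j in range(i + 1, len(regions)):
            if assigned[j]:
                continue
            rj = regions[j]
            d = np.hypot(ri["centroid"][0] - rj["centroid"][0],
                         ri["centroid"][1] - rj["centroid"][1])
            wr = max(ri["stroke_width"], rj["stroke_width"]) / \
                 (min(ri["stroke_width"], rj["stroke_width"]) + 1e-12)
            if d < dist_thresh and wr < width_ratio:
                group.append(j)
                assigned[j] = True
        groups.append(group)
    return groups
```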
The method of the invention has undergone multiple simulation tests, all of which were successful. Referring to fig. 4(A), (B) and (C), the three diagrams respectively show, for an embodiment of the method of the present invention, the original video frame, the video frame feature summary map obtained in step 1, and the result of step 2: an example of the outdoor scene text acquired from the video. That is, the input is a video frame containing scene text, and after processing by the method of the present invention the output is the acquired complete scene text, which can be used for subsequent scene text recognition. The invention thus achieves its stated purpose and has good prospects for popularization and application.

Claims (9)

1. A method for automatically acquiring outdoor scene text in video based on a feature summary map, characterized in that the method comprises the following operation steps:
step 1, acquiring a video frame image containing scene text and generating a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extracting four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors for representing the RGB color space; then, computing pairwise products of the four direction feature vectors to obtain ten saliency maps respectively representing the video frame in different directions; and then fusing the ten saliency maps in different directions to obtain the video frame feature summary map;
step 2, automatically acquiring the scene text: first, performing K-means color clustering on the HSV color space based on the video frame feature summary map using the K-means clustering algorithm, subdividing the video frame summary map into four classes of results respectively representing the four regions of background, foreground characters, character outlines and noise; then performing connected-domain analysis on the four classes of results respectively, deleting the background and noise regions, and acquiring the final scene text;
wherein obtaining the four direction feature vectors for representing the RGB color space comprises: extracting on the red channel the horizontal-direction convolution map Rh, the vertical-direction convolution map Rv, the 45-degree-direction convolution map Rl and the 135-degree-direction convolution map Rr; extracting on the green channel the horizontal-direction convolution map Gh, the vertical-direction convolution map Gv, the 45-degree-direction convolution map Gl and the 135-degree-direction convolution map Gr; extracting on the blue channel the horizontal-direction convolution map Bh, the vertical-direction convolution map Bv, the 45-degree-direction convolution map Bl and the 135-degree-direction convolution map Br; and then arranging the convolution maps of each direction according to the RGB color space to obtain the four direction feature vectors representing the RGB color space: the horizontal-direction feature vector H = {Rh, Gh, Bh}, the vertical-direction feature vector V = {Rv, Gv, Bv}, the 45-degree-direction feature vector L = {Rl, Gl, Bl}, and the 135-degree-direction feature vector R = {Rr, Gr, Br}.
2. The method of claim 1, wherein: among the four convolution maps extracted from the RGB color space in the horizontal, vertical, 45-degree and 135-degree directions, the horizontal-direction convolution kernel used for the horizontal-direction convolution map is the computation template for calculating the horizontal-direction differential in the Sobel operator:
-1  0  +1
-2  0  +2
-1  0  +1
the vertical direction convolution kernel adopted by the vertical direction convolution graph is a calculation template for calculating vertical direction differential in a Sobel operator:
-1  -2  -1
 0   0   0
+1  +2  +1
the 45-degree direction convolution kernel adopted by the 45-degree direction convolution diagram is a calculation template for calculating 45-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 45-degree-direction differential]
the 135-degree direction convolution kernel adopted by the 135-degree direction convolution graph is a calculation template for calculating 135-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 135-degree-direction differential]
3. the method of claim 1, wherein: the process of obtaining ten saliency maps representing video frames in different directions in step 1 includes:
(120) according to the formula Shh = {Rh, Gh, Bh} × {Rh, Gh, Bh}, calculating the square product of the horizontal-direction feature vector to obtain the horizontal-direction saliency map Shh;
(121) according to the formula Svv = {Rv, Gv, Bv} × {Rv, Gv, Bv}, calculating the square product of the vertical-direction feature vector to obtain the vertical-direction saliency map Svv;
(122) according to the formula Sll = {Rl, Gl, Bl} × {Rl, Gl, Bl}, calculating the square product of the 45-degree-direction feature vector to obtain the 45-degree-direction saliency map Sll;
(123) according to the formula Srr = {Rr, Gr, Br} × {Rr, Gr, Br}, calculating the square product of the 135-degree-direction feature vector to obtain the 135-degree-direction saliency map Srr;
(124) according to the formula Shv = {Rh, Gh, Bh} × {Rv, Gv, Bv}, calculating the product of the horizontal-direction and vertical-direction feature vectors to obtain the horizontal-vertical saliency map Shv;
(125) according to the formula Shl = {Rh, Gh, Bh} × {Rl, Gl, Bl}, calculating the product of the horizontal-direction and 45-degree-direction feature vectors to obtain the horizontal/45-degree saliency map Shl;
(126) according to the formula Shr = {Rh, Gh, Bh} × {Rr, Gr, Br}, calculating the product of the horizontal-direction and 135-degree-direction feature vectors to obtain the horizontal/135-degree saliency map Shr;
(127) according to the formula Svl = {Rv, Gv, Bv} × {Rl, Gl, Bl}, calculating the product of the vertical-direction and 45-degree-direction feature vectors to obtain the vertical/45-degree saliency map Svl;
(128) according to the formula Svr = {Rv, Gv, Bv} × {Rr, Gr, Br}, calculating the product of the vertical-direction and 135-degree-direction feature vectors to obtain the vertical/135-degree saliency map Svr;
(129) according to the formula Slr = {Rl, Gl, Bl} × {Rr, Gr, Br}, calculating the product of the 45-degree-direction and 135-degree-direction feature vectors to obtain the 45-degree/135-degree saliency map Slr.
4. The method of claim 1, wherein: in step 1, fusing the ten saliency maps in different directions to obtain the video frame feature summary map comprises: computing, for each pixel coordinate, the maximum, minimum and average values across the ten saliency maps in different directions, and combining the obtained results to obtain the final video frame summary map fsg.
5. The method of claim 4, wherein: obtaining the final video frame summary map fsg by computing the maximum, minimum and average values of the pixels at the same coordinates across the ten saliency maps in different directions and combining the results comprises:
(131) selecting the minimum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the minimum feature saliency map Smin(x, y) = min(pi(x, y)), where pi(x, y) is the pixel value at coordinate (x, y) of each saliency map, the subscript i denotes the saliency map type, i ∈ {Shh, Svv, Sll, Srr, Shv, Shl, Shr, Svl, Svr, Slr}, and the function min is the operator taking the minimum of the pixels pi(x, y);
(132) selecting the maximum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the maximum feature saliency map Smax(x, y) = max(pi(x, y)), where the function max is the operator taking the maximum of the pixels pi(x, y);
(133) selecting the average value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the average feature saliency map Smean(x, y) = mean(pi(x, y)), where the function mean is the operator taking the average of the co-located pixels pi(x, y);
(134) and performing fusion calculation by using the minimum feature saliency map, the maximum feature saliency map and the average feature saliency map according to a formula:
[formula image not reproduced: fusion of the minimum, maximum and average feature saliency maps into fsg]
a final video frame feature summary map fsg is obtained.
6. The method of claim 1, wherein: the step 2 comprises the following operation contents:
(21) performing color clustering on the hue, saturation and value (HSV) color space of the video frame feature summary map based on the K-means clustering algorithm: dividing the video frame feature summary map into four regions respectively representing background, foreground characters, character outlines and noise, and performing color clustering on the four regions respectively to obtain four classes of K-means color clustering results for the four regions;
(22) connected domain processing based on stroke width: respectively calculating the edge pixel stroke width of each connected domain according to the four types of K-means color clustering results, analyzing each connected domain based on the stroke width, and deleting the background region and the noise region;
(23) for the less noisy regions that remain after the processing of step (22), performing geometry-based connected-domain processing: respectively calculating the number of pixels contained in each connected domain in the character image, and deleting connected domains whose proportion is too small, which are regarded as noise regions;
(24) acquiring a scene text area: analyzing all connected domains in the four types of K-means color clustering results, combining the final connected domains reserved by all clusters into an image, and judging the similar connected domains as the same region according to two measures of the distance and the stroke width of each connected domain, thereby obtaining the final video scene text.
7. The method of claim 6, wherein: performing the color clustering calculation on the hue, saturation and value (HSV) color space of the video frame feature summary map based on the K-means clustering algorithm comprises: performing four-dimensional clustering according to the included-angle cosine distance between each pixel and the center points of the four classes of K-means color clusters representing the four regions of background, foreground characters, character outlines and noise.
8. The method of claim 6, wherein: the step (22) comprises the following operations;
(221) based on the ten saliency maps in step 1, calculating a gradient direction angle θ of each pixel in the video frame summary map:
[formula image not reproduced: definition of the gradient direction angle θ]
(222) deleting connected domains connected with the upper, lower, left and right boundaries of the image in the video abstract picture;
(223) acquiring boundary pixels of each connected domain, searching forward from each boundary pixel along its gradient direction angle θ until another boundary pixel is found, and taking the distance between the two boundary pixels as the stroke width of both boundary pixels;
(224) firstly, calculating the stroke widths of all boundary pixels of the same connected domain, and then calculating the variance of the stroke widths of all the boundary pixels; if the numerical value of the calculated variance is less than 0.5, the stroke width value of the boundary pixel of the connected domain is considered to be close to the actual numerical value, and the character area is reserved as a candidate; otherwise, it is deleted as not belonging to the character.
9. The method of claim 8, wherein: deleting connected domains connected with the upper, lower, left and right boundaries of the image in the video summary map, wherein the deleting comprises the following steps:
deleting regions of the connected domain with a large aspect ratio; the connected domain is evaluated by calculating its major-axis length, and if the major-axis length is greater than one third or less than one tenth of the image width of the video frame feature summary map, the connected domain is considered not to belong to a character region and is deleted.
CN201711381971.5A 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram Active CN108038458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711381971.5A CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711381971.5A CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Publications (2)

Publication Number Publication Date
CN108038458A CN108038458A (en) 2018-05-15
CN108038458B true CN108038458B (en) 2021-04-09

Family

ID=62099983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711381971.5A Active CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Country Status (1)

Country Link
CN (1) CN108038458B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829458B (en) * 2019-01-14 2023-04-04 上海交通大学 Method for automatically generating log file for recording system operation behavior in real time
CN110347870A (en) * 2019-06-19 2019-10-18 西安理工大学 The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method
CN110472550A (en) * 2019-08-02 2019-11-19 南通使爱智能科技有限公司 A kind of text image shooting integrity degree judgment method and system
CN113192033B (en) * 2021-04-30 2024-03-19 深圳市创想三维科技股份有限公司 Wire drawing judging method, device and equipment in 3D printing and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101276461A (en) * 2008-03-07 2008-10-01 北京航空航天大学 Method for increasing video text with edge characteristic
CN101515325A (en) * 2009-04-08 2009-08-26 北京邮电大学 Character extracting method in digital video based on character segmentation and color cluster
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device
WO2017089865A1 (en) * 2015-11-24 2017-06-01 Czech Technical University In Prague, Department Of Cybernetics Efficient unconstrained stroke detector
CN106874905A (en) * 2017-01-12 2017-06-20 中南大学 A kind of method of the natural scene text detection based on self study Color-based clustering
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene Method for text detection based on multichannel extremal region

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101276461A (en) * 2008-03-07 2008-10-01 北京航空航天大学 Method for increasing video text with edge characteristic
CN101515325A (en) * 2009-04-08 2009-08-26 北京邮电大学 Character extracting method in digital video based on character segmentation and color cluster
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device
WO2017089865A1 (en) * 2015-11-24 2017-06-01 Czech Technical University In Prague, Department Of Cybernetics Efficient unconstrained stroke detector
CN106874905A (en) * 2017-01-12 2017-06-20 中南大学 A kind of method of the natural scene text detection based on self study Color-based clustering
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene Method for text detection based on multichannel extremal region

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Huang, Xiaodong. "Automatic video superimposed text detection based on Nonsubsampled Contourlet Transform." Multimedia Tools and Applications (2017): 7033-7049. *
Huang, Xiaodong, et al. "Video Text Detection Based on Text Edge Map." 2013 3rd International Conference on Computer Science and Network Technology (ICCSNT), 2013: 1003-1007. *
Huang, Xiaodong, et al. "Video Text Extraction Based on Stroke Width and Color." Proceedings of the 3rd International Conference on Multimedia Technology (ICMT-13), vol. 84, 2013: 629-636. *
Huang, Xiaodong. "Research on Video Text Acquisition Based on Feature Fusion (基于特征融合的视频文本获取研究)." China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 1, 2012: I138-74. *

Also Published As

Publication number Publication date
CN108038458A (en) 2018-05-15

Similar Documents

Publication Publication Date Title
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN108038458B (en) Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram
Wang et al. Character location in scene images from digital camera
Phan et al. A Laplacian method for video text detection
Lu et al. Salient object detection using concavity context
CN107688806B (en) Affine transformation-based free scene text detection method
US8494297B2 (en) Automatic detection and mapping of symmetries in an image
WO2018145470A1 (en) Image detection method and device
CN108537239B (en) Method for detecting image saliency target
CN109409356B (en) Multi-direction Chinese print font character detection method based on SWT
Katramados et al. Real-time visual saliency by division of gaussians
JPH10149449A (en) Picture division method, picture identification method, picture division device and picture identification device
Youlian et al. Face detection method using template feature and skin color feature in rgb color space
CN109448010B (en) Automatic four-side continuous pattern generation method based on content features
US20200274990A1 (en) Extracting a document page image from a electronically scanned image having a non-uniform background content
Yang et al. Caption detection and text recognition in news video
Singh et al. An efficient hybrid scheme for key frame extraction and text localization in video
Huang et al. An efficient method of license plate location in natural-scene image
Hu et al. Video text detection with text edges and convolutional neural network
Pan et al. Shadow detection in remote sensing images based on weighted edge gradient ratio
CN111415372A (en) Moving target merging method based on HSI color space and context information
CN110781977A (en) Motion shadow detection method and system based on spatial correlation and extreme learning machine
Zhang et al. Arbitrarily oriented text detection using geodesic distances between corners and skeletons
Long et al. An Efficient Method For Dark License Plate Detection
CN111325209A (en) License plate recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant