CN108038458B - Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram - Google Patents

Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Info

Publication number
CN108038458B
Authority
CN
China
Prior art keywords
video frame
convolution
calculating
map
degree direction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711381971.5A
Other languages
Chinese (zh)
Other versions
CN108038458A (en)
Inventor
黄晓冬 (Huang Xiaodong)
王勤 (Wang Qin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN201711381971.5A priority Critical patent/CN108038458B/en
Publication of CN108038458A publication Critical patent/CN108038458A/en
Application granted granted Critical
Publication of CN108038458B publication Critical patent/CN108038458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour

Abstract

A method for automatically acquiring outdoor scene text in video based on a feature summary map comprises the following steps. First, a video frame image containing scene text is acquired and a video frame feature summary map is generated from the RGB color space of the video frame image: four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions are extracted from the RGB color space to obtain four direction feature vectors representing the color space; ten saliency maps representing the video frame in different directions are then obtained and fused to produce the video frame feature summary map. Next, K-means color clustering is performed based on the video frame feature summary map and the RGB color space to obtain four classes of results representing four regions: background, foreground characters, character outlines and noise. Connected-domain analysis is then carried out on the four classes of results, the background and noise regions are deleted, and the final outdoor scene text is obtained automatically. The method has simple operation steps, is easy to compute, supports real-time recognition and acquisition of outdoor scene text, and has good prospects for popularization and application.

Description

Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram
Technical Field
The invention relates to a digital image processing method, in particular to a method for automatically acquiring outdoor scene text in video based on a feature summary map, and belongs to the technical field of computer vision processing.
Background
In recent years, content-based image understanding techniques have gained increasing attention as digital image capture devices, smart phones, and practical vision systems have become popular. Because scene text in images/videos carries rich and direct semantic cues, it is considered an important object that must be detected and recognized. Text detection, localization, extraction and recognition are the main steps of acquiring text information; the operations of detection, localization and extraction are collectively referred to as text acquisition. Text acquisition is a crucial prerequisite for text recognition, because it removes the complex background and eliminates lighting effects, making recognition comparatively simple. However, owing to uneven indoor and outdoor illumination, image/video blur, background complexity, perspective distortion, color diversity, font complexity and varying stroke widths, acquiring text from video scenes remains very challenging.
At present, researchers at home and abroad have developed various methods for acquiring video scene text. The acquisition of scene text is generally divided into two steps: (1) scene text detection and localization, and (2) scene text extraction.
Prior-art scene text detection and localization methods can be divided into four categories: color-based, edge/gradient-based, texture-based, and stroke-based scene text detection. Wherein:
Color-based scene text detection: this is a traditional method that has been in use for more than twenty years; it is simple and efficient. A scene text detection algorithm based on a local threshold is usually adopted, and some researchers use the local threshold estimation of an improved Niblack algorithm, so that scene text with relatively simple backgrounds can be detected quickly. Researchers have also proposed generating color layers with the mean-shift algorithm to significantly improve the robustness of text detection against complex backgrounds. However, color-feature-based text detection encounters many difficulties when characters of multiple colors and non-uniform illumination are present in the video/image.
Edge/gradient-based scene text detection: assuming that text regions exhibit strong and symmetric changes against the background region, pixels with large, symmetric gradient values can be regarded as text pixels, so edge and gradient features can be used in scene text detection. Researchers have proposed scene text detection algorithms based on edge enhancement; such studies include clustering horizontally arranged "gradient vector streams" into text candidate regions according to spatial constraints of size, location and color distance. Researchers have further proposed scene text detection algorithms that combine gradient/edge features with various classifiers (e.g., artificial neural networks or the AdaBoost algorithm), and a neural-network-based text locator has been added on top of the AdaBoost classifier. However, such algorithms have difficulty detecting scene text in complex backgrounds with strong gradients.
Texture-based scene text detection: when character regions are dense, scene text can be regarded as a kind of texture. Many current methods use texture feature extraction to detect scene text, including the Fourier transform, the Discrete Cosine Transform (DCT), wavelets, the Local Binary Pattern (LBP), the Histogram of Oriented Gradients (HOG), and so on. Although texture features can detect dense characters efficiently, such approaches may fail on sparse characters. Researchers have therefore proposed scene text detection methods based on Fourier frequency-domain features and on frequency-domain DCT coefficients, respectively. Recently, a scene text detection algorithm based on the Local Haar Binary Pattern feature has been proposed. However, when the background is complicated, many background noises exhibit text-like textures, which reduces the detection accuracy of these methods.
Stroke-based scene text detection: the Stroke Width Transform (SWT) is used to compute the most likely stroke pixel width. Stroke-based features have proven very effective for detecting high-resolution scene text, especially when combined with appropriate learning methods, or when stroke features are fused with other features such as the Edge Orientation Variance (EOV), Opposite Edge Pairs (OEPs), or spatio-temporal analysis. Recently, bandlet-based edge detectors have been introduced to improve the SWT, enhancing the edge differences of scene text and eliminating noisy edges, so that the SWT can also be used to detect low-resolution text. However, when scene text contains characters of various sizes and fonts, the detection accuracy of this method drops significantly.
Prior-art scene text extraction methods comprise at least three categories of algorithms: threshold-based, color-based, and character-stroke-based. Wherein:
Threshold-based text extraction: such methods fall into two subclasses: one uses a global threshold, such as Otsu's algorithm; the other uses a local threshold. A multi-threshold algorithm has also been proposed, in which the second-stage threshold depends on the threshold of the first stage, significantly enhancing the extraction. However, because threshold-based methods do not consider the characteristics of scene text, their performance and applicability are unsatisfactory.
Color-based text extraction: several candidate binary images are first generated with k-means or another clustering algorithm, and a binary image is then selected on the basis of image analysis. These methods assume that the text color is consistent and introduce color clustering into scene text extraction. Their disadvantages are that, being global computations, they are sensitive to non-uniform illumination, and both the computational cost of analyzing multiple candidate images and the selection of the parameter k are cumbersome.
Character-stroke-based text extraction: two sets of asymmetric Gabor filters are used to extract texture direction and scale in the image, and these features represent the character edges most likely to enhance contrast. However, this algorithm is sensitive to the size of the extracted characters and is unsuitable for extracting scene text in video.
In summary, the prior-art techniques for scene text detection/localization and scene text extraction all have unsatisfactory aspects, so developing a better-performing method for acquiring scene text in video has become a new subject of great interest to technical personnel in the industry.
Disclosure of Invention
In view of this, an object of the present invention is to provide a method for automatically acquiring outdoor scene text in video based on a feature summary map, which overcomes the defects of the prior art and can correctly and completely acquire scene text under uneven illumination, blur or complex backgrounds, and under conditions such as perspective distortion, varied colors, complicated fonts, and unequal stroke widths.
In order to achieve the above object, the present invention provides a method for automatically acquiring outdoor scene text in video based on a feature summary map, characterized in that the method comprises the following operation steps:
step 1, acquire a video frame image containing scene text and generate a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extract four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors representing the RGB color space; then, compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; finally, fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which serves as the visual representation of the scene text for subsequent acquisition from the video, with background and noise interference removed to improve recognition accuracy;
step 2, automatically acquire the scene text: first, perform K-means color clustering based on the video frame feature summary map and the RGB color space, subdividing the video frame summary map into four classes of results representing the four regions of background, foreground characters, character outlines and noise; then perform connected-domain analysis on the four classes of results, delete the background and noise regions, and obtain the final scene text.
At present, acquiring outdoor video scene text under complex backgrounds and changing illumination is very difficult. The method is an innovative method for automatically acquiring outdoor scene text in video; its technical key is the construction of a brand-new video frame feature summary map, used as the visual representation and basis for automatically acquiring scene text in the video. Meanwhile, K-means color clustering is performed based on the video frame feature summary map and the hue, saturation and value (HSV) color space; connected-domain analysis based on character stroke width and on geometric shape is then performed respectively, and after the background and noise regions are deleted, the final video scene text can be obtained quickly and automatically.
Multiple simulation tests prove that the method overcomes the defects of the prior art: it can still quickly and accurately detect and extract scene text when the outdoor video exhibits complex backgrounds, perspective deformation, varied colors, uneven or strong illumination, complicated fonts and differing stroke widths. Its operation steps are simple, its computational complexity is low and it is easy to implement, so it can meet the requirements of real-time recognition and acquisition of outdoor scene text and has good prospects for popularization and application.
Drawings
Fig. 1 is a flowchart of the operation steps of the method of the present invention for automatically acquiring outdoor scene text in video based on a feature summary map.
Fig. 2 is a flowchart of the operations of step 1 of the method of the present invention.
Fig. 3 is a flowchart of the operations of step 2 of the method of the present invention.
Fig. 4(A), (B) and (C) are schematic diagrams showing, respectively, the original image, the video frame feature summary map, and the finally acquired scene text in an embodiment of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples.
Referring to fig. 1, the specific operation steps of the method of the present invention for automatically acquiring outdoor scene text in video based on a feature summary map are described as follows:
Step 1, acquire a video frame image containing scene text and generate a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extract four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors representing the RGB color space; then, compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; finally, fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which serves as the visual representation of the scene text for subsequent acquisition from the video, with background and noise interference removed to improve recognition accuracy.
Among the four convolution maps extracted from the RGB color space in the horizontal, vertical, 45-degree and 135-degree directions, the horizontal-direction convolution kernel used for the horizontal-direction convolution map is the computation template for calculating the horizontal-direction differential in the Sobel operator:
-1  0  +1
-2  0  +2
-1  0  +1
the vertical convolution kernel adopted by the vertical convolution graph is a calculation template for calculating vertical differential in a Sobel operator:
-1  -2  -1
 0   0   0
+1  +2  +1
the 45-degree direction convolution kernel adopted by the 45-degree convolution graph is a calculation template for calculating 45-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 45-degree-direction differential]
the 135-degree direction convolution kernel adopted by the 135-degree convolution graph is a calculation template for calculating 135-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 135-degree-direction differential]
the convolution kernel-based convolution graph feature extraction method has the characteristics of simple algorithm, high operation speed and contribution to engineering realization, and the extracted convolution features are not easily influenced by illumination change in an outdoor scene.
Step 2, automatically acquire the scene text: first, perform K-means color clustering based on the video frame feature summary map and the RGB color space, subdividing the video frame summary map into four classes of results representing the four regions of background, foreground characters, character outlines and noise; then perform connected-domain analysis on the four classes of results respectively, delete the background and noise regions, and obtain the final scene text.
The color clustering in the K-means clustering algorithm is realized by performing four-dimensional clustering in the four-dimensional space formed by the video frame feature summary map, hue, saturation and value, according to the included-angle cosine distance between each pixel and the four cluster centers. Because illumination changes drastically in outdoor environments, each character may present different colors in the video frame, which seriously affects the integrity of the extracted character strokes; clustering in these four dimensions, with an included-angle cosine distance function that, unlike the common Euclidean distance, does not focus on numerical magnitude differences, can therefore significantly reduce the influence of outdoor illumination changes on character color.
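The following sketch illustrates one way such a four-dimensional K-means with included-angle cosine distance could be run on per-pixel samples of [summary-map value, H, S, V]; the random initialization, iteration count and normalization are implementation assumptions, not taken from the patent.

```python
import numpy as np

def cosine_kmeans(samples: np.ndarray, k: int = 4, iters: int = 20,
                  seed: int = 0) -> np.ndarray:
    """K-means-style clustering with included-angle cosine distance.

    samples: (N, 4) array of [fsg, H, S, V] values per pixel.
    Returns an (N,) array of cluster labels in {0, ..., k-1}.
    """
    rng = np.random.default_rng(seed)
    # Normalize each sample so the cosine distance depends only on direction.
    unit = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-12)
    centers = unit[rng.choice(len(unit), size=k, replace=False)]
    for _ in range(iters):
        # Cosine distance = 1 - cosine similarity, so maximize similarity.
        sim = unit @ centers.T                      # (N, k)
        labels = np.argmax(sim, axis=1)             # nearest center per pixel
        for j in range(k):
            members = unit[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / (np.linalg.norm(c) + 1e-12)
    return labels
```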
Referring to fig. 2, the specific operations of step 1 of the above two operation steps are described in detail:
(11) First, extract on the red channel the horizontal-direction convolution map Rh, the vertical-direction convolution map Rv, the 45-degree-direction convolution map Rl and the 135-degree-direction convolution map Rr; extract on the green channel the horizontal-direction convolution map Gh, the vertical-direction convolution map Gv, the 45-degree-direction convolution map Gl and the 135-degree-direction convolution map Gr; and extract on the blue channel the horizontal-direction convolution map Bh, the vertical-direction convolution map Bv, the 45-degree-direction convolution map Bl and the 135-degree-direction convolution map Br. Then arrange the convolution maps of each direction according to the RGB color space to obtain the four direction feature vectors representing the RGB color space: the horizontal-direction feature vector H = {Rh, Gh, Bh}, the vertical-direction feature vector V = {Rv, Gv, Bv}, the 45-degree-direction feature vector L = {Rl, Gl, Bl}, and the 135-degree-direction feature vector R = {Rr, Gr, Br}.
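As an illustrative sketch of sub-step (11), reusing the hypothetical directional_convolutions helper from the earlier kernel sketch, each RGB channel is filtered in the four directions and the results are stacked into the four three-channel direction feature vectors H, V, L and R.

```python
import numpy as np

def direction_feature_vectors(rgb: np.ndarray) -> dict:
    """rgb: (H, W, 3) image. Returns four (H, W, 3) direction feature maps."""
    red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    r = directional_convolutions(red)    # Rh, Rv, Rl, Rr
    g = directional_convolutions(green)  # Gh, Gv, Gl, Gr
    b = directional_convolutions(blue)   # Bh, Bv, Bl, Br
    stack = lambda d: np.dstack([r[d], g[d], b[d]])
    return {
        "H": stack("h"),   # {Rh, Gh, Bh}
        "V": stack("v"),   # {Rv, Gv, Bv}
        "L": stack("l"),   # {Rl, Gl, Bl}
        "R": stack("r"),   # {Rr, Gr, Br}
    }
```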
(12) Compute pairwise products of the four direction feature vectors to obtain ten saliency maps representing the video frame in different directions; this retains multiple edge features in the chosen directions while removing background and noise interference in other directions, so that stroke features of the scene text in multiple directions are obtained, which facilitates automatic extraction of the scene text. Step (12) is subdivided into the following operations:
(120) According to the formula Shh = {Rh, Gh, Bh} × {Rh, Gh, Bh}, calculate the square product of the horizontal-direction feature vector to obtain the horizontal-direction saliency map Shh, which preserves and strengthens the edge features in the horizontal direction and weakens the edge features in other directions.
(121) According to the formula Svv = {Rv, Gv, Bv} × {Rv, Gv, Bv}, calculate the square product of the vertical-direction feature vector to obtain the vertical-direction saliency map Svv, which preserves and strengthens the edge features in the vertical direction and weakens the edge features in other directions.
(122) According to the formula Sll = {Rl, Gl, Bl} × {Rl, Gl, Bl}, calculate the square product of the 45-degree-direction feature vector to obtain the 45-degree-direction saliency map Sll, which preserves and strengthens the edge features in the 45-degree direction and weakens the edge features in other directions.
(123) According to the formula Srr = {Rr, Gr, Br} × {Rr, Gr, Br}, calculate the square product of the 135-degree-direction feature vector to obtain the 135-degree-direction saliency map Srr, which preserves and strengthens the edge features in the 135-degree direction and weakens the edge features in other directions.
(124) According to the formula Shv = {Rh, Gh, Bh} × {Rv, Gv, Bv}, calculate the product of the horizontal-direction and vertical-direction feature vectors to obtain the horizontal-vertical saliency map Shv, which preserves and strengthens the edge features in the horizontal and vertical directions and weakens the edge features in other directions.
(125) According to the formula Shl = {Rh, Gh, Bh} × {Rl, Gl, Bl}, calculate the product of the horizontal-direction and 45-degree-direction feature vectors to obtain the horizontal/45-degree saliency map Shl, which preserves and strengthens the edge features in the horizontal and 45-degree directions and weakens the edge features in other directions.
(126) According to the formula Shr = {Rh, Gh, Bh} × {Rr, Gr, Br}, calculate the product of the horizontal-direction and 135-degree-direction feature vectors to obtain the horizontal/135-degree saliency map Shr, which preserves and strengthens the edge features in the horizontal and 135-degree directions and weakens the edge features in other directions.
(127) According to the formula Svl = {Rv, Gv, Bv} × {Rl, Gl, Bl}, calculate the product of the vertical-direction and 45-degree-direction feature vectors to obtain the vertical/45-degree saliency map Svl, which preserves and strengthens the edge features in the vertical and 45-degree directions and weakens the edge features in other directions.
(128) According to the formula Svr = {Rv, Gv, Bv} × {Rr, Gr, Br}, calculate the product of the vertical-direction and 135-degree-direction feature vectors to obtain the vertical/135-degree saliency map Svr, which preserves and strengthens the edge features in the vertical and 135-degree directions and weakens the edge features in other directions.
(129) According to the formula Slr = {Rl, Gl, Bl} × {Rr, Gr, Br}, calculate the product of the 45-degree-direction and 135-degree-direction feature vectors to obtain the 45-degree/135-degree saliency map Slr, which preserves and strengthens the edge features in the 45-degree and 135-degree directions and weakens the edge features in other directions.
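The following sketch illustrates sub-step (12). The "product" of two three-channel direction feature vectors is interpreted here as the per-pixel sum of channel-wise products (a per-pixel dot product), which is one plausible reading of the formulas above; the patent does not spell out how the three channels are reduced to a single saliency value.

```python
import itertools
import numpy as np

def saliency_maps(vecs: dict) -> dict:
    """vecs: output of direction_feature_vectors(); keys 'H', 'V', 'L', 'R'.

    Returns the ten single-channel saliency maps Shh, Shv, ..., Slr as the
    per-pixel dot product of each unordered pair of direction vectors,
    pairs with repetition included (4 choose 2 with repetition = 10).
    """
    keys = ["H", "V", "L", "R"]
    maps = {}
    for a, b in itertools.combinations_with_replacement(keys, 2):
        name = "S" + a.lower() + b.lower()          # e.g. "Shv"
        maps[name] = np.sum(vecs[a] * vecs[b], axis=2)
    return maps
```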
(13) Fuse the ten saliency maps in different directions to obtain the video frame feature summary map, which provides the visual representation for subsequently acquiring the scene text in the video, removes background and noise interference, and improves the accuracy and completeness of the automatic scene text acquisition result.
In step (13), the fusion calculation on the ten saliency maps in different directions is carried out as follows: based on the ten saliency maps in different directions extracted in step (12), compute for each pixel coordinate the maximum, minimum and average values across the maps, and combine the results to obtain the final video frame summary map fsg. The operations of step (13) are described below:
(131) Select the minimum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the minimum feature saliency map Smin(x, y) = min(pi(x, y)), where pi(x, y) is the pixel value at coordinate (x, y) of each saliency map, the subscript i denotes the saliency map type, i ∈ {Shh, Svv, Sll, Srr, Shv, Shl, Shr, Svl, Svr, Slr}, and the function min is the operator taking the minimum of the pixels pi(x, y).
(132) Select the maximum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the maximum feature saliency map Smax(x, y) = max(pi(x, y)), where the function max is the operator taking the maximum of the pixels pi(x, y).
(133) Select the average value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the average feature saliency map Smean(x, y) = mean(pi(x, y)), where the function mean is the operator taking the average of the co-located pixels pi(x, y).
(134) In order to preserve, as far as possible, the completeness of the character edge features in all directions in the video frame summary map and to reduce the influence of the illumination changes that easily occur in outdoor video, perform the fusion calculation with the minimum, maximum and average feature saliency maps, which retain the edge features in different directions, according to the formula:
[formula image not reproduced: fusion of the minimum, maximum and average feature saliency maps into fsg]
a final video frame feature summary map fsg is obtained.
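The following sketch illustrates sub-step (13): the per-pixel minimum, maximum and mean over the ten saliency maps are computed as described, and, because the patent's final fusion formula is given only as an image, the three feature saliency maps are simply averaged here as a stand-in combination.

```python
import numpy as np

def feature_summary_map(sal: dict) -> np.ndarray:
    """sal: dict of the ten saliency maps; returns the summary map fsg."""
    stack = np.stack(list(sal.values()), axis=0)    # (10, H, W)
    s_min = stack.min(axis=0)
    s_max = stack.max(axis=0)
    s_mean = stack.mean(axis=0)
    # Stand-in fusion: equal-weight average of the three feature saliency maps
    # (the patent's exact combination formula is not reproduced in the text).
    fsg = (s_min + s_max + s_mean) / 3.0
    # Normalize to [0, 1] for the later clustering step.
    fsg = (fsg - fsg.min()) / (fsg.max() - fsg.min() + 1e-12)
    return fsg
```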
Referring to fig. 3, the specific operations of step 2 of the above two operation steps are described in detail:
(21) Based on the K-means clustering algorithm, perform color clustering on the video frame feature summary map together with the hue, saturation and value (HSV) color space: partition the video frame feature summary map into four classes of K-means color clustering results representing the four regions of background, foreground characters, character outlines and noise.
(22) Connected-domain processing based on stroke width: for the four classes of K-means color clustering results, calculate the stroke width of the edge pixels of each connected domain, analyze each connected domain on the basis of stroke width, and delete the background and noise regions. Step (22) includes the following operations.
(221) Based on the ten saliency maps of step 1, calculate the gradient direction angle θ of each pixel in the video frame summary map:
[formula image not reproduced: definition of the gradient direction angle θ]
(222) Delete the connected domains that touch the upper, lower, left and right boundaries of the image in the video summary map, because character regions do not appear at, or connect with, the boundary of the video image;
(223) Acquire the boundary pixels of each connected domain; for each boundary pixel, search forward along its gradient direction angle θ until another boundary pixel is found, and take the distance between the two boundary pixels as the stroke width of both pixels.
(224) First calculate the stroke widths of all boundary pixels of the same connected domain, then calculate the variance of these stroke widths. If the variance is less than 0.5, the boundary-pixel stroke widths of the connected domain are considered close to the actual value, and the connected domain is retained as a candidate character region (the basis of this operation is that the aspect ratios of Western and Chinese character regions are relatively constant, i.e., the stroke widths within a Western or Chinese character are close to one another); otherwise, regions of the connected domain with a larger aspect ratio are considered not to belong to characters and are deleted.
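The following sketch compresses operations (221) to (224) for one binary cluster image. The gradient direction angle θ is taken as an input map (how it is computed from the ten saliency maps follows the patent's formula, which is not reproduced here), each boundary pixel is traced along θ until another boundary pixel of the same connected domain is reached, and a connected domain is kept as a character candidate only if the variance of its boundary stroke widths is below 0.5 as stated in (224). The ray-length cap is a hypothetical safeguard, not part of the patent.

```python
import numpy as np
from scipy import ndimage

def stroke_width_filter(mask: np.ndarray, theta: np.ndarray,
                        max_len: int = 50, var_thresh: float = 0.5) -> np.ndarray:
    """Keep connected domains whose boundary stroke widths are consistent.

    mask: (H, W) binary image for one cluster; theta: per-pixel gradient angle (radians).
    """
    mask = mask.astype(bool)
    h, w = mask.shape
    labels, n = ndimage.label(mask)
    keep = np.zeros_like(mask)
    for comp in range(1, n + 1):
        comp_mask = labels == comp
        # (222) discard connected domains touching the image border.
        if comp_mask[0, :].any() or comp_mask[-1, :].any() \
           or comp_mask[:, 0].any() or comp_mask[:, -1].any():
            continue
        # Boundary pixels = component minus its erosion.
        boundary = comp_mask & ~ndimage.binary_erosion(comp_mask)
        widths = []
        for y, x in zip(*np.nonzero(boundary)):
            dy, dx = np.sin(theta[y, x]), np.cos(theta[y, x])
            for step in range(1, max_len):          # (223) trace along theta.
                yy, xx = int(round(y + dy * step)), int(round(x + dx * step))
                if not (0 <= yy < h and 0 <= xx < w) or not comp_mask[yy, xx]:
                    break
                if boundary[yy, xx]:                # reached the opposite edge.
                    widths.append(step)
                    break
        # (224) keep the domain if its stroke widths are consistent (variance < 0.5).
        if widths and np.var(widths) < var_thresh:
            keep |= comp_mask
    return keep
```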
(23) For the less noisy regions that remain after the processing of step (22), perform geometry-based connected-domain processing: calculate the area of each connected domain in the character image, i.e., the number of pixels it contains, and delete connected domains whose proportion is too small, which are regarded as noise regions, thereby improving the target image quality. The specific operation of this connected-domain processing is to calculate the major-axis length; if the major-axis length is greater than one third or less than one tenth of the image width of the video frame feature summary map, the connected domain is considered too large or too small, is deemed not to belong to a character region, and is deleted.
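The following sketch illustrates the major-axis test of step (23). The major-axis length is estimated from the eigenvalues of each connected domain's coordinate covariance (the equivalent-ellipse approximation), which is an implementation choice here rather than the patent's prescribed computation.

```python
import numpy as np
from scipy import ndimage

def geometry_filter(mask: np.ndarray, frame_width: int) -> np.ndarray:
    """Drop connected domains whose major axis is > width/3 or < width/10."""
    mask = mask.astype(bool)
    labels, n = ndimage.label(mask)
    keep = np.zeros_like(mask)
    for comp in range(1, n + 1):
        ys, xs = np.nonzero(labels == comp)
        coords = np.stack([ys, xs], axis=1).astype(np.float64)
        if len(coords) < 2:
            continue
        cov = np.cov(coords, rowvar=False)
        # Major-axis length ~ 4 * sqrt(largest eigenvalue): full axis of the
        # equivalent ellipse; an estimate, not the patent's exact formula.
        major = 4.0 * np.sqrt(np.max(np.linalg.eigvalsh(cov)))
        if frame_width / 10.0 <= major <= frame_width / 3.0:
            keep |= labels == comp
    return keep
```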
(24) Acquire the scene text region: analyze all connected domains in the four clustering results, merge the connected domains finally retained by all clusters into one image, and judge similar connected domains to be the same region according to two measures, the distance between connected domains and their stroke widths, thereby obtaining the final video scene text.
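Finally, a sketch of the merging rule in step (24): connected domains retained from the four clusters are grouped into one text region when both their centroid distance and their stroke widths are close. The specific thresholds below are hypothetical; the patent states only that distance and stroke width are the two measures.

```python
import numpy as np

def merge_regions(regions: list, dist_thresh: float = 30.0,
                  width_ratio: float = 1.5) -> list:
    """regions: list of dicts {'centroid': (y, x), 'stroke_width': float}.

    Greedily groups regions whose centroids lie within dist_thresh pixels and
    whose stroke widths differ by less than width_ratio; returns a list of
    index groups, each group forming one text region.
    """
    groups = []
    assigned = [False] * len(regions)
    for i, ri in enumerate(regions):
        if assigned[i]:
            continue
        group = [i]
        assigned[i] = True
        for j in range(i + 1, len(regions)):
            if assigned[j]:
                continue
            rj = regions[j]
            d = np.hypot(ri["centroid"][0] - rj["centroid"][0],
                         ri["centroid"][1] - rj["centroid"][1])
            wr = max(ri["stroke_width"], rj["stroke_width"]) / \
                 (min(ri["stroke_width"], rj["stroke_width"]) + 1e-12)
            if d < dist_thresh and wr < width_ratio:
                group.append(j)
                assigned[j] = True
        groups.append(group)
    return groups
```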
The method of the invention has undergone multiple simulation tests, all of which were successful. Referring to fig. 4(A), (B) and (C), the three diagrams respectively show, for an embodiment of the method of the present invention, the original video frame, the video frame feature summary map obtained in step 1, and the result of step 2: an example of the outdoor scene text acquired from the video. That is, the input is a video frame containing scene text, and after processing by the method of the present invention the output is the acquired complete scene text, which can be used for subsequent scene text recognition. The invention thus achieves its stated purpose and has good prospects for popularization and application.

Claims (9)

1. A method for automatically acquiring outdoor scene text in video based on a feature summary map, characterized in that the method comprises the following operation steps:
step 1, acquiring a video frame image containing scene text and generating a video frame feature summary map based on the red, green and blue (RGB) color space of the video frame image: first, extracting four convolution maps in the horizontal, vertical, 45-degree and 135-degree directions from the RGB color space to obtain four direction feature vectors for representing the RGB color space; then, computing pairwise products of the four direction feature vectors to obtain ten saliency maps respectively representing the video frame in different directions; and then fusing the ten saliency maps in different directions to obtain the video frame feature summary map;
step 2, automatically acquiring the scene text: first, performing K-means color clustering on the HSV color space based on the video frame feature summary map using the K-means clustering algorithm, subdividing the video frame summary map into four classes of results respectively representing the four regions of background, foreground characters, character outlines and noise; then performing connected-domain analysis on the four classes of results respectively, deleting the background and noise regions, and acquiring the final scene text;
wherein obtaining the four direction feature vectors for representing the RGB color space comprises: extracting on the red channel the horizontal-direction convolution map Rh, the vertical-direction convolution map Rv, the 45-degree-direction convolution map Rl and the 135-degree-direction convolution map Rr; extracting on the green channel the horizontal-direction convolution map Gh, the vertical-direction convolution map Gv, the 45-degree-direction convolution map Gl and the 135-degree-direction convolution map Gr; extracting on the blue channel the horizontal-direction convolution map Bh, the vertical-direction convolution map Bv, the 45-degree-direction convolution map Bl and the 135-degree-direction convolution map Br; and then arranging the convolution maps of each direction according to the RGB color space to obtain the four direction feature vectors representing the RGB color space: the horizontal-direction feature vector H = {Rh, Gh, Bh}, the vertical-direction feature vector V = {Rv, Gv, Bv}, the 45-degree-direction feature vector L = {Rl, Gl, Bl}, and the 135-degree-direction feature vector R = {Rr, Gr, Br}.
2. The method of claim 1, wherein: among the four convolution maps extracted from the RGB color space in the horizontal, vertical, 45-degree and 135-degree directions, the horizontal-direction convolution kernel used for the horizontal-direction convolution map is the computation template for calculating the horizontal-direction differential in the Sobel operator:
-1  0  +1
-2  0  +2
-1  0  +1
the vertical direction convolution kernel adopted by the vertical direction convolution graph is a calculation template for calculating vertical direction differential in a Sobel operator:
-1  -2  -1
 0   0   0
+1  +2  +1
the 45-degree direction convolution kernel adopted by the 45-degree direction convolution diagram is a calculation template for calculating 45-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 45-degree-direction differential]
the 135-degree direction convolution kernel adopted by the 135-degree direction convolution graph is a calculation template for calculating 135-degree direction differential:
[formula image not reproduced: 3×3 computation template for the 135-degree-direction differential]
3. the method of claim 1, wherein: the process of obtaining ten saliency maps representing video frames in different directions in step 1 includes:
(120) according to the formula Shh = {Rh, Gh, Bh} × {Rh, Gh, Bh}, calculating the square product of the horizontal-direction feature vector to obtain the horizontal-direction saliency map Shh;
(121) according to the formula Svv = {Rv, Gv, Bv} × {Rv, Gv, Bv}, calculating the square product of the vertical-direction feature vector to obtain the vertical-direction saliency map Svv;
(122) according to the formula Sll = {Rl, Gl, Bl} × {Rl, Gl, Bl}, calculating the square product of the 45-degree-direction feature vector to obtain the 45-degree-direction saliency map Sll;
(123) according to the formula Srr = {Rr, Gr, Br} × {Rr, Gr, Br}, calculating the square product of the 135-degree-direction feature vector to obtain the 135-degree-direction saliency map Srr;
(124) according to the formula Shv = {Rh, Gh, Bh} × {Rv, Gv, Bv}, calculating the product of the horizontal-direction and vertical-direction feature vectors to obtain the horizontal-vertical saliency map Shv;
(125) according to the formula Shl = {Rh, Gh, Bh} × {Rl, Gl, Bl}, calculating the product of the horizontal-direction and 45-degree-direction feature vectors to obtain the horizontal/45-degree saliency map Shl;
(126) according to the formula Shr = {Rh, Gh, Bh} × {Rr, Gr, Br}, calculating the product of the horizontal-direction and 135-degree-direction feature vectors to obtain the horizontal/135-degree saliency map Shr;
(127) according to the formula Svl = {Rv, Gv, Bv} × {Rl, Gl, Bl}, calculating the product of the vertical-direction and 45-degree-direction feature vectors to obtain the vertical/45-degree saliency map Svl;
(128) according to the formula Svr = {Rv, Gv, Bv} × {Rr, Gr, Br}, calculating the product of the vertical-direction and 135-degree-direction feature vectors to obtain the vertical/135-degree saliency map Svr;
(129) according to the formula Slr = {Rl, Gl, Bl} × {Rr, Gr, Br}, calculating the product of the 45-degree-direction and 135-degree-direction feature vectors to obtain the 45-degree/135-degree saliency map Slr.
4. The method of claim 1, wherein: in step 1, fusing the ten saliency maps in different directions to obtain the video frame feature summary map comprises: computing, for each pixel coordinate, the maximum, minimum and average values across the ten saliency maps in different directions, and combining the obtained results to obtain the final video frame summary map fsg.
5. The method of claim 4, wherein: obtaining the final video frame summary map fsg by computing the maximum, minimum and average values of the pixels at the same coordinates across the ten saliency maps in different directions and combining the results comprises:
(131) selecting the minimum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the minimum feature saliency map Smin(x, y) = min(pi(x, y)), where pi(x, y) is the pixel value at coordinate (x, y) of each saliency map, the subscript i denotes the saliency map type, i ∈ {Shh, Svv, Sll, Srr, Shv, Shl, Shr, Svl, Svr, Slr}, and the function min is the operator taking the minimum of the pixels pi(x, y);
(132) selecting the maximum value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the maximum feature saliency map Smax(x, y) = max(pi(x, y)), where the function max is the operator taking the maximum of the pixels pi(x, y);
(133) selecting the average value of the pixels located at the same coordinate in the ten saliency maps in different directions for fusion calculation, forming the average feature saliency map Smean(x, y) = mean(pi(x, y)), where the function mean is the operator taking the average of the co-located pixels pi(x, y);
(134) and performing fusion calculation by using the minimum feature saliency map, the maximum feature saliency map and the average feature saliency map according to a formula:
[formula image not reproduced: fusion of the minimum, maximum and average feature saliency maps into fsg]
a final video frame feature summary map fsg is obtained.
6. The method of claim 1, wherein: the step 2 comprises the following operation contents:
(21) performing color clustering on the hue, saturation and value (HSV) color space of the video frame feature summary map based on the K-means clustering algorithm: dividing the video frame feature summary map into four regions respectively representing background, foreground characters, character outlines and noise, and performing color clustering on the four regions respectively to obtain four classes of K-means color clustering results for the four regions;
(22) connected domain processing based on stroke width: respectively calculating the edge pixel stroke width of each connected domain according to the four types of K-means color clustering results, analyzing each connected domain based on the stroke width, and deleting the background region and the noise region;
(23) for the less noisy regions that remain after the processing of step (22), performing geometry-based connected-domain processing: respectively calculating the number of pixels contained in each connected domain in the character image, and deleting connected domains whose proportion is too small, which are regarded as noise regions;
(24) acquiring a scene text area: analyzing all connected domains in the four types of K-means color clustering results, combining the final connected domains reserved by all clusters into an image, and judging the similar connected domains as the same region according to two measures of the distance and the stroke width of each connected domain, thereby obtaining the final video scene text.
7. The method of claim 6, wherein: performing the color clustering calculation on the hue, saturation and value (HSV) color space of the video frame feature summary map based on the K-means clustering algorithm comprises: performing four-dimensional clustering according to the included-angle cosine distance between each pixel and the center points of the four classes of K-means color clusters representing the four regions of background, foreground characters, character outlines and noise.
8. The method of claim 6, wherein: the step (22) comprises the following operations;
(221) based on the ten saliency maps in step 1, calculating a gradient direction angle θ of each pixel in the video frame summary map:
[formula image not reproduced: definition of the gradient direction angle θ]
(222) deleting connected domains connected with the upper, lower, left and right boundaries of the image in the video abstract picture;
(223) acquiring boundary pixels of each connected domain, searching forward from each boundary pixel along its gradient direction angle θ until another boundary pixel is found, and taking the distance between the two boundary pixels as the stroke width of both boundary pixels;
(224) firstly, calculating the stroke widths of all boundary pixels of the same connected domain, and then calculating the variance of the stroke widths of all the boundary pixels; if the numerical value of the calculated variance is less than 0.5, the stroke width value of the boundary pixel of the connected domain is considered to be close to the actual numerical value, and the character area is reserved as a candidate; otherwise, it is deleted as not belonging to the character.
9. The method of claim 8, wherein: deleting connected domains connected with the upper, lower, left and right boundaries of the image in the video summary map, wherein the deleting comprises the following steps:
deleting regions of the connected domain with a large aspect ratio; the connected domain is evaluated by calculating its major-axis length, and if the major-axis length is greater than one third or less than one tenth of the image width of the video frame feature summary map, the connected domain is considered not to belong to a character region and is deleted.
CN201711381971.5A 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram Active CN108038458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711381971.5A CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711381971.5A CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Publications (2)

Publication Number Publication Date
CN108038458A CN108038458A (en) 2018-05-15
CN108038458B true CN108038458B (en) 2021-04-09

Family

ID=62099983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711381971.5A Active CN108038458B (en) 2017-12-20 2017-12-20 Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram

Country Status (1)

Country Link
CN (1) CN108038458B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829458B (en) * 2019-01-14 2023-04-04 上海交通大学 Method for automatically generating log file for recording system operation behavior in real time
CN110347870A (en) * 2019-06-19 2019-10-18 西安理工大学 The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method
CN110472550A (en) * 2019-08-02 2019-11-19 南通使爱智能科技有限公司 A kind of text image shooting integrity degree judgment method and system
CN113192033B (en) * 2021-04-30 2024-03-19 深圳市创想三维科技股份有限公司 Wire drawing judging method, device and equipment in 3D printing and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101276461A (en) * 2008-03-07 2008-10-01 北京航空航天大学 Method for increasing video text with edge characteristic
CN101515325A (en) * 2009-04-08 2009-08-26 北京邮电大学 Character extracting method in digital video based on character segmentation and color cluster
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device
WO2017089865A1 (en) * 2015-11-24 2017-06-01 Czech Technical University In Prague, Department Of Cybernetics Efficient unconstrained stroke detector
CN106874905A (en) * 2017-01-12 2017-06-20 中南大学 A kind of method of the natural scene text detection based on self study Color-based clustering
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene Method for text detection based on multichannel extremal region

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101276461A (en) * 2008-03-07 2008-10-01 北京航空航天大学 Method for increasing video text with edge characteristic
CN101515325A (en) * 2009-04-08 2009-08-26 北京邮电大学 Character extracting method in digital video based on character segmentation and color cluster
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device
WO2017089865A1 (en) * 2015-11-24 2017-06-01 Czech Technical University In Prague, Department Of Cybernetics Efficient unconstrained stroke detector
CN106874905A (en) * 2017-01-12 2017-06-20 中南大学 A kind of method of the natural scene text detection based on self study Color-based clustering
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene Method for text detection based on multichannel extremal region

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Huang, Xiaodong. "Automatic video superimposed text detection based on Nonsubsampled Contourlet Transform." Multimedia Tools and Applications (2017): 7033-7049. *
Huang, Xiaodong, et al. "Video Text Detection Based on Text Edge Map." 2013 3rd International Conference on Computer Science and Network Technology (ICCSNT), 2013: 1003-1007. *
Huang, Xiaodong, et al. "Video Text Extraction Based on Stroke Width and Color." Proceedings of the 3rd International Conference on Multimedia Technology (ICMT-13), vol. 84, 2013: 629-636. *
Huang, Xiaodong. "Research on Video Text Acquisition Based on Feature Fusion (基于特征融合的视频文本获取研究)." China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 1, 2012: I138-74. *

Also Published As

Publication number Publication date
CN108038458A (en) 2018-05-15

Similar Documents

Publication Publication Date Title
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN108038458B (en) Method for automatically acquiring outdoor scene text in video based on characteristic abstract diagram
Wang et al. Character location in scene images from digital camera
Phan et al. A Laplacian method for video text detection
Lu et al. Salient object detection using concavity context
CN107688806B (en) Affine transformation-based free scene text detection method
US8494297B2 (en) Automatic detection and mapping of symmetries in an image
WO2018145470A1 (en) Image detection method and device
CN108537239B (en) Method for detecting image saliency target
CN109409356B (en) Multi-direction Chinese print font character detection method based on SWT
Katramados et al. Real-time visual saliency by division of gaussians
JPH10149449A (en) Picture division method, picture identification method, picture division device and picture identification device
Youlian et al. Face detection method using template feature and skin color feature in rgb color space
CN109448010B (en) Automatic four-side continuous pattern generation method based on content features
US20200274990A1 (en) Extracting a document page image from a electronically scanned image having a non-uniform background content
Yang et al. Caption detection and text recognition in news video
Singh et al. An efficient hybrid scheme for key frame extraction and text localization in video
Huang et al. An efficient method of license plate location in natural-scene image
Hu et al. Video text detection with text edges and convolutional neural network
Pan et al. Shadow detection in remote sensing images based on weighted edge gradient ratio
CN111415372A (en) Moving target merging method based on HSI color space and context information
CN110781977A (en) Motion shadow detection method and system based on spatial correlation and extreme learning machine
Zhang et al. Arbitrarily oriented text detection using geodesic distances between corners and skeletons
Long et al. An Efficient Method For Dark License Plate Detection
CN111325209A (en) License plate recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant