CN108038486A - Character detection method - Google Patents

Character detection method

Info

Publication number
CN108038486A
CN108038486A (application CN201711267804.8A)
Authority
CN
China
Prior art keywords
value
region
representing
character
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711267804.8A
Other languages
Chinese (zh)
Inventor
巫义锐
黄多辉
冯钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711267804.8A priority Critical patent/CN108038486A/en
Publication of CN108038486A publication Critical patent/CN108038486A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character detection method. The method includes: extracting extremal regions of the text image to be detected and filtering them to obtain character candidate regions; computing MSSH features and deep convolution features and fusing them through a self-coding neural network to obtain fused features; further screening character regions from the character candidate regions according to the fused features; and merging all character regions to obtain the final text region. The detection method is highly robust and efficient, and can complete text detection tasks quickly.

Description

Character detection method
Technical Field
The invention relates to a character detection method.
Background
Characters, as one of the most influential inventions of mankind, play an important role in human life. The rich and precise information they carry is of great significance for visual-semantics-based natural scene understanding. More and more multimedia applications, such as street-scene understanding, traffic-sign understanding for driverless vehicles, and semantics-based image retrieval, require accurate and robust text detection. The basic task of text detection is to determine whether text is present in a scene image or video and, if so, to mark its location. In recent years, as the capability and number of image capturing devices have increased, the number of images and videos containing scene text has grown dramatically, and interest in text detection in natural-scene images and videos has grown with it. With the deepening of computer vision research, detecting scene text with computer algorithms has become an important and active research topic internationally.
Scene text detection and recognition under low image quality and complex backgrounds is extremely challenging. Scene text often exhibits low resolution, complex backgrounds, arbitrary orientation, perspective deformation and uneven illumination, whereas document text has a uniform format and a simple background.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a character detection method that addresses the low success rate and poor robustness of existing character detection techniques.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a character detection method comprises the following steps:
extracting an extreme value area of the character picture to be detected, and filtering the extreme value area to obtain a character candidate area;
calculating MSSH characteristics and deep convolution characteristics, and fusing the MSSH characteristics and the deep convolution characteristics through a self-coding neural network to obtain fused characteristics;
further screening out a character region from the character candidate region according to the fusion characteristics;
and merging all the character areas to obtain a final character area.
The specific method for extracting the extremum region is as follows:
converting the character picture to be detected into a gray-level image I_gray, an R-value map I_R, a G-value map I_G and a B-value map I_B;
extremal regions are obtained from I_R, I_G and I_B respectively, as follows:
The extremal region A_R of the R-value map I_R is defined as:
∀ p ∈ A_R, ∀ q ∈ ∂A_R : I_R(q) > θ ≥ I_R(p)
where I_R(p) is the value of pixel p in the R-value map; I_R(q) is the value of pixel q in the R-value map; θ is the threshold of the extremal region; and ∂A_R is the set of pixels adjacent to, but not belonging to, the extremal region A_R;
The extremal region A_G of the G-value map I_G is defined as:
∀ p ∈ A_G, ∀ q ∈ ∂A_G : I_G(q) > θ ≥ I_G(p)
where I_G(p) is the value of pixel p in the G-value map; I_G(q) is the value of pixel q in the G-value map; θ is the threshold of the extremal region; and ∂A_G is the set of pixels adjacent to, but not belonging to, the extremal region A_G;
The extremal region A_B of the B-value map I_B is defined as:
∀ p ∈ A_B, ∀ q ∈ ∂A_B : I_B(q) > θ ≥ I_B(p)
where I_B(p) is the value of pixel p in the B-value map; I_B(q) is the value of pixel q in the B-value map; θ is the threshold of the extremal region; and ∂A_B is the set of pixels adjacent to, but not belonging to, the extremal region A_B.
The method for acquiring the character candidate region comprises the following steps:
calculating the area S, the perimeter C, the Euler number E and the pixel-value variance H of each extremal region, the variance H being computed from the gray-level image I_gray, where:
x denotes a pixel; I_gray(x) is the gray value of pixel x; a is the colour interval containing the largest number of pixels in the extremal region; b is the colour interval containing the second-largest number of pixels; n_a and n_b are the numbers of pixels of the region falling in intervals a and b; R_a and R_b are the corresponding sets of pixels; μ_a and μ_b are the mean pixel values over intervals a and b within the region;
redundant extremal regions are filtered out according to the area S, perimeter C, Euler number E and pixel-value variance H of each extremal region, and the remaining regions are the character candidate regions, the filtering being controlled by the following thresholds:
S_0, the threshold on the region area S; C_0, the threshold on the region perimeter C; E_0, the threshold on the region Euler number E; H_0, the threshold on the region pixel-value variance H.
The specific method for calculating the MSSH characteristics is as follows:
acquiring a stroke pixel pair and a stroke line segment of a character candidate area;
calculating the symmetrical feature description value of a certain stroke pixel pair in the character candidate region on the gray value and gradient attributes;
calculating the symmetrical characteristic description of all stroke line segments in the character candidate area on stroke width values, stroke sequence value distribution and low-frequency mode attributes;
connecting the characteristic values of different symmetric attributes to form MSSH characteristics, wherein the specific formula is as follows:
F_m(e_i) = [ F_j | j = V, G_m, G_o, Sw, Md, Pa ]
where: F_m(e_i) is the MSSH feature vector; [ ] denotes vector concatenation; e_i is the i-th character candidate region; F_j is the feature vector corresponding to symmetric attribute j; j denotes the specific type of symmetric attribute; V is the gray-value attribute; G_m is the gradient-magnitude attribute; G_o is the gradient-direction attribute; Sw is the stroke-width value; Md is the stroke-sequence value distribution; Pa is the low-frequency mode attribute.
The specific method for acquiring the stroke pixel pairs and the stroke line segments of the character candidate area comprises the following steps:
outputting an edge image by using a Canny edge detection operator;
calculating the gradient direction of a certain pixel point p on the stroke edge image;
following the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;
the stroke pixel pair is defined as { p, q }, and the stroke line segment is defined as the distance of ray r between pixel points p and q.
The specific method for calculating the deep convolution characteristics is as follows:
adjusting the size of the character candidate area to 64 x 64 pixel values;
constructing a convolutional neural network model comprising three stages;
the first-stage construction method comprises the following steps:
in the first stage, two convolution layers and one max-pooling layer are used in sequence. Each convolution layer uses 32 convolution kernels of size 3 × 3 with a displacement stride of 1 pixel, convolved with the character candidate region as follows:
g(a, b, k) = Σ_{m=-1..1} Σ_{n=-1..1} e_i(a + m, b + n) · h_k(m, n)
where g(a, b, k) is the value at row a, column b of the character candidate region after the k-th convolution; e_i(a + m, b + n) is the pixel value at row (a + m), column (b + n) of the i-th character candidate region; m and n are the row and column offsets of the pixel, taking values in {-1, 0, 1}; and h_k is the k-th convolution kernel. After each convolution layer, an activation value is computed with a nonlinear activation function:
f(a, b, k) = max(0, g(a, b, k))
where f(a, b, k) is the activation value at row a, column b after the k-th convolution, and max() takes the larger of its two arguments;
the activation value is then transmitted to a maximum pooling layer, which takes 2 pixels as a stride and takes the maximum value in a 2 × 2 spatial neighborhood as an output value;
the architecture of the second stage is the same as that of the first stage;
the third stage uses, in sequence, three convolutional layers, one max-pooling layer and one fully connected layer. The fully connected layer flattens the output of the max-pooling layer into a one-dimensional vector as its input and produces a 128-dimensional output:
F_d = W · X + B
where F_d is the generated 128-dimensional deep convolution feature, X is the one-dimensional vector obtained by flattening the output of the max-pooling layer, W is a weight matrix and B is a bias vector;
the convolutional neural network model is trained and tested; training determines the unknown parameters h_k, W and B, and the F_d generated during testing is used as the deep convolution feature of the character candidate region.
The method for acquiring the fusion characteristics comprises the following steps:
the weight ω_d of the trained convolutional neural network model is used as the initial fusion weight of the deep convolution feature F_d;
for the MSSH feature F_m, a logistic regression model f_τ() predicts its initial fusion weight ω_m and reduces the feature dimension, where the dimension-reduced MSSH feature is computed from the i-th character candidate region e_i and D is a small data set used to train the initial feature weights;
the fused feature F_s is then generated by a self-coding network f_μ(), the dimension-reduced MSSH feature being kept dimensionally consistent with F_d.
In the fusion training process, when the verification error rate stops decreasing, the joint training process of the self-coding network is finished.
The specific method for merging character areas is as follows:
Let S be the set of character regions; compute the centre point c_i of every region s_i in S.
For any two character regions s_i, s_j ∈ S, if the Euclidean distance between their centre points is less than a threshold F, connect a straight line l_ij between the two centre points.
Compute the angle α between every straight line and the horizontal, and take the mode α_mode of all angles; keep the lines whose angle lies in the interval [α_mode − π/6, α_mode + π/6] and remove the remaining lines.
Merge the character regions connected by the remaining lines to obtain the final text region.
Compared with the prior art, the invention has the following beneficial effects:
(1) The character candidate regions are described with both MSSH features and deep convolution features. The MSSH features are based on edge images and are highly robust to low resolution, image rotation, affine deformation and multi-language, multi-font variation; the deep convolution features require no manual intervention, strongly describe the appearance of the character candidate region, and remain robust under low resolution, image rotation and illumination change, since the overall appearance of the image changes little.
(2) The self-coding network used by the invention needs no manual intervention and automatically fuses the MSSH features and deep convolution features; the generated fused features combine the advantages of the individual features and are highly robust to low resolution, image rotation, affine deformation and complex backgrounds.
(3) The natural-scene character detection method is efficient, has low computational complexity, and can complete the character detection process quickly.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the computation of the deep convolution feature of FIG. 1;
FIG. 3 is a flow chart of feature fusion in FIG. 1;
FIG. 4 is a diagram of a text to be detected;
FIG. 5 shows the character candidate regions obtained from FIG. 4 after extremal-region filtering;
FIG. 6 shows the character regions obtained from FIG. 5 after feature fusion;
FIG. 7 shows the final text region obtained by merging the character regions of FIG. 6.
Detailed Description
The invention provides a character detection method: character candidate regions are obtained by extracting and filtering extremal regions, character regions are then further screened from the candidates through the fusion of MSSH features and deep convolution features, and the final text region is obtained by merging the character regions. The detection method is highly robust and efficient and can complete the character detection task quickly.
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in fig. 1, which is a flow chart of the present invention, the method of the present invention specifically includes the following steps:
the method comprises the following steps: inputting a character picture to be detected, and extracting an extreme value area of the character picture to be detected;
Firstly, the input RGB colour image is converted into a gray-scale image I_gray, a red-component R-value map I_R, a green-component G-value map I_G and a blue-component B-value map I_B;
Secondly, extremal regions A_R, A_G and A_B are obtained from I_R, I_G and I_B respectively. An extremal region is a region whose outer-boundary pixel values are strictly larger than the pixel values inside the region. Taking the R-value map I_R as an example, the extremal region A_R can be defined as:
∀ p ∈ A_R, ∀ q ∈ ∂A_R : I_R(q) > θ ≥ I_R(p)
where I_R(p) and I_R(q) are the values of pixels p and q in I_R, θ is the threshold of the extremal region, and ∂A_R is the set of pixels adjacent to, but not belonging to, the extremal region A_R;
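For illustration, the following is a minimal Python sketch of the per-channel extremal-region extraction described above, using OpenCV's MSER detector as a stand-in for the threshold-θ definition; the channel order and the use of MSER's default parameters are assumptions, not values fixed by the method.

```python
# Sketch: per-channel extremal-region extraction (A_R, A_G, A_B) using OpenCV's
# MSER detector as an approximation of the threshold-based definition above.
import cv2

def extract_extremal_regions(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)  # I_gray
    b, g, r = cv2.split(bgr_image)                      # I_B, I_G, I_R
    mser = cv2.MSER_create()                            # its 'delta' parameter plays the role of theta
    regions = []
    for channel in (r, g, b):
        point_sets, _ = mser.detectRegions(channel)
        regions.extend(point_sets)                      # each entry: array of (x, y) pixel coordinates
    return gray, regions

# gray, regions = extract_extremal_regions(cv2.imread("scene.jpg"))
```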
Then, the area S, perimeter C, Euler number E and pixel-value variance H of each extremal region are computed, the variance H being computed from the gray-scale image I_gray, where:
x denotes a pixel, I_gray(x) is the gray value of pixel x, a and b are the colour intervals containing the largest and the second-largest number of pixels in the extremal region, n_a and n_b are the numbers of pixels in intervals a and b, R_a and R_b are the corresponding sets of pixels, and μ_a and μ_b are the mean pixel values over intervals a and b within the region.
Step two: filtering the extreme value region to obtain a character candidate region;
redundant extremum regions are filtered through the area S, the perimeter C, the Euler number E and the pixel value variance H of each extremum region, and the rest of the filtered redundant extremum regions are character candidate regions. The filtration conditions were as follows:
where S_0, C_0, E_0 and H_0 are thresholds derived statistically from a large number of character and non-character regions: S_0, the threshold on the region area S, lies in the interval [80, 120]; C_0, the threshold on the region perimeter C, lies in [30, 50]; E_0, the threshold on the region Euler number E, lies in [0, 1]; and H_0, the threshold on the region pixel-value variance H, lies in [100, 200].
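A rough Python sketch of this filtering step is given below. The comparison directions and the variance computed over the two most populated gray-value bins are assumptions made for illustration, since only the threshold intervals are stated here; the concrete default thresholds are taken from the middles of those intervals.

```python
# Sketch: geometric filtering of extremal regions into character candidates.
import numpy as np
from skimage.measure import label, regionprops

def region_stats(mask, gray):
    """mask: boolean image of one extremal region; gray: I_gray."""
    props = regionprops(label(mask.astype(np.uint8)))[0]
    S, C, E = props.area, props.perimeter, props.euler_number
    vals = gray[mask]
    hist, edges = np.histogram(vals, bins=16)
    top_two = np.argsort(hist)[-2:]                 # the two most populated gray-value bins (a, b)
    H = 0.0
    for idx in top_two:
        in_bin = vals[(vals >= edges[idx]) & (vals < edges[idx + 1])]
        if in_bin.size:
            H += np.var(in_bin)                     # per-bin variance contribution
    return S, C, E, H

def is_candidate(S, C, E, H, S0=100, C0=40, E0=1, H0=150):
    # Comparison directions are an assumption; the patent only states the threshold intervals.
    return S >= S0 and C >= C0 and abs(E) <= E0 and H <= H0
```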
FIG. 4 shows an input text picture to be detected, and FIG. 5 shows the character candidate regions obtained from FIG. 4 after extremal-region filtering.
Step three: calculating MSSH characteristics and depth convolution characteristics;
the specific method for calculating the MSSH characteristics is as follows:
acquiring a stroke pixel pair and a stroke line segment of a character candidate area;
The stroke pixel pairs and stroke line segments of a character candidate region are obtained with the SWT (stroke width transform) algorithm, as follows (a combined code sketch of these steps and the Canny procedure is given after the Canny steps below):
(1) Outputting an edge image by using a Canny edge detection operator;
(2) Calculating the gradient direction of a certain pixel p on the stroke edge image;
(3) Following the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;
(4) The stroke pixel pair is defined as { p, q }, and the stroke line segment is defined as the distance of ray r between pixel points p and q.
The Canny edge detection algorithm has the following steps:
(1) Converting the character candidate area into a gray scale map;
(2) Performing Gaussian filtering on the obtained gray level image;
(3) Calculating the amplitude and direction of the gradient;
(4) Carrying out non-maximum suppression on the gradient amplitude;
(5) Edges are detected and connected using a dual threshold algorithm.
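The sketch below combines the Canny step and the ray-following step described above into one routine: it detects edges, computes the gradient direction at each edge pixel p, and follows the ray r until it meets another edge pixel q, returning the stroke pixel pairs {p, q} together with the stroke-segment lengths. The Canny thresholds, the maximum ray length, and the omission of the opposite-gradient consistency check used by the full SWT are simplifications.

```python
# Sketch of the SWT-style stroke extraction in steps (1)-(4) above.
import cv2
import numpy as np

def stroke_pairs(gray, low=50, high=150, max_len=60):
    edges = cv2.Canny(gray, low, high)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    h, w = gray.shape
    pairs = []
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        norm = np.hypot(gx[y, x], gy[y, x])
        if norm < 1e-6:
            continue
        dx, dy = gx[y, x] / norm, gy[y, x] / norm      # gradient direction at edge pixel p
        for step in range(1, max_len):                 # follow the ray r
            qx, qy = int(round(x + dx * step)), int(round(y + dy * step))
            if not (0 <= qx < w and 0 <= qy < h):
                break
            if edges[qy, qx]:                          # hit another stroke edge pixel q
                length = float(np.hypot(qx - x, qy - y))
                pairs.append(((x, y), (qx, qy), length))
                break
    return pairs
```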
Calculating the symmetrical feature description value of a certain stroke pixel pair in the character candidate region on the gray value and gradient attributes;
assuming { p, q } is a certain stroke pixel pair in the character candidate region, the calculation of the symmetric attribute description value based on the stroke pixel pair is as follows:
(1) The feature value F_j(p,q)_1 of the stroke pixel pair {p, q} on the gray-value and gradient-magnitude attributes is computed as:
F_j(p,q)_1 = f_h( | I_j(p) − I_j(q) | ),  j ∈ {V, G_m}
where I_j(p) and I_j(q) are the values of pixels p and q on symmetric attribute j; j denotes the specific type of symmetric attribute; {V, G_m} are the gray-value and gradient-magnitude attributes respectively; and f_h() denotes a histogram statistic operation.
(2) The feature value F_j(p,q)_2 of the stroke pixel pair {p, q} on the gradient-direction attribute is computed as:
F_j(p,q)_2 = f_h( cos⟨ I_j(p), I_j(q) ⟩ ),  j = G_o
where G_o is the gradient-direction attribute, cos⟨·,·⟩ denotes the inverse-cosine (angle) operation, and f_h() denotes a histogram statistic operation.
Calculating the symmetrical feature description of all stroke line segments in the character candidate region on stroke width values, stroke sequence value distribution and low-frequency mode attributes;
assuming that s represents a set of stroked line segments within a character candidate region, the calculation of the symmetry-attribute-describing value based on the set of stroked line segments is as follows:
(1) The feature value F_j(s) of the stroke line segment set on the stroke-width, stroke-sequence-distribution and low-frequency-mode attributes is computed as:
F_j(s) = f_h( f_ξ(s, j) ),  j ∈ {Sw, Md, Pa}
where f_h() denotes a histogram statistic operation and {Sw, Md, Pa} are the symmetric attributes of this class: the stroke-width value Sw, the stroke-sequence value distribution Md and the low-frequency mode attribute Pa. The function f_ξ(s, j) is defined over the stroke segment set s, where ‖·‖ denotes the Euclidean distance, D_s and M_s are the variance and mean of the gray values of the pixels contained in s, the Haar wavelet transform is applied with k denoting the wavelet decomposition level and n_l the highest scale level, and ω_k are predefined weight parameters. Here n_l is 1; for k = 0, ω_k is 0.1; for k = 1, ω_k is 0.3; for k = 2, ω_k is 0.5.
(2) The symmetric feature description values of a character candidate region on the stroke-width, stroke-sequence-distribution and low-frequency-mode attributes are then scaled proportionally into the range [0, 1].
Connecting the characteristic values of different symmetric attributes to form MSSH characteristics, wherein the specific formula is as follows:
F m (e i )=[F j |=V,G m ,o,w,Md,a]
wherein: f m (e i ) Values representing MSSH feature vectors; []Representing a vector join operation; e.g. of the type i Representing the ith character candidate area; f j Representing symmetric propertiesThe corresponding feature vector; j represents a specific type of symmetry property; v represents a gray value; g m Representing a gradient magnitude attribute; g o Representing a gradient direction property; sw represents a stroke width value; md represents the stroke sequence value distribution; pa denotes a low frequency mode attribute.
As shown in fig. 2, it is a flowchart for calculating depth convolution features, and the method for calculating depth convolution features is as follows:
adjusting the size of the character candidate area to 64 x 64 pixel values;
constructing a convolutional neural network model comprising three stages;
the first-stage construction method comprises the following steps:
in the first stage, two convolutional layers and one max-pooling layer are used in sequence. Each convolutional layer uses 32 convolution kernels of size 3 × 3 with a displacement stride of 1 pixel, convolved with the character candidate region as follows:
g(a, b, k) = Σ_{m=-1..1} Σ_{n=-1..1} e_i(a + m, b + n) · h_k(m, n)
where g(a, b, k) is the value at row a, column b of the character candidate region after the k-th convolution; e_i(a + m, b + n) is the pixel value at row (a + m), column (b + n) of the i-th character candidate region; m and n are the row and column offsets of the pixel, taking values in {-1, 0, 1}; and h_k is the k-th convolution kernel. After each convolutional layer, an activation value is computed with a nonlinear activation function:
f(a, b, k) = max(0, g(a, b, k))
where f(a, b, k) is the activation value at row a, column b after the k-th convolution, and max() takes the larger of its two arguments;
the activation value is then transmitted to a maximum pooling layer, which takes 2 pixels as a stride and takes the maximum value in a 2 × 2 spatial neighborhood as an output value;
the structure of the second stage is the same as that of the first stage;
the third stage uses, in sequence, three convolutional layers, one max-pooling layer and one fully connected layer. The fully connected layer flattens the output of the max-pooling layer into a one-dimensional vector as its input and produces a 128-dimensional output:
F_d = W · X + B
where F_d is the generated 128-dimensional deep convolution feature, X is the one-dimensional vector obtained by flattening the output of the max-pooling layer, W is a weight matrix and B is a bias vector;
The model is used in two processes: training and testing. Training determines the unknown parameters h_k, W and B; testing generates the deep convolution feature F_d of the character candidate region.
During training, every character candidate region is given a label: label 0 means the candidate is not a character region, label 1 means it is. The deep convolution feature F_d is connected by a fully connected layer to a two-dimensional label vector whose entries correspond to labels 0 and 1. Training ends when the labels predicted by the network for the character candidate regions no longer change, at which point h_k, W and B are fixed to their learned values.
During testing, the F_d produced by stages one, two and three of the convolutional neural network is used as the deep convolution feature of the character candidate region.
The output of the third stage is the 128-dimensional deep convolution feature F_d; during training this feature is connected to the fully connected layer, which outputs whether the character candidate region is text or non-text.
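A PyTorch sketch of the three-stage network described above is given below. The grayscale single-channel input and the use of 32 kernels in the third-stage convolutions are assumptions (the text does not state them); everything else follows the stage layout above: two 3 × 3 / 32 convolutions plus 2 × 2 max pooling in stages one and two, three convolutions plus max pooling plus a 128-dimensional fully connected layer in stage three, and a text/non-text head used only during training.

```python
# Sketch of the three-stage deep convolution feature extractor.
import torch
import torch.nn as nn

class DeepConvFeature(nn.Module):
    def __init__(self):
        super().__init__()
        def conv(in_c, out_c):
            return nn.Sequential(nn.Conv2d(in_c, out_c, 3, stride=1, padding=1), nn.ReLU())
        self.stage1 = nn.Sequential(conv(1, 32), conv(32, 32), nn.MaxPool2d(2, 2))   # 64 -> 32
        self.stage2 = nn.Sequential(conv(32, 32), conv(32, 32), nn.MaxPool2d(2, 2))  # 32 -> 16
        self.stage3 = nn.Sequential(conv(32, 32), conv(32, 32), conv(32, 32),
                                    nn.MaxPool2d(2, 2))                              # 16 -> 8
        self.fc = nn.Linear(32 * 8 * 8, 128)      # F_d = W.X + B, 128-dimensional
        self.classifier = nn.Linear(128, 2)       # text / non-text head, used only for training

    def forward(self, x):                         # x: (N, 1, 64, 64) resized candidate regions
        x = self.stage3(self.stage2(self.stage1(x)))
        f_d = self.fc(x.flatten(1))               # deep convolution feature
        return f_d, self.classifier(f_d)
```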
Step four: fusing MSSH characteristics and deep convolution characteristics through a self-coding neural network to obtain fused characteristics;
as shown in fig. 3, is a flow chart of feature fusion, comprising the following steps:
First, in the fusion process, the weight ω_d of the trained convolutional neural network model is used as the initial fusion weight of the deep convolution feature F_d.
Then, for the MSSH feature F_m, a logistic regression model f_τ() is used to predict its initial fusion weight ω_m and to reduce the feature dimension, where the dimension-reduced MSSH feature is obtained from the candidate region's MSSH feature and D is a small data set used to train the initial feature weights.
Finally, the MSSH feature and the deep convolution feature are fused by the self-coding network f_μ() to generate the fused feature F_s; the dimension-reduced MSSH feature is kept dimensionally consistent with F_d.
In the fusion training process, when the verification error rate stops decreasing, the joint training process of the self-coding network is finished.
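The following is a minimal sketch, under assumed layer sizes, of how a self-coding (autoencoder) network could fuse the weighted dimension-reduced MSSH feature and the deep convolution feature into a fused feature F_s of the same dimension as F_d; it is not the patent's exact network.

```python
# Sketch: fusing the weighted MSSH feature and the deep convolution feature.
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    def __init__(self, mssh_dim=128, deep_dim=128, fused_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mssh_dim + deep_dim, fused_dim), nn.ReLU())
        self.decoder = nn.Linear(fused_dim, mssh_dim + deep_dim)

    def forward(self, f_m, f_d, w_m=1.0, w_d=1.0):
        x = torch.cat([w_m * f_m, w_d * f_d], dim=1)   # weighted concatenation
        f_s = self.encoder(x)                          # fused feature F_s
        return f_s, self.decoder(f_s)                  # reconstruction used for joint training

# Joint training would minimise the reconstruction error on a validation split and
# stop when the validation error rate stops decreasing, as stated above.
```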
Step five: further screening out a character region from the character candidate region according to the fusion characteristics;
and inputting the fusion characteristics of the character candidate region into a pre-trained logistic regression classifier, and judging whether the character candidate region is a real character region or not.
The training steps of the logistic regression classifier are as follows:
(1) Take the general scene-text detection data set ICDAR 2013 (scene subset), compute the fused features of all candidate character regions of the data set as described above, and use them as the training set.
(2) Input the training set into a logistic regression algorithm and train it as a binary classification problem.
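A short sketch of this training step with scikit-learn is shown below; the feature and label arrays are placeholders standing in for the fused features and text/non-text labels computed from the ICDAR 2013 scene data set.

```python
# Sketch: training the logistic regression classifier on fused features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.random.rand(1000, 128)       # placeholder for fused features F_s (one row per candidate)
y_train = np.random.randint(0, 2, 1000)   # placeholder labels: 1 = text region, 0 = non-text

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
is_text = clf.predict(X_train[:5])        # screen candidate regions at test time
```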
FIG. 6 shows the character regions obtained from FIG. 5 after feature-fusion screening.
Step six: and merging all the character areas to obtain a final character area.
First, for the set of character regions S, the centre point c_i of every region s_i is computed.
Second, for any two character regions s_i, s_j ∈ S, if the Euclidean distance between the centre points c_i and c_j is less than the threshold F, a straight line l_ij is connected between c_i and c_j; preferably, F is 5.
Then, the angle α between every straight line l and the horizontal is computed and the mode α_mode of all angles is taken. Lines whose angle lies in the interval [α_mode − π/6, α_mode + π/6] are kept and the remaining lines are removed. FIG. 7 shows the final text region obtained by merging the character regions connected by the remaining lines in FIG. 6.
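The sketch below illustrates this merging step on bounding boxes: centres closer than F are linked, only links whose angle is within ±π/6 (30°) of the modal angle are kept, and the linked regions are merged with a small union-find. The box representation, the degree rounding used to take the mode, and ignoring the 0°/180° wrap-around are simplifications for illustration.

```python
# Sketch: merging character regions into text regions.
import numpy as np
from collections import Counter

def merge_regions(boxes, F=5.0):
    boxes = np.asarray(boxes, dtype=float)                       # (N, 4) as x0, y0, x1, y1
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    n = len(boxes)
    links, angles = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centres[i] - centres[j]) < F:      # centres close enough
                dy = centres[j][1] - centres[i][1]
                dx = centres[j][0] - centres[i][0]
                links.append((i, j))
                angles.append(round(np.degrees(np.arctan2(dy, dx)) % 180))
    if not links:
        return [tuple(b) for b in boxes]
    mode_angle = Counter(angles).most_common(1)[0][0]
    keep = [l for l, a in zip(links, angles) if abs(a - mode_angle) <= 30]  # +/- pi/6

    parent = list(range(n))                                      # union-find over kept links
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in keep:
        parent[find(i)] = find(j)

    merged = {}
    for i in range(n):
        r = find(i)
        x0, y0, x1, y1 = merged.get(r, boxes[i])
        merged[r] = (min(x0, boxes[i][0]), min(y0, boxes[i][1]),
                     max(x1, boxes[i][2]), max(y1, boxes[i][3]))
    return list(merged.values())
```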
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A character detection method is characterized by comprising the following steps:
extracting an extreme value area of the character picture to be detected, and filtering the extreme value area to obtain a character candidate area;
calculating MSSH characteristics and depth convolution characteristics, and fusing the MSSH characteristics and the depth convolution characteristics through a self-coding neural network to obtain fused characteristics;
further screening out a character region from the character candidate region according to the fusion characteristic;
and merging all the character areas to obtain a final character area.
2. The text detection method according to claim 1, wherein the specific method for extracting the extremum region is as follows:
converting the character picture to be detected into a gray-level image I_gray, an R-value map I_R, a G-value map I_G and a B-value map I_B;
obtaining extremal regions from I_R, I_G and I_B respectively, as follows:
the extremal region A_R of the R-value map I_R is defined as:
∀ p ∈ A_R, ∀ q ∈ ∂A_R : I_R(q) > θ ≥ I_R(p)
where I_R(p) is the value of pixel p in the R-value map, I_R(q) is the value of pixel q in the R-value map, θ is the threshold of the extremal region, and ∂A_R is the set of pixels adjacent to, but not belonging to, the extremal region A_R;
the extremal region A_G of the G-value map I_G is defined as:
∀ p ∈ A_G, ∀ q ∈ ∂A_G : I_G(q) > θ ≥ I_G(p)
where I_G(p) is the value of pixel p in the G-value map, I_G(q) is the value of pixel q in the G-value map, θ is the threshold of the extremal region, and ∂A_G is the set of pixels adjacent to, but not belonging to, the extremal region A_G;
the extremal region A_B of the B-value map I_B is defined as:
∀ p ∈ A_B, ∀ q ∈ ∂A_B : I_B(q) > θ ≥ I_B(p)
where I_B(p) is the value of pixel p in the B-value map, I_B(q) is the value of pixel q in the B-value map, θ is the threshold of the extremal region, and ∂A_B is the set of pixels adjacent to, but not belonging to, the extremal region A_B.
3. The character detection method of claim 2, wherein the character candidate regions are obtained by:
calculating the area S, the perimeter C, the Euler number E and the pixel-value variance H of each extremal region, the variance H being computed from the gray-level image I_gray, where:
x denotes a pixel; I_gray(x) is the gray value of pixel x; a is the colour interval containing the largest number of pixels in the extremal region; b is the colour interval containing the second-largest number of pixels; n_a and n_b are the numbers of pixels in intervals a and b; R_a and R_b are the corresponding sets of pixels in the extremal region; μ_a and μ_b are the mean pixel values over intervals a and b within the extremal region;
redundant extremal regions are filtered out according to the area S, perimeter C, Euler number E and pixel-value variance H of each extremal region, and the remaining regions are the character candidate regions, the filtering being controlled by the following thresholds:
S_0, the threshold on the region area S; C_0, the threshold on the region perimeter C; E_0, the threshold on the region Euler number E; H_0, the threshold on the region pixel-value variance H.
4. The text detection method of claim 1, wherein the specific method for calculating the MSSH features is as follows:
acquiring a stroke pixel pair and a stroke line segment of a character candidate area;
calculating the symmetrical feature description value of a certain stroke pixel pair in the character candidate region on the gray value and gradient attribute;
calculating the symmetrical feature description of all stroke line segments in the character candidate region on stroke width values, stroke sequence value distribution and low-frequency mode attributes;
connecting the characteristic values of different symmetric attributes to form MSSH characteristics, wherein the specific formula is as follows:
F_m(e_i) = [ F_j | j = V, G_m, G_o, Sw, Md, Pa ]
where: F_m(e_i) is the MSSH feature vector; [ ] denotes vector concatenation; e_i is the i-th character candidate region; F_j is the feature vector corresponding to symmetric attribute j; j denotes the specific type of symmetric attribute; V is the gray-value attribute; G_m is the gradient-magnitude attribute; G_o is the gradient-direction attribute; Sw is the stroke-width value; Md is the stroke-sequence value distribution; Pa is the low-frequency mode attribute.
5. The text detection method of claim 4, wherein the specific method for obtaining the stroke pixel pairs and the stroke line segments of the character candidate area is as follows:
outputting an edge image by using a Canny edge detection operator;
calculating the gradient direction of a certain pixel point p on the stroke edge image;
following the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;
the stroke pixel pair is defined as { p, q }, and the stroke line segment is defined as the distance of ray r between pixel points p and q.
6. The text detection method of claim 4, wherein the specific method for calculating the deep convolution features is as follows:
adjusting the size of the character candidate area to 64 x 64 pixel values;
constructing a convolutional neural network model comprising three stages;
the first-stage construction method comprises the following steps:
in the first stage, two convolution layers and one max-pooling layer are used in sequence. Each convolution layer uses 32 convolution kernels of size 3 × 3 with a displacement stride of 1 pixel, convolved with the character candidate region as follows:
g(a, b, k) = Σ_{m=-1..1} Σ_{n=-1..1} e_i(a + m, b + n) · h_k(m, n)
where g(a, b, k) is the value at row a, column b of the character candidate region after the k-th convolution; e_i(a + m, b + n) is the pixel value at row (a + m), column (b + n) of the i-th character candidate region; m and n are the row and column offsets of the pixel, taking values in {-1, 0, 1}; and h_k is the k-th convolution kernel. After each convolution layer, an activation value is computed with a nonlinear activation function:
f(a, b, k) = max(0, g(a, b, k))
where f(a, b, k) is the activation value at row a, column b after the k-th convolution, and max() takes the larger of its two arguments;
the activation value is then transmitted to a maximum pooling layer, which takes 2 pixels as a stride and takes the maximum value in a 2 × 2 spatial neighborhood as an output value;
the architecture of the second stage is the same as that of the first stage;
the third stage uses, in sequence, three convolutional layers, one max-pooling layer and one fully connected layer, the fully connected layer flattening the output of the max-pooling layer into a one-dimensional vector as its input and producing a 128-dimensional output:
F_d = W · X + B
where F_d is the generated 128-dimensional deep convolution feature, X is the one-dimensional vector obtained by flattening the output of the max-pooling layer, W is a weight matrix and B is a bias vector;
the convolutional neural network model is trained and tested, training determines the unknown parameters h_k, W and B, and the F_d generated during testing is used as the deep convolution feature of the character candidate region.
7. The text detection method according to claim 6, wherein the method for obtaining the fusion feature comprises:
the weight ω_d of the trained convolutional neural network model is used as the initial fusion weight of the deep convolution feature F_d;
for the MSSH feature F_m, a logistic regression model f_τ() predicts its initial fusion weight ω_m and reduces the feature dimension, where the dimension-reduced MSSH feature is obtained from the i-th character candidate region e_i and D is a small data set used to train the initial feature weights;
the fused feature F_s is then generated by a self-coding network f_μ(), the dimension-reduced MSSH feature being kept dimensionally consistent with F_d.
8. The text detection method of claim 7, wherein in the fusion training process, when the verification error rate stops decreasing, the joint training process of the self-coding network is ended.
9. The character detection method of claim 1, wherein the specific method for merging character areas is as follows:
letting S be the set of character regions, computing the centre point c_i of every region s_i in S;
for any two character regions s_i, s_j ∈ S, if the Euclidean distance between the two centre points is less than a threshold F, connecting a straight line l_ij between the two centre points;
computing the angle α between every straight line and the horizontal, taking the mode α_mode of all angles, keeping the lines whose angle lies in the interval [α_mode − π/6, α_mode + π/6] and removing the remaining lines;
and merging the character regions connected by straight lines to obtain the final text region.
CN201711267804.8A 2017-12-05 2017-12-05 A kind of character detecting method Pending CN108038486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711267804.8A CN108038486A (en) 2017-12-05 2017-12-05 A kind of character detecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711267804.8A CN108038486A (en) 2017-12-05 2017-12-05 A kind of character detecting method

Publications (1)

Publication Number Publication Date
CN108038486A true CN108038486A (en) 2018-05-15

Family

ID=62095092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711267804.8A Pending CN108038486A (en) 2017-12-05 2017-12-05 A kind of character detecting method

Country Status (1)

Country Link
CN (1) CN108038486A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN108961218A (en) * 2018-06-11 2018-12-07 无锡维胜威信息科技有限公司 Solar silicon wafer crystal flower extraction method
CN109409224A (en) * 2018-09-21 2019-03-01 河海大学 A method for detecting flame in natural scenes
CN110188622A (en) * 2019-05-09 2019-08-30 新华三信息安全技术有限公司 A kind of text location method, apparatus and electronic equipment
CN110909728A (en) * 2019-12-03 2020-03-24 中国太平洋保险(集团)股份有限公司 Control algorithm and device for multilingual policy automatic identification
CN113807351A (en) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene character detection method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene text detection method based on multichannel extremal regions

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066972A (en) * 2017-04-17 2017-08-18 武汉理工大学 Natural scene text detection method based on multichannel extremal regions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIRUI WU et al.: "A Robust Symmetry-based Method for Scene/Video Text Detection Through Neural Network", 2017 14th IAPR International Conference on Document Analysis and Recognition *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN108961218A (en) * 2018-06-11 2018-12-07 无锡维胜威信息科技有限公司 Solar silicon wafer crystal flower extraction method
CN108961218B (en) * 2018-06-11 2021-07-02 无锡维胜威信息科技有限公司 Solar silicon wafer crystal flower extraction method
CN109409224A (en) * 2018-09-21 2019-03-01 河海大学 A method for detecting flame in natural scenes
CN109409224B (en) * 2018-09-21 2023-09-05 河海大学 Method for detecting flame in natural scene
CN110188622A (en) * 2019-05-09 2019-08-30 新华三信息安全技术有限公司 A kind of text location method, apparatus and electronic equipment
CN110188622B (en) * 2019-05-09 2021-08-06 新华三信息安全技术有限公司 Character positioning method and device and electronic equipment
CN110909728A (en) * 2019-12-03 2020-03-24 中国太平洋保险(集团)股份有限公司 Control algorithm and device for multilingual policy automatic identification
CN113807351A (en) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene character detection method and device
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device

Similar Documents

Publication Publication Date Title
CN108038486A (en) A kind of character detecting method
Jiang et al. A deep learning approach for fast detection and classification of concrete damage
CN112884064B (en) Target detection and identification method based on neural network
CN111401384B (en) Transformer equipment defect image matching method
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
CN107506765B (en) License plate inclination correction method based on neural network
CN112990077B (en) Face action unit identification method and device based on joint learning and optical flow estimation
CN109840483B (en) Landslide crack detection and identification method and device
CN112330593A (en) Building surface crack detection method based on deep learning network
CN110751154B (en) Complex environment multi-shape text detection method based on pixel-level segmentation
CN111860439A (en) Unmanned aerial vehicle inspection image defect detection method, system and equipment
Dong et al. Infrared image colorization using a s-shape network
Özkanoğlu et al. InfraGAN: A GAN architecture to transfer visible images to infrared domain
CN111680690B (en) Character recognition method and device
US20220366682A1 (en) Computer-implemented arrangements for processing image having article of interest
CN114155527A (en) Scene text recognition method and device
CN111709980A (en) Multi-scale image registration method and device based on deep learning
CN116279592A (en) Method for dividing travelable area of unmanned logistics vehicle
Majumder et al. A tale of a deep learning approach to image forgery detection
CN113326846B (en) Rapid bridge apparent disease detection method based on machine vision
CN112053407B (en) Automatic lane line detection method based on AI technology in traffic law enforcement image
CN111753714B (en) Multidirectional natural scene text detection method based on character segmentation
CN112766353A (en) Double-branch vehicle re-identification method for enhancing local attention
CN112446292B (en) 2D image salient object detection method and system
Miah A real time road sign recognition using neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515

RJ01 Rejection of invention patent application after publication