CN108038486A

CN108038486A - A kind of character detecting method

Info

Publication number: CN108038486A
Application number: CN201711267804.8A
Authority: CN
Inventors: 巫义锐; 黄多辉; 冯钧
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2018-05-15

Abstract

The invention discloses a character detection method, which comprises: extracting the extremum region of the character picture to be detected, filtering the extremum region to obtain character candidate regions; calculating MSSH features and deep convolution features, and passing through the self-encoding neural network The MSSH feature and the deep convolution feature are fused to obtain the fusion feature; the character area is further screened out from the character candidate area according to the fusion feature; all the character areas are merged to obtain the final text area. The detection method of the invention has strong robustness, high detection efficiency, and can quickly complete the text detection task.

Description

A Text Detection Method

技术领域technical field

本发明涉及一种文字检测方法。The invention relates to a character detection method.

背景技术Background technique

文字作为人类最具影响力的发明之一，在人类生活中发挥了重要作用。文字中所包含的丰富而精确的信息，对基于视觉语义的自然场景理解应用具有重大意义。越来越多的多媒体应用程序，例如街道场景理解，无人驾驶汽车对于交通标识的理解和基于语义的图像检索等，均需要准确而鲁棒的文字检测。文字检测的基本任务在于确定场景图像与视频中是否存在文字，如果存在，则标记它的位置。近年来，随着图像获取设备能力和数量的增加，包含场景文字的图像与视频数量相较于过去急剧增加。因此，在自然场景图像与视频中进行文字检测已经受到越来越多的关注。随着计算机视觉相关技术的逐步深入研究，如何利用计算机算法进行场景文字检测已经成为重要和活跃的国际性前沿课题之一。As one of the most influential inventions of mankind, writing has played an important role in human life. The rich and precise information contained in text is of great significance to the application of natural scene understanding based on visual semantics. Accurate and robust text detection is required for more and more multimedia applications, such as street scene understanding, traffic sign understanding for driverless cars, and semantic-based image retrieval. The basic task of text detection is to determine whether there is text in scene images and videos, and if so, mark its location. In recent years, with the increase in the capability and number of image acquisition devices, the number of images and videos containing scene text has increased dramatically compared to the past. Therefore, text detection in natural scene images and videos has received more and more attention. With the gradual and in-depth research of computer vision related technologies, how to use computer algorithms for scene text detection has become one of the important and active international frontier topics.

低质量和复杂背景的场景文字检测与识别极具挑战性。场景文字常具有分辨率低、复杂背景、任意方向、透视变形和光照不均匀等特性，而文档文字拥有统一的格式与单一的背景。Scene text detection and recognition with low quality and complex backgrounds is extremely challenging. Scene text often has the characteristics of low resolution, complex background, arbitrary orientation, perspective deformation, and uneven lighting, while document text has a uniform format and a single background.

发明内容Contents of the invention

本发明在于克服现有技术中的不足，提供一种文字检测方法，解决现有技术中文字检测成功率低、鲁棒性不强的技术问题。The present invention overcomes the deficiencies in the prior art, provides a character detection method, and solves the technical problems of low character detection success rate and weak robustness in the prior art.

为解决上述技术问题，本发明所采用的技术方案是：一种文字检测方法，该方法包括如下步骤：In order to solve the above-mentioned technical problems, the technical solution adopted in the present invention is: a kind of character detection method, and this method comprises the steps:

提取待检测文字图片的极值区域，对极值区域进行过滤，得到字符候选区域；Extract the extreme value area of the text image to be detected, filter the extreme value area, and obtain the character candidate area;

计算MSSH特征、深度卷积特征，通过自编码神经网络将MSSH特征、深度卷积特征融合，得到融合特征；Calculate MSSH features and deep convolution features, and fuse MSSH features and deep convolution features through self-encoding neural network to obtain fusion features;

根据融合特征进一步从字符候选区域中筛选出字符区域；Further filter out the character region from the character candidate region according to the fusion feature;

合并所有字符区域得到最终的文字区域。Merge all character regions to get the final text region.

提取极值区域的具体方法如下：The specific method of extracting the extreme value area is as follows:

将待检测文字图片转化为灰度图I_gray、R值图I_R、G值图I_G和B值图I_B；Convert the text image to be detected into a grayscale image I _gray , an R value image I _R , a G value image I _G and a B value image I _B ;

分别对I_R，I_G，I_B求极值区域，具体如下：Calculate the extreme value area for I _R , I _G , and I _B respectively, as follows:

R值图I_R的极值区域A_R定义为：The extreme value area A _R of the R value graph I _R is defined as:

其中I_R(p)表示R值图中像素点p的值；I_R(q)表示R值图中像素点q的值；θ表示极值区域的阈值；表示与极值区域A_R相邻但不属于极值区域A_R的像素集合；Among them, I _R (p) represents the value of pixel point p in the R value map; I _R (q) represents the value of pixel point q in the R value map; θ represents the threshold value of the extreme value region; Represents a set of pixels adjacent to the extreme value area _AR but not belonging to the extreme value area _AR ;

G值图I_G的极值区域A_G定义为：The extreme value region A _G of the G value graph I _G is defined as:

其中I_G(p)表示G值图中像素点p的值；I_G(q)表示G值图中像素点q的值；θ表示极值区域的阈值，表示与极值区域A_G相邻但不属于极值区域A_G的像素集合；Among them, I _G (p) represents the value of pixel point p in the G-value map; I _G (q) represents the value of pixel point q in the G-value map; θ represents the threshold value of the extreme value region, Represents a set of pixels adjacent to the extremum area _AG but not belonging to the extremum area _AG ;

B值图I_B的极值区域A_B定义为：The extreme value area A _B of the B value graph I _B is defined as:

其中I_B(p)表示B值图中像素点p的值；I_B(q)表示B值图中像素点q的值；θ表示极值区域的阈值，表示与极值区域A_B相邻但不属于极值区域A_B的像素集合。Among them, I _B (p) represents the value of pixel point p in the B-value map; I _B (q) represents the value of pixel point q in the B-value map; θ represents the threshold value of the extreme value region, Indicates the set of pixels adjacent to the extremum region A _B but not belonging to the extremum region A _B.

获取字符候选区域的方法如下：The method of obtaining character candidate regions is as follows:

计算每个极值区域的面积S、周长C、欧拉数E、像素值方差H，其中像素值方差H是通过灰度图I_gray计算得到的，其计算公式为：Calculate the area S, perimeter C, Euler number E, and pixel value variance H of each extreme value region, where the pixel value variance H is calculated through the grayscale image I _gray , and its calculation formula is:

其中：x表示一个像素点；I_gray(x)表示像素点x的灰度值；a表示极值区域像素个数最多的颜色区间；b表示极值区域像素个数次多的颜色区间；n_a表示极值区域中处于颜色区间a的像素个数；n_b表示极值区域中处于颜色区间b的像素个数；R_a表示极值区域中处于颜色区间a的像素集合；R_b表示极值区域中处于颜色区间b的像素集合；μ_a表示极值区域中处于颜色区间a的像素值的平均值；μ_b表示极值区域中处于颜色区间b的像素值的平均值；Wherein: x represents a pixel point; I _gray (x) represents the grayscale value of pixel point x; a represents the color interval with the largest number of pixels in the extreme value region; b represents the color interval with the largest number of pixels in the extreme value region; n _a represents the number of pixels in the color interval a in the extreme value region; n _b represents the number of pixels in the color interval b in the extreme value region; R _a represents the set of pixels in the color interval a in the extreme value region; R _b represents the extreme value region The set of pixels in the color interval b in the value area; μ _a represents the average value of the pixel values in the color interval a in the extreme value area; μ _b represents the average value of the pixel values in the color interval b in the extreme value area;

通过每个极值区域的面积S、周长C、欧拉数E、像素值方差H过滤掉多余的极值区域，过滤掉多余的极值区域之后剩下的即为字符候选区域，过滤条件如下：Use the area S, perimeter C, Euler number E, and pixel value variance H of each extreme value area to filter out redundant extreme value areas. After filtering out redundant extreme value areas, the rest is the character candidate area. Filter conditions as follows:

其中，S₀表示极值区域面积S的阈值；C₀表示极值区域周长的阈值；E₀表示极值区域欧拉数的阈值；H₀表示极值区域像素值方差的阈值。Among them, S ₀ represents the threshold value of the area S of the extreme value region; C ₀ represents the threshold value of the circumference of the extreme value region; E ₀ represents the threshold value of the Euler number in the extreme value region; H ₀ represents the threshold value of the variance of the pixel value in the extreme value region.

计算MSSH特征的具体方法如下：The specific method of calculating MSSH features is as follows:

获取字符候选区域的笔画像素对和笔画线段；Obtain stroke pixel pairs and stroke line segments of the character candidate region;

计算字符候选区域的某一笔画像素对在灰度值、梯度属性上的对称特征描述值；Calculate the symmetrical feature description value of a certain stroke pixel pair in the gray value and gradient attribute of the character candidate area;

计算字符候选区域内的所有笔画线段在笔画宽度值、笔画序列值分布和低频模式属性上的对称特征描述；Calculate the symmetrical feature description of all stroke line segments in the character candidate area on the stroke width value, stroke sequence value distribution and low-frequency pattern attributes;

将不同对称属性的特征值连接形成MSSH特征，具体公式表示如下：The eigenvalues of different symmetric attributes are connected to form the MSSH feature, and the specific formula is expressed as follows:

F_m(e_i)＝[F_j|＝V,G_m,G_o,Sw,Md,Pa]F _m (e _i )=[F _j |=V,G _m ,G _o ,Sw,Md,Pa]

其中：F_m(e_i)表示MSSH特征向量的值；[]表示向量连接操作；e_i表示第i个字符候选区域；F_j表示对称属性所对应特征向量；j表示对称属性的具体类型；V表示灰度值；G_m表示梯度大小属性；G_o表示梯度方向属性；Sw表示笔画宽度值；Md表示笔画序列值分布；Pa表示低频模式属性。Among them: F _m (e _i ) represents the value of the MSSH feature vector; [] represents the vector connection operation; e _i represents the ith character candidate region; F _j represents the feature vector corresponding to the symmetric attribute; j represents the specific type of the symmetric attribute; V represents the gray value; G _m represents the gradient size attribute; G _o represents the gradient direction attribute; Sw represents the stroke width value; Md represents the stroke sequence value distribution; Pa represents the low-frequency pattern attribute.

获取字符候选区域的笔画像素对和笔画线段的具体方法如下：The specific method of obtaining the stroke pixel pairs and stroke line segments of the character candidate area is as follows:

使用Canny边缘检测算子输出边缘图像；Use the Canny edge detection operator to output the edge image;

计算笔画边缘图像上某一像素点p的梯度方向；Calculate the gradient direction of a certain pixel point p on the stroke edge image;

跟随由梯度方向所确定的射线r，直到射线与另一个笔画边缘像素点q相遇；Follow the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;

笔画像素对被定义为{p,q},笔画线段被定义为射线r在像素点p与q之间的距离。A stroke pixel pair is defined as {p, q}, and a stroke line segment is defined as the distance of ray r between pixel points p and q.

计算深度卷积特征的具体方法如下：The specific method of calculating the depth convolution feature is as follows:

将字符候选区域大小调整为64×64像素值；Resize the character candidate area to a 64×64 pixel value;

构造包含三阶段的卷积神经网络模型；Construct a three-stage convolutional neural network model;

一阶段构造方法如下：The one-stage construction method is as follows:

一阶段顺序使用两个卷积层与一个最大池化层，其中，卷积层均采用32个尺寸为3×3的卷积核，1个像素为位移偏移量，与字符候选区域进行卷积运算，具体公式如下：In the first stage, two convolutional layers and a maximum pooling layer are used sequentially. Among them, the convolutional layer uses 32 convolution kernels with a size of 3×3, and 1 pixel is the displacement offset, which is convolved with the character candidate area. Product operation, the specific formula is as follows:

其中g(a,b,k)表示字符候选区域中的第a行第b列像素值经第k个卷积运算的值；e_i(a+m,b+n)表示第i个字符候选区域中第(a+m)行第(b+n)列像素值；m表示像素的行偏移量，n表示像素的列偏移量，其取值集合为{-1,0,1},h_k表示第k个卷积核；各卷积层运算后，均使用非线性激活函数计算激活值，具体公式如下：Where g(a,b,k) represents the value of the pixel value of row a and column b in the character candidate area after the kth convolution operation; e _i (a+m,b+n) represents the i-th character candidate The pixel value of row (a+m) and column (b+n) in the area; m represents the row offset of the pixel, n represents the column offset of the pixel, and its value set is {-1,0,1} , h _k represents the kth convolution kernel; after each convolution layer operation, the activation value is calculated using a nonlinear activation function, the specific formula is as follows:

f(a,b,k)＝max(0,g(a,b,k))f(a,b,k)=max(0,g(a,b,k))

f(a,b,k)表示字符候选区域中的第a行第b列像素值经第k个卷积运算后的激活值；max()表示取大值函数；f(a,b,k) represents the activation value of the pixel value of row a and column b in the character candidate area after the kth convolution operation; max() represents the function of taking a large value;

激活值而后传递至最大池化层，该层以2个像素作为步幅，取2×2空间邻域内的最大值作为输出值；The activation value is then passed to the maximum pooling layer, which uses 2 pixels as the stride and takes the maximum value in the 2×2 spatial neighborhood as the output value;

二阶段的架构与一阶段的架构相同；The structure of the second stage is the same as that of the first stage;

三阶段顺序使用三个卷积层、一个最大池化层与一个全连接层，其中全连接层将最大池化层的输出连接成为一个一维向量作为输入，并将输出控制在128维，其公式可表示如下：The three-stage sequence uses three convolutional layers, a maximum pooling layer, and a fully connected layer. The fully connected layer connects the output of the maximum pooling layer into a one-dimensional vector as input, and controls the output to 128 dimensions. The formula can be expressed as follows:

F_d＝W·X+BF _d ＝W·X+B

其中：F_d为所生成的128维深度卷积特征，X为将最大池化层的输出连接后所得到的一维向量，W为权值矩阵，B为偏移向量；Among them: F _d is the generated 128-dimensional deep convolution feature, X is the one-dimensional vector obtained after connecting the output of the maximum pooling layer, W is the weight matrix, and B is the offset vector;

对卷积神经网络模型进行训练和测试，通过训练确定未知参数h_k，W与B的取值，经测试所生成的F_d作为字符候选区域的深度卷积特征。The convolutional neural network model is trained and tested, the values of unknown parameters h _k , W and B are determined through training, and the F _d generated by the test is used as the deep convolution feature of the character candidate area.

获取融合特征的方法如下：The method of obtaining fusion features is as follows:

使用训练后卷积神经网络模型的权重ω_d作为深层卷积特征F_d的初始融合权重值；Use the weight ω _d of the convolutional neural network model after training as the initial fusion weight value of the deep convolution feature F _d ;

对于MSSH特征F_m采用逻辑回归模型来预测其初始融合权重值ω_m和减少其特征维度大小，具体过程可由以下公式表示：For the MSSH feature F _m , the logistic regression model is used to predict its initial fusion weight value ω _m and reduce its feature dimension size. The specific process can be expressed by the following formula:

其中，表示降维后的MSSH特征，e_i表示第i个字符候选区域；函数f_τ()表示逻辑回归模型，表示用于训练该特征初始权重值的小数据集；in, Represents the MSSH feature after dimensionality reduction, e _i represents the i-th character candidate region; the function f _τ () represents the logistic regression model, representing the small data set used to train the initial weight value of the feature;

产生融合特征F_s的具体过程可由以下公式表示：The specific process of generating the fusion feature F _s can be expressed by the following formula:

其中，函数f_μ()表示自编码网络，和F_d在维度上保持一致。Among them, the function f _μ () represents the self-encoder network, and F _d are consistent in dimension.

在融合训练过程中，当验证错误率停止降低时，自编码网络的联合训练过程结束。During fused training, the joint training process of autoencoder networks ends when the validation error rate stops decreasing.

合并字符区域的具体方法如下：The specific method of merging character areas is as follows:

假设字符区域为S，计算所有s_i∈S的中心点c_i；Assuming that the character area is S, calculate the center point c _i of all s _i ∈ S;

对于任意两字符区域s_i,s_j∈S，如果二者中心点之间的欧拉距离小于阈值F，则在二者中心点之间连一条直线l_i,；For any two-character area s _i , s _j ∈ S, if the Euler distance between the two center points is less than the threshold F, connect a straight line l _i, between the two center points;

计算所有直线与水平线的夹角α，取所有夹角的众数α_mode；保留夹角在区间[α_mode-π/6,α_mode+π/6]内的直线，其余直线去掉；Calculate the angle α between all the straight lines and the horizontal line, and take the mode α _mode of all the included angles; keep the straight lines whose included angles are in the interval [α _mode -π/6,α _mode +π/6], and remove the other straight lines;

合并有直线连接的字符区域，得到最终的文字区域。The character regions connected by straight lines are merged to obtain the final text region.

与现有技术相比，本发明所达到的有益效果：Compared with the prior art, the beneficial effects achieved by the present invention are as follows:

1、用MSSH特征和深度卷积特征来描述字符候选区域，其中MSSH特征是基于边缘图像的，对低分辨率、图片旋转、仿射形变和多语言多字体变化具有强鲁棒性；深度卷积特征构造过程无需人工干预，对于字符候选区域的外观属性有很强描述能力，在低分辨率、图片旋转和光照变化过程中，图片的整体外观变化不大，同样具有较强的鲁棒性；1. Use MSSH features and deep convolution features to describe character candidate areas, where MSSH features are based on edge images, and are robust to low resolution, image rotation, affine deformation, and multi-language and multi-font changes; depth convolution The product feature construction process does not require manual intervention, and has a strong ability to describe the appearance attributes of character candidate areas. In the process of low resolution, image rotation, and illumination changes, the overall appearance of the image does not change much, and it also has strong robustness. ;

(2)本发明所使用的自编码网络无需人工干预，能够自动融合MSSH特征和深度卷积特征，生成的融合特征可以集合各特征的优点，对低分辨率、图片旋转、仿射形变和复杂背景有强鲁棒性。(2) The self-encoding network used in the present invention can automatically fuse MSSH features and deep convolution features without manual intervention, and the generated fusion features can integrate the advantages of each feature, which is suitable for low resolution, image rotation, affine deformation and complex The background is robust.

(3)本发明涉及的自然场景文字检测的方法效率高，计算的算法复杂度都不高，可以较快完成文字检测流程。(3) The method for character detection in natural scenes involved in the present invention has high efficiency, and the calculation algorithm complexity is not high, and the character detection process can be completed quickly.

附图说明Description of drawings

图1是本发明的流程图；Fig. 1 is a flow chart of the present invention;

图2是图1深度卷积特征的计算流程图；Fig. 2 is a calculation flowchart of the depth convolution feature in Fig. 1;

图3是图1中特征融合的流程图；Fig. 3 is a flowchart of feature fusion in Fig. 1;

图4是待检测文字图片；Figure 4 is a text picture to be detected;

图5是经极值区域过滤由图4得到的字符候选区域的图片；Fig. 5 is the picture of the character candidate region obtained by Fig. 4 through extremum region filtering;

图6是经特征融合由图5得到的字符区域；Fig. 6 is the character region obtained by Fig. 5 through feature fusion;

图7是经字符区域合并由图6得到的文字区域。Fig. 7 is the text area obtained from Fig. 6 by merging character areas.

具体实施方式Detailed ways

本发明提供了一种文字检测方法，通过提取、过滤极值区域，得到字符候选区域，通过MSSH特征和深度卷积特征融合进一步从字符候选区域中筛选出字符区域，最后通过字符区域合并得到文字区域。本发明检测方法具有较强的鲁棒性、检测效率高、可以快速完成文字检测任务。The present invention provides a text detection method, by extracting and filtering the extreme value area, the character candidate area is obtained, the character area is further screened out from the character candidate area through the fusion of MSSH feature and depth convolution feature, and finally the character area is merged to obtain the text area. The detection method of the invention has strong robustness, high detection efficiency, and can quickly complete the text detection task.

下面结合附图对本发明作进一步描述。以下实施例仅用于更加清楚地说明本发明的技术方案，而不能以此来限制本发明的保护范围。The present invention will be further described below in conjunction with the accompanying drawings. The following examples are only used to illustrate the technical solution of the present invention more clearly, but not to limit the protection scope of the present invention.

如图1所示，是本发明的流程图，本发明方法具体包括如下步骤：As shown in Figure 1, it is a flowchart of the present invention, and the inventive method specifically includes the following steps:

步骤一：输入待检测文字图片，提取待检测文字图片的极值区域；Step 1: Input the text picture to be detected, and extract the extreme value area of the text picture to be detected;

首先，将输入的RGB彩色图像转化为灰度图I_gray，红色分量图R值图I_R、绿色分量图G值图I_G和蓝色分量图B值图I_B；First, convert the input RGB color image into a gray scale image I _gray , a red component image R value image I _R , a green component image G value image I _G and a blue component image B value image I _B ;

其次，分别对I_R，I_G，I_B求极值区域A_R，A_G，A_B，极值区域指的是区域外边界的像素值严格大于区域内像素值的区域，以R值图I_R为例，其极值区域A_R可以定义为：Secondly, for I _R , I _G , and I _B , calculate the extreme value regions _AR , A _G , A _B respectively. The extreme value region refers to the region where the pixel value of the outer boundary of the region is strictly greater than the pixel value in the region, and the R value diagram Taking I _R as an example, its extremum region _AR can be defined as:

其中I_R(p)和I_R(q)分别表示I_R中像素点p和q的值，θ表示极值区域的阈值，表示与极值区域A_R相邻但不属于极值区域A_R的像素集合；Among them, I _R (p) and I _R (q) represent the values of pixel points p and q in I _R respectively, and θ represents the threshold value of the extreme value region, Represents a set of pixels adjacent to the extreme value area _AR but not belonging to the extreme value area _AR ;

然后，计算每个极值区域的面积S、周长C、欧拉数E、像素值方差H，其中像素值方差H是通过灰度图I_gray计算得到的，其计算公式为：Then, calculate the area S, perimeter C, Euler number E, and pixel value variance H of each extreme value region, where the pixel value variance H is calculated through the grayscale image I _gray , and its calculation formula is:

其中x表示一个像素点，I_gray(x)表示像素点x的灰度值，a和b分别为极值区域像素个数最多的颜色区间和像素个数次多的颜色区间，n_a和n_b分别表示极值区域中处于颜色区间a和b的像素个数，R_a和R_b分别表示极值区域中处于颜色区间a和b的像素集合，μ_a和μ_b分别表示极值区域中处于颜色区间a和b的像素值的平均值。Wherein x represents a pixel point, I _gray (x) represents the gray value of pixel point x, a and b are respectively the color interval with the largest number of pixels in the extreme value region and the color interval with the largest number of pixels, n _a and n _b respectively represent the number of pixels in the color interval a and b in the extreme value region, R _a and R _b represent the pixel sets in the color interval a and b in the extreme value region respectively, μ _a and μ _b represent the pixels in the extreme value region Average of pixel values in color intervals a and b.

步骤二：对极值区域进行过滤，得到字符候选区域；Step 2: Filter the extreme value area to obtain the character candidate area;

通过每个极值区域的面积S、周长C、欧拉数E、像素值方差H过滤掉多余的极值区域，过滤掉多余的极值区域之后剩下的即为字符候选区域。过滤条件如下：Use the area S, perimeter C, Euler number E, and pixel value variance H of each extremum region to filter out redundant extremum regions, and what remains after filtering out redundant extremum regions is the character candidate region. The filter conditions are as follows:

其中，S₀,C₀,E₀,H₀都是通过大量的字符和非字符区域统计得到的阈值。S₀表示极值区域面积S的阈值，S₀具体数值在区间[80,120内；C₀表示极值区域周长的阈值，C₀具体数值在区间[30,50内；E₀表示极值区域欧拉数的阈值，E₀具体数值在区间[0,1]内；H₀表示极值区域像素值方差的阈值，H₀具体数值在区间[100,200]内。Among them, S ₀ , C ₀ , E ₀ , and H ₀ are all thresholds obtained through statistics of a large number of character and non-character regions. S ₀ indicates the threshold value of the area S of the extreme value area, and the specific value of S ₀ is within the interval [80,120; C ₀ indicates the threshold value of the circumference of the extreme value area, and the specific value of C ₀ is within the interval [30,50]; E ₀ indicates the extreme value area The threshold of the Euler number, the specific value of E ₀ is in the interval [0,1]; H ₀ represents the threshold of the variance of the pixel value in the extreme value area, and the specific value of H ₀ is in the interval [100,200].

图4是是输入的待检测文字图片，如图5所示，是经极值区域过滤由图4得到的字符候选区域的图片。Fig. 4 is an input character picture to be detected. As shown in Fig. 5, it is a picture of the character candidate area obtained by filtering the extremum area from Fig. 4 .

步骤三：计算MSSH特征、深度卷积特征；Step 3: Calculate MSSH features and deep convolution features;

通过SWT算法获取字符候选区域的笔画像素对与笔画线段，步骤如下：Obtain the stroke pixel pairs and stroke line segments of the character candidate area through the SWT algorithm, the steps are as follows:

(1)使用Canny边缘检测算子输出边缘图像；(1) Use the Canny edge detection operator to output the edge image;

(2)计算笔画边缘图像上某一像素p的梯度方向；(2) Calculate the gradient direction of a certain pixel p on the stroke edge image;

(3)跟随由梯度方向所确定的射线r，直到射线与另一个笔画边缘像素点q相遇；(3) Follow the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;

(4)笔画像素对被定义为{p,q},笔画线段被定义为射线r在像素点p与q之间的距离。(4) The stroke pixel pair is defined as {p, q}, and the stroke segment is defined as the distance between the ray r between the pixel points p and q.

其中Canny边缘检测算法的步骤如下：The steps of the Canny edge detection algorithm are as follows:

(1)把字符候选区域转化为灰度图；(1) Convert the character candidate area into a grayscale image;

(2)对得到的灰度图进行高斯滤波；(2) Gaussian filtering is carried out to the obtained grayscale image;

(3)计算梯度的幅值和方向；(3) Calculate the magnitude and direction of the gradient;

(4)对梯度幅值进行非极大值抑制；(4) Perform non-maximum suppression on the gradient amplitude;

(5)用双阈值算法检测和连接边缘。(5) Detect and connect edges with a double-threshold algorithm.

假设{p,q}是字符候选区域内的某一笔画像素对，基于笔画像素对的对称属性描述值的计算步骤如下：Assuming {p,q} is a certain stroke pixel pair in the character candidate area, the calculation steps of the symmetrical attribute description value based on the stroke pixel pair are as follows:

(1)通过以下公式计算笔画像素对{p,q}在灰度值与梯度大小属性上的特征值F_j(p,q)₁：(1) Calculate the feature value F _j (p,q) ₁ of the stroke pixel pair {p,q} on the gray value and gradient size attribute by the following formula:

F_j(p,q)₁＝f_h(|I_j(p)-I_j(q)|)if∈{V,G_m}F _j (p,q) ₁ ＝f _h (|I _j (p)-I _j (q)|)if∈{V,G _m }

其中，I_j(p)表示像素点p在对称属性j上的值，I_j(q)表示像素点q在对称属性j上的值；j表示对称属性的具体类型；；{V,G_m}分别指灰度值与梯度大小属性，函数f_h()表示直方图统计运算。Among them, I _j (p) represents the value of pixel point p on symmetric attribute j, I _j (q) represents the value of pixel point q on symmetric attribute j; j represents the specific type of symmetric attribute; {V, G _m } respectively refer to the gray value and gradient size attributes, and the function f _h () represents the statistical operation of the histogram.

(2)我们通过以下公式计算笔画像素对{p,q}在梯度方向属性上的特征值F_j(p,q)₂：(2) We calculate the feature value F _j (p,q) ₂ of the stroke pixel pair {p,q} on the gradient direction attribute by the following formula:

F_j(p,q)₂＝f_h(cos＜I_j(p),_j(q)＞)j＝G_o F _j (p,q) ₂ ＝f _h (cos<I _j (p), _j (q)>)j=G _o

其中，G_o指梯度方向属性，cos<>表示反余弦函数，函数f_h()表示直方图统计运算。Among them, G _o refers to the gradient direction attribute, cos<> represents the arccosine function, and the function f _h () represents the histogram statistical operation.

假设s表示对于字符候选区域内的笔画线段集合，基于笔画线段集合的对称属性描述值的计算步骤如下：Assuming s means that for the stroke line segment set in the character candidate area, the calculation steps of the symmetric attribute description value based on the stroke line segment set are as follows:

(1)通过以下公式计算笔画像素对在梯度方向属性上的特征值F_j(s)：(1) Calculate the eigenvalue F _j (s) of the stroke pixel pair on the gradient direction attribute by the following formula:

F_j(s)＝f_h(f_ξ(s,j))j∈{Sw,Md,Pa}F _j (s)=f _h (f _ξ (s,j))j∈{Sw,Md,Pa}

其中，函数f_h()表示直方图统计运算，{Sw,Md,Pa}表示同类对称属性，包括笔画宽度值Sw，笔画序列值分布Md和低频模式属性Pa，其中函数f_ξ(s,j)可以定义为：Among them, the function f _h () represents the statistical operation of the histogram, {Sw, Md, Pa} represents the same kind of symmetrical attributes, including the stroke width value Sw, the stroke sequence value distribution Md and the low-frequency pattern attribute Pa, where the function f _ξ (s, j ) can be defined as:

其中||||指欧式距离，D_s和M_s分别表示笔画线段集合s中，笔画线段所含像素的灰度值方差和平均值，表示Haar小波变换，k表示小波转化层数，n_l表示最高尺度层数，ω_k是预定义的权重参数。其中，n_l的具体数值为1。当k＝0时，ω_k的具体数值为0.1；当k＝1时，ω_k的具体数值为0.3；当k＝2时，ω_k的具体数值为0.5。Where |||| refers to the Euclidean distance, D _s and M _s respectively represent the variance and average value of the gray value of the pixels contained in the stroke line segment set s, Represents Haar wavelet transform, k represents the number of wavelet transformation layers, n _l represents the highest scale layer, and ω _k is a predefined weight parameter. Wherein, the specific value of n _l is 1. When k=0, the specific value of ω _k is 0.1; when k=1, the specific value of ω _k is 0.3; when k=2, the specific value of ω _k is 0.5.

(2)对于笔画宽度值、笔画序列值分布和低频模式属性上的对称特征描述值，将某一文字候选区域的上述属性值等比例缩放至0到1之间。(2) For stroke width value, stroke sequence value distribution, and symmetric feature description value on low-frequency mode attributes, scale the above attribute values of a certain text candidate area to a scale between 0 and 1.

F_m(e_i)＝[F_j|＝V,G_m,o,w,Md,a]F _m (e _i )=[F _j |=V,G _m ,o,w,Md,a]

其中：F_m(e_i)表示MSSH特征向量的值；[]表示向量连接操作；e_i表示第i个文字候选区域；F_j表示对称属性所对应特征向量；j表示对称属性的具体类型；V表示灰度值；G_m表示梯度大小属性；G_o表示梯度方向属性；Sw表示笔画宽度值；Md表示笔画序列值分布；Pa表示低频模式属性。Among them: F _m (e _i ) represents the value of the MSSH feature vector; [] represents the vector connection operation; e _i represents the i-th text candidate region; F _j represents the feature vector corresponding to the symmetric attribute; j represents the specific type of the symmetric attribute; V represents the gray value; G _m represents the gradient size attribute; G _o represents the gradient direction attribute; Sw represents the stroke width value; Md represents the stroke sequence value distribution; Pa represents the low-frequency pattern attribute.

如图2所示，是深度卷积特征的计算流程图，计算深度卷积特征的方法如下：As shown in Figure 2, it is a calculation flow chart of deep convolution features. The method of calculating deep convolution features is as follows:

一阶段构造方法如下：The one-stage construction method is as follows:

f(a,b,k)＝max(0,g(a,b,k))f(a,b,k)=max(0,g(a,b,k))

F_d＝W·X+BF _d ＝W·X+B

对模型的使用共分两个过程，分别为训练过程与测试过程。其中训练过程用于确定未知参数h_k，W与B的取值，测试过程用于生成字符候选区域的深度卷积特征F_d。The use of the model is divided into two processes, namely the training process and the testing process. The training process is used to determine the values of the unknown parameters h _k , W and B, and the testing process is used to generate the deep convolution feature F _d of the character candidate area.

在训练过程中，每个用于训练的字符候选区域均被赋予标签。当标签为0时，表示该字符候选区域不是字符区域；当标签为1时，表示该字符候选区域是字符区域。深度卷积特征F_d将通过全连接的方式连接到二维标签向量，其值分别为0与1。在训练过程中，，当神经模型为字符候选区域预测的标签值不再改变时，训练结束。训练结束时的所得到h_k，W与b分别作为h_k，W与b的固定取值。During the training process, each character candidate region used for training is assigned a label. When the label is 0, it means that the character candidate area is not a character area; when the label is 1, it means that the character candidate area is a character area. The depthwise convolutional feature F _d will be connected to the two-dimensional label vector through full connection, and its values are 0 and 1 respectively. During the training process, when the label value predicted by the neural model for the character candidate region no longer changes, the training ends. The obtained h _k , W and b at the end of the training are respectively used as fixed values of h _k , W and b.

在测试过程中，通过卷积神经网络一二三阶段所生成的F_d将作为字符候选区域的深度卷积特征。During the test, the F _d generated by the first, second and third stages of the convolutional neural network will be used as the deep convolution feature of the character candidate area.

第三阶段中最大池化层的输出为128维的深度卷积特征F_d，该特征在卷积神经网络训练时，将被连接至全连接层。全连接层将输出该字符候选区域是否为文字或非文字标签。The output of the maximum pooling layer in the third stage is the 128-dimensional deep convolutional feature F _d , which will be connected to the fully connected layer during the training of the convolutional neural network. The fully connected layer will output whether the character candidate region is a text or non-text label.

步骤四：通过自编码神经网络将MSSH特征、深度卷积特征融合，得到融合特征；Step 4: Merge MSSH features and deep convolution features through self-encoding neural network to obtain fusion features;

如图3所示，是特征融合的流程图，包括如下步骤：As shown in Figure 3, it is a flow chart of feature fusion, including the following steps:

首先，在融合过程中，使用训练后卷积神经网络模型的权重ω_d作为深层卷积特征F_d的初始融合权重值；First, in the fusion process, the weight ω _d of the trained convolutional neural network model is used as the initial fusion weight value of the deep convolutional feature F _d ;

然后，对于MSSH特征F_m，采用逻辑回归模型来预测其初始融合权重值ω_m和减少其特征维度大小，具体过程可由以下公式表示：Then, for the MSSH feature F _m , use the logistic regression model to predict its initial fusion weight value ω _m and reduce its feature dimension size. The specific process can be expressed by the following formula:

其中，函数f_τ()表示逻辑回归模型，表示降维后的MSSH特征，表示用于训练该特征初始权重值的小数据集。Among them, the function f _τ () represents the logistic regression model, Represents the MSSH feature after dimensionality reduction, and represents the small data set used to train the initial weight value of the feature.

最后，基于自编码网络融合MSSH特征与深度卷积特征，产生融合特征F_s的具体过程可由以下公式表示：Finally, based on the self-encoder network fusion of MSSH features and deep convolution features, the specific process of generating fusion features F _s can be expressed by the following formula:

其中，函数f_μ()表示自编码网络，表示降维后的MSSH特征，和F_d在维度上保持一致。Among them, the function f _μ () represents the self-encoder network, Represents the MSSH feature after dimensionality reduction, and F _d are consistent in dimension.

步骤五：根据融合特征进一步从字符候选区域中筛选出字符区域；Step 5: further filter out the character region from the character candidate region according to the fusion feature;

将字符候选区域的融合特征输入预先训练好的逻辑回归分类器，判断该字符候选区域是不是真正的字符区域。Input the fusion feature of the character candidate region into the pre-trained logistic regression classifier to judge whether the character candidate region is a real character region.

其中逻辑回归分类器的训练步骤如下：The training steps of the logistic regression classifier are as follows:

(1)取通用场景文字检测数据集ICDAR 2013scene数据集，按照上述步骤计算该数据集的所有候选字符区域的融合特征，并以此作为训练集。(1) Take the ICDAR 2013 scene data set, a general scene text detection data set, and calculate the fusion features of all candidate character regions of the data set according to the above steps, and use it as a training set.

(2)把该训练集输入逻辑回归算法进行二分类问题训练。(2) Input the training set into the logistic regression algorithm for binary classification problem training.

如图6所示，是经特征融合由图5得到的字符区域。As shown in Figure 6, it is the character area obtained from Figure 5 through feature fusion.

步骤六：合并所有字符区域得到最终的文字区域。Step 6: Merge all character regions to obtain the final text region.

首先，对于字符区域S，计算所有s_i∈S的中心点c_i；First, for the character area S, calculate the center points c _i of all s _i ∈ S;

其次，对于任意字符区域s_i,s_j∈S，如果中心点c_i和c_j之间的欧拉距离小于阈值F，则在中心点c_i和c_j之间连一条直线l_i,j；优选的，F取值为5。Secondly, for any character area s _i , s _j ∈ S, if the Euler distance between the center points ci and c _j is less than the threshold F, connect a straight line l _i,j _between the center points _ci and c _j ; Preferably, the value of F is 5.

然后，计算所有直线l与水平线的夹角α，取所有夹角的众数α_mode。保留夹角在区间[α_mode-π/6,α_mode+π/6]内的直线，其余直线去掉。如图7所示，是经字符区域合并由图6得到的文字区域。Then, calculate the included angle α between all straight lines l and the horizontal line, and take the mode α _mode of all included angles. Keep the straight lines whose included angles are in the interval [α _mode -π/6,α _mode +π/6], and remove the rest of the straight lines. As shown in FIG. 7, it is the character area obtained by merging the character areas from FIG. 6.

以上所述仅是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明技术原理的前提下，还可以做出若干改进和变形，这些改进和变形也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that for those of ordinary skill in the art, without departing from the technical principle of the present invention, some improvements and modifications can also be made. It should also be regarded as the protection scope of the present invention.

Claims

1. a text detection method, is characterized in that, the method comprises the steps:

Extract the extreme value area of the text image to be detected, filter the extreme value area, and obtain the character candidate area;

Calculate MSSH features and deep convolution features, and fuse MSSH features and deep convolution features through self-encoding neural network to obtain fusion features;

Further filter out the character region from the character candidate region according to the fusion feature;

Merge all character regions to get the final text region.

2. text detection method according to claim 1, is characterized in that, the concrete method of extracting extremum region is as follows:

Convert the text image to be detected into a grayscale image I _gray , an R value image I _R , a G value image I _G and a B value image I _B ;

Calculate the extreme value area for I _R , I _G , and I _B respectively, as follows:

The extreme value area A _R of the R value graph I _R is defined as:

<mrow><msub><mi>I</mi><mi>R</mi></msub><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow><mo>></mo><mi>&theta;</mi><mo>&GreaterEqual;</mo><msub><mi>I</mi><mi>R</mi></msub><mrow><mo>(</mo><mi>q</mi><mo>)</mo></mrow><mo>,</mo><mo>&ForAll;</mo><mi>p</mi><mo>&Element;</mo><msub><mi>A</mi><mi>R</mi></msub><mo>,</mo><mi>q</mi><mo>&Element;</mo><mo>&part;</mo><msub><mi>A</mo>mi><mi>R</mi></msub></mrow>

Among them, I _R (p) represents the value of pixel point p in the R value map; I _R (q) represents the value of pixel point q in the R value map; θ represents the threshold value of the extreme value region; Represents a set of pixels adjacent to the extreme value area _AR but not belonging to the extreme value area _AR ;

The extreme value region A _G of the G value graph I _G is defined as:

<mrow><msub><mi>I</mi><mi>G</mi></msub><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow><mo>></mo><mi>&theta;</mi><mo>&GreaterEqual;</mo><msub><mi>I</mi><mi>G</mi></msub><mrow><mo>(</mo><mi>q</mi><mo>)</mo></mrow><mo>,</mo><mo>&ForAll;</mo><mi>p</mi><mo>&Element;</mo><msub><mi>A</mi><mi>G</mi></msub><mo>,</mo><mi>q</mi><mo>&Element;</mo><mo>&part;</mo><msub><mi>A</mo>mi><mi>G</mi></msub></mrow>

Among them, I _G (p) represents the value of pixel point p in the G-value map; I _G (q) represents the value of pixel point q in the G-value map; θ represents the threshold value of the extreme value region, Represents a set of pixels adjacent to the extremum area _AG but not belonging to the extremum area _AG ;

The extreme value area A _B of the B value graph I _B is defined as:

<mrow><msub><mi>I</mi><mi>B</mi></msub><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow><mo>></mo><mi>&theta;</mi><mo>&GreaterEqual;</mo><msub><mi>I</mi><mi>B</mi></msub><mrow><mo>(</mo><mi>q</mi><mo>)</mo></mrow><mo>,</mo><mo>&ForAll;</mo><mi>p</mi><mo>&Element;</mo><msub><mi>A</mi><mi>B</mi></msub><mo>,</mo><mi>q</mi><mo>&Element;</mo><mo>&part;</mo><msub><mi>A</mo>mi><mi>B</mi></msub></mrow>

Among them, I _B (p) represents the value of pixel point p in the B-value map; I _B (q) represents the value of pixel point q in the B-value map; θ represents the threshold value of the extreme value region, Indicates the set of pixels adjacent to the extremum region A _B but not belonging to the extremum region A _B.

3. text detection method according to claim 2, is characterized in that, the method for obtaining character candidate region is as follows:

Calculate the area S, perimeter C, Euler number E, and pixel value variance H of each extreme value region, where the pixel value variance H is calculated through the grayscale image I _gray , and its calculation formula is:

<mrow><mi>H</mi><mo>=</mo><mfrac><mrow><msub><mi>n</mi><mi>a</mi></msub><mo>&CenterDot;</mo><msub><mi>&Sigma;</mi><mrow><mi>x</mi><mo>&Element;</mo><msub><mi>R</mi><mi>a</mi></msub></mrow></msub><msup><mrow><mo>(</mo><msub><mi>I</mi><mrow><mi>g</mi><mi>r</mi><mi>a</mi><mi>y</mi></mrow></msub><mo>(</mo><mi>x</mi><mo>)</mo><mo>-</mo><msub><mi>&mu;</mi><mi>a</mi></msub><mo>)</mo></mrow><mn>2</mn></msup><mo>+</mo><msub><mi>n</mi><mi>b</mi></msub><mo>&CenterDot;</mo><msub><mi>&Sigma;</mi><mrow><mi>x</mi><mo>&Element;</mo><msub><mi>R</mi><mi>b</mi></msub></mrow></msub><msup><mrow><mo>(</mo><msub><mi>I</mi><mrow><mi>g</mi><mi>r</mi><mi>a</mi><mi>y</mi></mrow></msub><mo>(</mo><mi>x</mi><mo>)</mo><mo>-</mo><msub><mi>&mu;</mi><mi>b</mi></msub><mo>)</mo></mrow><mn>2</mn></msup></mrow><mrow><msub><mi>n</mi><mi>a</mi></msub><mo>+</mo><msub><mi>n</mi><mi>b</mi></msub></mrow></mfrac></mrow>

Wherein: x represents a pixel point; I _gray (x) represents the grayscale value of pixel point x; a represents the color interval with the largest number of pixels in the extreme value region; b represents the color interval with the largest number of pixels in the extreme value region; n _a represents the number of pixels in the color interval a in the extreme value region; n _b represents the number of pixels in the color interval b in the extreme value region; R _a represents the set of pixels in the color interval a in the extreme value region; R _b represents the extreme value region The set of pixels in the color interval b in the value area; μ _a represents the average value of the pixel values in the color interval a in the extreme value area; μ _b represents the average value of the pixel values in the color interval b in the extreme value area;

Use the area S, perimeter C, Euler number E, and pixel value variance H of each extreme value area to filter out redundant extreme value areas. After filtering out redundant extreme value areas, the rest is the character candidate area. Filter conditions as follows:

<mfenced open = "{" close = ""><mtable><mtr><mtd><mrow><mi>S</mi><mo>&le;</mo><msub><mi>S</mi><mn>0</mn></msub></mrow></mtd></mtr><mtr><mtd><mrow><mi>C</mi><mo>&le;</mo><msub><mi>C</mi><mn>0</mn></msub></mrow></mtd></mtr><mtr><mtd><mrow><mi>E</mi><mo>&le;</mo><msub><mi>E</mi><mn>0</mn></msub></mrow></mtd></mtr><mtr><mtd><mrow><mi>H</mi><mo>&GreaterEqual;</mo><msub><mi>H</mi><mn>0</mn></msub></mrow></mtd></mtr></mtable></mfenced>

Among them, S ₀ represents the threshold value of the area S of the extreme value region; C ₀ represents the threshold value of the circumference of the extreme value region; E ₀ represents the threshold value of the Euler number in the extreme value region; H ₀ represents the threshold value of the variance of the pixel value in the extreme value region.

4. text detection method according to claim 1, is characterized in that, the concrete method of calculating MSSH feature is as follows:

Obtain stroke pixel pairs and stroke line segments of the character candidate region;

Calculate the symmetrical feature description value of a certain stroke pixel pair in the gray value and gradient attribute of the character candidate area;

Calculate the symmetrical feature description of all stroke line segments in the character candidate area on the stroke width value, stroke sequence value distribution and low-frequency pattern attributes;

The eigenvalues of different symmetric attributes are connected to form the MSSH feature, and the specific formula is expressed as follows:

F _m (e _i )=[F _j |j=V, G _m , G _o , Sw, Md, Pa]

Among them: F _m (e _i ) represents the value of the MSSH feature vector; [] represents the vector connection operation; e _i represents the ith character candidate region; F _j represents the feature vector corresponding to the symmetric attribute; j represents the specific type of the symmetric attribute; V represents the gray value; G _m represents the gradient size attribute; G _o represents the gradient direction attribute; Sw represents the stroke width value; Md represents the stroke sequence value distribution; hu a represents the low-frequency mode attribute.

5. text detection method according to claim 4, is characterized in that, the specific method that obtains the stroke pixel pair of character candidate area and stroke line segment is as follows:

Use the Canny edge detection operator to output the edge image;

Calculate the gradient direction of a certain pixel point p on the stroke edge image;

Follow the ray r determined by the gradient direction until the ray meets another stroke edge pixel point q;

A stroke pixel pair is defined as {p, q}, and a stroke segment is defined as the distance of ray r between pixel points p and q.

6. the text detection method according to claim 4, is characterized in that, the concrete method of calculating depth convolution feature is as follows:

Resize the character candidate area to a 64×64 pixel value;

Construct a three-stage convolutional neural network model;

The one-stage construction method is as follows:

In the first stage, two convolutional layers and a maximum pooling layer are used sequentially. Among them, the convolutional layer uses 32 convolution kernels with a size of 3×3, and 1 pixel is the displacement offset, which is convolved with the character candidate area. Product operation, the specific formula is as follows:

<mrow><mi>g</mi><mrow><mo>(</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>k</mi><mo>)</mo></mrow><mo>=</mo><munder><mo>&Sigma;</mo><mrow><mi>m</mi><mo>,</mo><mi>n</mi><mo>&Element;</mo><mo>{</mo><mo>-</mo><mn>1</mn><mo>,</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo>}</mo></mrow></munder><msub><mi>e</mi><mi>i</mi></msub><mrow><mo>(</mo><mi>a</mi><mo>+</mo><mi>m</mi><mo>,</mo><mi>b</mi><mo>+</mo><mi>n</mi><mo>)</mo></mrow><msub><mi>h</mi><mi>k</mi></msub><mrow><mo>(</mo><mi>m</mi><mo>,</mo><mi>n</mi><mo>)</mo></mrow></mrow>

Among them, g(a, b, k) represents the value of the pixel value of row a and column b in the character candidate area after the kth convolution operation; ei(a+m, b+n) represents the i-th character candidate area In the (a+m) row and the (b+n) column pixel value; m represents the row offset of the pixel, n represents the column offset of the pixel, and its value set is {-1, 0, 1}, h _k represents the kth convolution kernel; after each convolution layer operation, the activation value is calculated using a nonlinear activation function, the specific formula is as follows:

f(a,b,k)=max(0,g(a,b,k))

f(a, b, k) represents the activation value of the pixel value of row a and column b in the character candidate area after the k convolution operation; max() represents a large-value function;

The activation value is then passed to the maximum pooling layer, which uses 2 pixels as the stride and takes the maximum value in the 2×2 spatial neighborhood as the output value;

The structure of the second stage is the same as that of the first stage;

The three-stage sequence uses three convolutional layers, a maximum pooling layer, and a fully connected layer. The fully connected layer connects the output of the maximum pooling layer into a one-dimensional vector as input, and controls the output to 128 dimensions. The formula can be expressed as follows:

F _d ＝W·X+B

Among them: F _d is the generated 128-dimensional deep convolution feature, X is the one-dimensional vector obtained after connecting the output of the maximum pooling layer, W is the weight matrix, and B is the offset vector;

The convolutional neural network model is trained and tested, the values of unknown parameters h _k , W and B are determined through training, and the F _d generated by the test is used as the deep convolution feature of the character candidate area.

7. The text detection method according to claim 6, wherein the method for obtaining fusion features is as follows:

Use the weight ω _d of the convolutional neural network model after training as the initial fusion weight value of the deep convolution feature F _d ;

For the MSSH feature F _m , the logistic regression model is used to predict its initial fusion weight value ω _m and reduce its feature dimension size. The specific process can be expressed by the following formula:

<mrow><mo>{</mo><mover><msub><mi>F</mi><mi>m</mi></msub><mo>~</mo></mover><mrow><mo>(</mo><msub><mi>e</mi><mi>i</mi></msub><mo>)</mo></mrow><mo>,</mo><msub><mi>&omega;</mi><mi>m</mi></msub><mo>}</mo><mo>=</mo><msub><mi>f</mi><mi>&tau;</mi></msub><mrow><mo>(</mo><msub><mi>F</mi><mi>m</mi></msub><mo>;</mo><mi>D</mi><mo>)</mo></mrow></mrow>

in, Represents the MSSH feature after dimensionality reduction, e _i represents the i-th character candidate region; the function f _τ () represents the logistic regression model, and D represents the small data set used to train the initial weight value of the feature;

The specific process of generating the fusion feature _FS can be expressed by the following formula:

<mrow><msub><mi>F</mi><mi>s</mi></msub><mrow><mo>(</mo><msub><mi>e</mi><mi>i</mi></msub><mo>)</mo></mrow><mo>=</mo><msub><mi>f</mi><mi>&mu;</mi></msub><mrow><mo>(</mo><msub><mi>&omega;</mi><mi>d</mi></msub><mo>,</mo><msub><mi>F</mi><mi>d</mi></msub><mo>(</mo><msub><mi>e</mi><mi>i</mi></msub><mo>)</mo><mo>,</mo><msub><mi>&omega;</mi><mi>m</mi></msub><mo>,</mo><mover><msub><mi>F</mi><mi>m</mi></msub><mo>~</mo></mover><mo>(</mo><msub><mi>e</mi><mi>i</mi></msub><mo>)</mo><mo>)</mo></mrow></mrow>

Among them, the function f _μ () represents the self-encoder network, and F _d are consistent in dimension.

8. The text detection method according to claim 7, wherein during the fusion training process, when the verification error rate stops decreasing, the joint training process of the autoencoder network ends.

9. text detection method according to claim 1, is characterized in that, the concrete method of merging character area is as follows:

Assuming that the character area is S, calculate the center point c _i of all s _i ∈ S;

For any two-character area s _i , s _j ∈ S, if the Euler distance between the two center points is less than the threshold F, connect a straight line l _{i, i} between the two center points;

Calculate the angle α between all straight lines and the horizontal line, and take the mode α _mode of all the included angles; keep the straight lines with included angles in the interval [α _mode -π/6, α _mode +π/6], and remove the rest of the straight lines;

The character regions connected by straight lines are merged to obtain the final text region.