CN110533088A - Scene text language identification method based on discriminative convolutional neural networks - Google Patents
Scene text language identification method based on discriminative convolutional neural networks
- Publication number
- CN110533088A CN110533088A CN201910759386.7A CN201910759386A CN110533088A CN 110533088 A CN110533088 A CN 110533088A CN 201910759386 A CN201910759386 A CN 201910759386A CN 110533088 A CN110533088 A CN 110533088A
- Authority
- CN
- China
- Prior art keywords
- neural networks
- convolutional neural
- differentiated
- feature
- cnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a scene text language identification method based on discriminative convolutional neural networks. Traditional image classification methods analyze an image as a whole, lack an explicit capture of local details, and therefore handle this problem poorly. To solve this technical problem, a model named the "discriminative convolutional neural network" and an associated method are proposed. Through a discriminative clustering algorithm, the method learns a set of "discriminative patterns" — local features with discriminative power — from the deep convolutional features of an image. Finally, the global representation of the image is classified with two fully connected layers. The invention significantly improves on traditional image classification methods, which suffer from shortcomings when recognition is performed under varying conditions such as font, noise, and illumination.
Description
Technical field
The invention belongs to the technical field of character recognition with deep learning, and in particular relates to a scene text language identification method based on discriminative convolutional neural networks.
Background technique
Multilingual environments are ubiquitous in modern society. In airports, railway stations, hotels, and other public places, one frequently encounters text in several languages appearing together. The language itself is important information. Moreover, different written languages have very different characteristics, and processing them often requires models and methods targeted at the specific language class. Therefore, in a multilingual environment, language identification is of great significance.
Language identification is an important component of traditional optical character recognition (OCR) systems. Because of its importance in multilingual environments, the problem has been extensively studied over the past few decades. In recent years, with the continuous growth of multimedia data, especially pictures captured by mobile devices, the importance of scene text recognition has become more prominent and has triggered a surge of research in the computer vision community. Accordingly, language identification for scene text has also become indispensable.
Previous language identification algorithms were designed mainly for document images or video captions, in which both background and foreground are relatively clean and noise interference is small. Methods based on binarization, region segmentation, morphological analysis, and the like are therefore often used in such work. However, when applied to scene text, such methods are usually inadequate because they cannot cope with the complexity and variability of factors such as background, font, noise, and illumination conditions.
The task of scene text language identification is to predict the language class of the text in a given picture (English, Chinese, Greek, etc.). The problem can naturally be treated as an image classification problem. In recent years, convolutional neural networks (CNNs), with their powerful learning and generalization abilities, have provided good solutions for image classification. Nonetheless, language identification still poses unique challenges because of its own characteristics, and traditional image classification methods analyze an image as a whole, lack an explicit capture of details, and cannot handle this problem well.
Summary of the invention
To solve the above technical problem, the invention proposes a scene text language identification method based on discriminative convolutional neural networks, which significantly improves on traditional image classification methods that suffer from shortcomings when recognition is performed under varying conditions such as font, noise, and illumination.
The technical scheme adopted by the invention is a scene text language identification method based on discriminative convolutional neural networks, characterized by comprising the following steps:
Step 1: build the language identification convolutional neural network model Disc CNN;
Step 1.1: obtain a data set consisting of several pictures, each picture containing text in one language, with several written languages represented overall; divide the data set into a training set and a test set;
Step 1.2: crop the pictures in the training set to a preset size; from the cropped scene text image I, extract a convolutional feature hierarchy {h_l} (l = 1, …, L) with a convolutional neural network, where h_l denotes the feature map extracted at the l-th feature layer and L is the number of feature layers;
Step 1.3: extract dense local features from the convolutional feature hierarchy using the convolutional neural network;
Step 1.4: apply discriminative clustering to the dense local features extracted by the convolutional neural network, learning a discriminative codebook;
Step 1.5: encode each dense local feature with the discriminative codebook, then fuse the encoding results of the dense local features to obtain a fixed-dimension vector representation of the full image, denoted the full-image descriptor;
Step 1.6: build the language identification convolutional neural network model Disc CNN;
Disc CNN contains one hidden layer with ReLU activation, and the output layer contains C nodes; the output layer activations are passed through SoftMax to obtain probability values over the C language classes; the discriminative mid-level encoding used by Disc CNN is learned as a linear operation followed by a ReLU, and is fine-tuned during end-to-end training of the overall model; the complete Disc CNN is an end-to-end trainable model, optimized globally with gradient descent;
Step 1.7: parameter migration and end-to-end optimization;
Through the backpropagation algorithm of the neural network, the error gradients of the later stages are fed back to the earlier stages, and gradient descent is then used to adjust the parameters of all stages simultaneously;
Step 2: use the language identification convolutional neural network model Disc CNN to perform text language identification on pictures;
The Disc CNN network contains four convolutional layers in total: conv1, conv2, conv3, conv4; the heights of the output feature maps are 15, 7, 3, and 1, respectively, and the widths are determined by the input image width; discriminative patch discovery and the mid-level representation are carried out on the feature maps output by conv2, conv3, and conv4.
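The stated feature-map heights (15, 7, 3, 1) can be traced through the four conv + pool stages. The sketch below assumes 3x3 convolutions with stride 1 and 1-pixel padding (which preserve height, as described in step 1.2) and non-overlapping 2x2 max pooling with floor division; the input height of 30 is an assumption that reproduces the sequence in the patent, which does not state the input height explicitly.

```python
def feature_map_heights(input_height, num_layers=4):
    """Trace the feature-map height through successive conv + 2x2 max-pool stages.

    A 3x3 convolution with stride 1 and 1-pixel padding preserves height,
    so only the 2x2 max pooling (with floor division) halves it per layer.
    """
    heights = []
    h = input_height
    for _ in range(num_layers):
        h = h // 2  # 2x2 max pooling without padding floors the height
        heights.append(h)
    return heights

# An assumed input height of 30 reproduces the heights stated in the patent:
print(feature_map_heights(30))  # -> [15, 7, 3, 1]
```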
The beneficial effects of the invention are: addressing the language identification problem, the invention proposes an image representation that combines deep convolutional features, a discriminative mid-level representation, and a spatial pyramid method, models this representation as an end-to-end trainable neural network model, and further optimizes the model parameters through end-to-end training.
Detailed description of the invention
Fig. 1 is the general flowchart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of detail differences in scene text recognition in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the Disc CNN network structure in the embodiment of the present invention;
Fig. 4 is a schematic diagram of test text recognition pictures in the embodiment of the present invention.
Specific embodiment
To make it easier for those of ordinary skill in the art to understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described herein are only intended to illustrate and explain the present invention, not to limit it.
The task of scene text language identification is to predict the language class of the text in a given picture (English, Chinese, Greek, etc.). The problem can naturally be treated as an image classification problem. In recent years, convolutional neural networks (CNNs), with their powerful learning and generalization abilities, have provided good solutions for image classification. Nonetheless, language identification still poses unique challenges because of its own characteristics, and traditional image classification methods analyze an image as a whole, lack an explicit capture of details, and cannot handle this problem well. To solve this technical problem, a model named the "discriminative convolutional neural network" and an associated method are proposed, which significantly improve on traditional image classification methods that suffer from shortcomings when recognition is performed under varying conditions such as font, noise, and illumination. Through a discriminative clustering algorithm, the method learns from the deep convolutional features of an image a set of "discriminative patterns", i.e., local features with discriminative power. Finally, the global representation of the image is classified with two fully connected layers.
Referring to Fig. 1, the scene text language identification method based on discriminative convolutional neural networks provided by the invention comprises the following steps:
Step 1: build the language identification convolutional neural network model Disc CNN;
Step 1.1: obtain a data set consisting of several pictures, each picture containing text in one language, with several written languages represented overall; divide the data set into a training set and a test set;
See Fig. 2 for a schematic diagram of detail differences in scene text recognition in this embodiment.
Step 1.2: crop the pictures in the training set to a preset size; from the cropped scene text image I, extract a convolutional feature hierarchy {h_l} (l = 1, …, L) with a convolutional neural network, where h_l denotes the feature map extracted at the l-th feature layer and L is the number of feature layers;
In this embodiment, the feature hierarchy is extracted by a pre-trained convolutional neural network (CNN);
The input picture is first scaled to a fixed height and then fed into the CNN; the CNN extracts features by applying a series of convolution and max-pooling operations to the input picture;
Let k_l and b_l denote the convolution kernel and bias, respectively; the feature extraction process is then:
h_l = pool_max(σ(h_{l-1} * k_l + b_l));
where pool_max denotes the max-pooling operation, σ is any nonlinear activation function, * denotes the convolution operation, h_0 denotes the initial input picture, and the index l of h_l indexes the layer-l encoding of the input picture; the stride of all convolution operations is set to 1, the convolution kernel size is 3x3, and the boundary padding size is set to 1x1.
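The layer update h_l = pool_max(σ(h_{l-1} * k_l + b_l)) can be sketched directly with numpy. This is a minimal illustration, not the patent's implementation: sizes and kernel values are toy assumptions, and σ is taken to be ReLU.

```python
import numpy as np

def conv3x3_same(x, k, b):
    """2-D convolution with a 3x3 kernel, stride 1, and 1-pixel zero padding,
    so the output has the same height and width as the input (as in step 1.2)."""
    H, W = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k) + b
    return out

def max_pool2(x):
    """Non-overlapping 2x2 max pooling (trailing odd rows/columns are dropped)."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def layer(h_prev, k, b):
    """One feature-extraction step: h_l = pool_max(relu(h_{l-1} * k_l + b_l))."""
    return max_pool2(np.maximum(0, conv3x3_same(h_prev, k, b)))

h0 = np.random.rand(30, 64)          # toy grayscale input scaled to a fixed height
k1, b1 = np.random.randn(3, 3), 0.1  # toy kernel and bias
h1 = layer(h0, k1, b1)
print(h1.shape)  # -> (15, 32): padding preserves size, pooling halves it
```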
Step 1.3: extract dense local features from the convolutional feature hierarchy using the convolutional neural network;
Convolving a convolution kernel with a given image extracts one kind of textual feature from the image; different convolution kernels extract different image features. In general, a convolutional layer is computed according to the formula:
σ(imgMat ∘ W + b);
where σ denotes the activation function, imgMat denotes the grayscale image matrix, W denotes the convolution kernel, ∘ denotes the convolution operation, and b denotes the bias.
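Each spatial position of a convolutional feature map then serves as one dense local feature, as step 1.5 later assumes. A minimal sketch with toy dimensions (the patent does not fix these values):

```python
import numpy as np

# A feature map of shape (h, w, m) yields one m-dimensional local descriptor
# per spatial position, i.e. h * w dense local features in total.
h, w, m = 7, 20, 256                 # toy sizes for illustration only
feature_map = np.random.rand(h, w, m)

descriptors = feature_map.reshape(-1, m)   # row i*w + j is the descriptor at (i, j)
print(descriptors.shape)                   # -> (140, 256)

# The descriptor at row i, column j of the map:
i, j = 2, 5
assert np.array_equal(descriptors[i * w + j], feature_map[i, j])
```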
Step 1.4: apply discriminative clustering to the dense local features extracted by the convolutional neural network, learning a discriminative codebook (discriminative codebook);
In this embodiment, the specific implementation of step 1.4 includes the following sub-steps:
Step 1.4.1: apply discriminative clustering to the dense local features extracted by the convolutional neural network;
Discriminative clustering is carried out separately within each language class and each feature level; the discovery set is the set of local features extracted at level l from pictures of class c, and the natural set is the set of local features extracted from pictures of all other classes;
Step 1.4.2: learn a discriminative codebook;
The discriminative codebook consists of a group of linear classifiers; each classifier is equivalent to a detector for a discriminative patch, and its output indicates the response of a discriminative pattern; distinguishing languages requires capturing local image regions with discriminative power, and such a region is also called a discriminative patch, i.e., a kind of discriminative pattern.
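One way to realize a single codebook entry — a linear classifier that responds to patches from the discovery set but not the natural set — is ordinary logistic regression on the two patch sets. This is a hedged sketch of the idea, not the patent's exact discriminative clustering procedure; all sizes and data are toy values.

```python
import numpy as np

def learn_discriminative_codeword(discovery, natural, steps=200, lr=0.1):
    """Learn one codebook entry: a linear detector (w, b) trained to respond
    positively on the discovery set (patches of the target language) and
    negatively on the natural set (patches of all other languages).
    Plain logistic regression via gradient descent; a simplified sketch."""
    X = np.vstack([discovery, natural])
    y = np.concatenate([np.ones(len(discovery)), np.zeros(len(natural))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid response
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 0.5, size=(50, 8))   # toy "discovery set" features
neg = rng.normal(-1.0, 0.5, size=(50, 8))   # toy "natural set" features
w, b = learn_discriminative_codeword(pos, neg)
# The learned detector responds more strongly to discovery-set patches:
print((pos @ w + b).mean() > (neg @ w + b).mean())  # -> True
```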
Step 1.5: encode each dense local feature with the discriminative codebook, then fuse the encoding results of the dense local features to obtain a fixed-dimension vector representation of the full image, denoted the full-image descriptor;
In this embodiment, suppose the dense local feature map has shape w1 × h1 × n1, where w1, h1, n1 denote the three dimensions of the feature map shape; each position of the feature map corresponds to a local descriptor, so w × h local descriptors can be extracted from the feature map, each an m-dimensional vector; let h_l[i, j] denote the descriptor at row i, column j; this descriptor is encoded by the codebook containing k classes, yielding a k-dimensional vector z_l[i, j]:
z_l[i, j] = max(0, W_l x_l[i, j] + b_l);
where W_l x_l[i, j] + b_l is the response of the layer-l descriptor after encoding by the codebook (W_l, b_l), and max(0, ·) sets the negative responses to zero, so the encoding result is the non-negative response of the codebook;
The encoding results of the local descriptors are aggregated into the full-image descriptor by the horizontal spatial pyramid pooling (HSPP) operation; the local encoding values, denoted z_l[i, j], are divided into several sub-regions according to spatial position, and horizontal max pooling is performed within each region separately; horizontal max pooling selects, per dimension, the maximum encoding value in each row as the pooling result, i.e., hspp(z_l) = max_j z_l[i, j];
The spatial sub-regions are divided along the vertical direction of the feature map, i.e., the feature map is divided into several blocks of equal height whose width is consistent with the original feature map; horizontal spatial pooling by hspp(z_l) = max_j z_l[i, j] is carried out separately on each sub-map, and the resulting pooling results are concatenated to obtain the full-image descriptor.
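The encoding and pooling of step 1.5 can be sketched as follows. This follows one plausible reading of the HSPP description — each equal-height band is max-pooled down to a single k-dimensional vector, which makes the result independent of the input width; all shapes and codebook values are toy assumptions.

```python
import numpy as np

def encode(feature_map, W, b):
    """Codebook encoding of each local descriptor: z[i, j] = max(0, W x[i, j] + b),
    keeping only the non-negative responses of the codebook detectors."""
    h, w, m = feature_map.shape
    z = feature_map.reshape(-1, m) @ W.T + b       # linear codebook response
    return np.maximum(0, z).reshape(h, w, -1)

def hspp(z, num_bands=2):
    """Horizontal spatial pyramid pooling sketch: split the encoded map into
    equal-height horizontal bands, max-pool each band over its spatial extent,
    and concatenate the band results into one fixed-length vector."""
    bands = np.array_split(z, num_bands, axis=0)   # divide along the height
    return np.concatenate([band.max(axis=(0, 1)) for band in bands])

rng = np.random.default_rng(0)
fmap = rng.random((6, 20, 16))   # toy feature map: height 6, width 20, dim 16
W = rng.standard_normal((5, 16)) # toy codebook with k = 5 detectors
b = rng.standard_normal(5)
z = encode(fmap, W, b)
desc = hspp(z, num_bands=2)
print(desc.shape)  # -> (10,): fixed length regardless of the input width
```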
Step 1.6: build the language identification convolutional neural network model Disc CNN;
See Fig. 3. The Disc CNN of this embodiment contains one hidden layer with ReLU activation, and the output layer contains C nodes; the output layer activations are passed through SoftMax to obtain probability values over the C language classes; the discriminative mid-level encoding used by Disc CNN is learned as a linear operation followed by a ReLU, and is fine-tuned during end-to-end training of the overall model; the complete Disc CNN is an end-to-end trainable model, optimized globally with gradient descent;
Step 1.7: parameter migration and end-to-end optimization;
Through the backpropagation algorithm of the neural network, the error gradients of the later stages are fed back to the earlier stages, and gradient descent is then used to adjust the parameters of all stages simultaneously;
In this embodiment, the fine-tuning optimization algorithm of Disc CNN is stochastic gradient descent (SGD). The initial learning rate is set to 10^-3, the momentum to 0.9, and the batch size to 128. In fc1, the network uses dropout to avoid overfitting during training. The feature extraction network is pre-trained in a separate CNN model.
Disc CNN is implemented in the C++ and Python programming languages.
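The SGD-with-momentum update used for fine-tuning can be written out explicitly. The sketch below uses the hyper-parameters stated in the embodiment (learning rate 10^-3, momentum 0.9) but minimizes a toy quadratic stand-in for the network loss, since the full model is not reproduced here.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=1e-3, momentum=0.9):
    """One SGD-with-momentum update, using the embodiment's hyper-parameters:
    v <- momentum * v - lr * grad;  param <- param + v."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Minimize f(x) = x^2 (gradient 2x) as a toy stand-in for the network loss:
x, v = np.array([5.0]), np.zeros(1)
for _ in range(2000):
    x, v = sgd_momentum_step(x, 2 * x, v)
print(float(abs(x)) < 0.1)  # -> True: the iterate approaches the minimum at 0
```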
Step 2: use the language identification convolutional neural network model Disc CNN to perform text language identification on pictures;
The Disc CNN network contains four convolutional layers in total: conv1, conv2, conv3, conv4; the heights of the output feature maps are 15, 7, 3, and 1, respectively, and the widths are determined by the input image width; discriminative patch discovery and the mid-level representation are carried out on the feature maps output by conv2, conv3, and conv4.
By applying the so-called discriminative convolutional neural network to scene text recognition, the present invention learns, through the discriminative clustering algorithm, a set of "discriminative patterns" from the deep convolutional features of an image, and finally classifies the global representation of the image with two fully connected layers.
The invention significantly improves on traditional image classification methods, which suffer from shortcomings when recognition is performed under varying conditions such as font, noise, and illumination. See Fig. 4, a schematic diagram of the test text recognition pictures of the embodiment; the figure shows samples from the collected data sets, largely pictures of natural scenes found on the web. The fonts, colors, typesetting, and writing styles of the text all vary strongly, and the backgrounds, frequently affected by illumination, camera angle, and the like, are relatively cluttered.
Table 1 below compares the accuracy of the present invention with that of conventional methods; the accuracy of the invention is clearly higher than that of the traditional methods.
Table 1
It should be understood that the parts not elaborated in this specification belong to the prior art; the above description of the preferred embodiment is relatively detailed and therefore cannot be considered a limitation on the protection scope of the invention patent. Those of ordinary skill in the art, under the inspiration of the present invention and without departing from the scope protected by the claims of the present invention, may also make substitutions or variations, which all fall within the protection scope of the present invention; the claimed scope of the invention is determined by the appended claims.
Claims (5)
1. A scene text language identification method based on discriminative convolutional neural networks, characterized by comprising the following steps:
Step 1: build the language identification convolutional neural network model Disc CNN;
Step 1.1: obtain a data set consisting of several pictures, each picture containing text in one language, with several written languages represented overall; divide the data set into a training set and a test set;
Step 1.2: crop the pictures in the training set to a preset size; from the cropped scene text image I, extract a convolutional feature hierarchy {h_l} (l = 1, …, L) with a convolutional neural network, where h_l denotes the feature map extracted at the l-th feature layer and L is the number of feature layers;
Step 1.3: extract dense local features from the convolutional feature hierarchy using the convolutional neural network;
Step 1.4: apply discriminative clustering to the dense local features extracted by the convolutional neural network, learning a discriminative codebook;
Step 1.5: encode each dense local feature with the discriminative codebook, then fuse the encoding results of the dense local features to obtain a fixed-dimension vector representation of the full image, denoted the full-image descriptor;
Step 1.6: build the language identification convolutional neural network model Disc CNN;
Disc CNN contains one hidden layer with ReLU activation, and the output layer contains C nodes; the output layer activations are passed through SoftMax to obtain probability values over the C language classes; the discriminative mid-level encoding used by Disc CNN is learned as a linear operation followed by a ReLU, and is fine-tuned during end-to-end training of the overall model; the complete Disc CNN is an end-to-end trainable model, optimized globally with gradient descent;
Step 1.7: parameter migration and end-to-end optimization;
Through the backpropagation algorithm of the neural network, the error gradients of the later stages are fed back to the earlier stages, and gradient descent is then used to adjust the parameters of all stages simultaneously;
Step 2: use the language identification convolutional neural network model Disc CNN to perform text language identification on pictures;
The Disc CNN network contains four convolutional layers in total: conv1, conv2, conv3, conv4; the heights of the output feature maps are 15, 7, 3, and 1, respectively, and the widths are determined by the input image width; discriminative patch discovery and the mid-level representation are carried out on the feature maps output by conv2, conv3, and conv4.
2. The scene text language identification method based on discriminative convolutional neural networks according to claim 1, characterized in that: in step 1.2, the feature hierarchy is extracted by a pre-trained convolutional neural network (CNN);
The input picture is first scaled to a fixed height and then fed into the CNN; the CNN extracts features by applying a series of convolution and max-pooling operations to the input picture;
Let k_l and b_l denote the convolution kernel and bias, respectively; the feature extraction process is then:
h_l = pool_max(σ(h_{l-1} * k_l + b_l));
where pool_max denotes the max-pooling operation, σ is any nonlinear activation function, * denotes the convolution operation, h_0 denotes the initial input picture, and the index l of h_l indexes the layer-l encoding of the input picture; the stride of all convolution operations is set to 1, the convolution kernel size is 3x3, and the boundary padding size is set to 1x1.
3. The scene text language identification method based on discriminative convolutional neural networks according to claim 1, characterized in that the specific implementation of step 1.4 includes the following sub-steps:
Step 1.4.1: apply discriminative clustering to the dense local features extracted by the convolutional neural network;
Discriminative clustering is carried out separately within each language class and each feature level; the discovery set is the set of local features extracted at level l from pictures of class c, and the natural set is the set of local features extracted from pictures of all other classes;
Step 1.4.2: learn a discriminative codebook;
The discriminative codebook consists of a group of linear classifiers; each classifier is equivalent to a detector for a discriminative patch, and its output indicates the response of a discriminative pattern; distinguishing languages requires capturing local image regions with discriminative power, and such a region is also called a discriminative patch, i.e., a kind of discriminative pattern.
4. The scene text language identification method based on discriminative convolutional neural networks according to claim 1, characterized in that: in step 1.5, suppose the dense local feature map has shape w1 × h1 × n1, where w1, h1, n1 denote the three dimensions of the feature map shape; each position of the feature map corresponds to a local descriptor, so w × h local descriptors can be extracted from the feature map, each an m-dimensional vector; let h_l[i, j] denote the descriptor at row i, column j; this descriptor is encoded by the codebook containing k classes, yielding a k-dimensional vector z_l[i, j] = max(0, W_l x_l[i, j] + b_l); where W_l x_l[i, j] + b_l is the response of the layer-l descriptor after encoding by the codebook (W_l, b_l), and max(0, ·) sets the negative responses to zero, so the encoding result is the non-negative response of the codebook.
5. The scene text language identification method based on discriminative convolutional neural networks according to claim 4, characterized in that: in step 1.5, the encoding results of the local descriptors are aggregated into the full-image descriptor by the horizontal spatial pyramid pooling (HSPP) operation; the local encoding values, denoted z_l[i, j], are divided into several sub-regions according to spatial position, and horizontal max pooling is performed within each region separately; horizontal max pooling selects, per dimension, the maximum encoding value in each row as the pooling result, i.e., hspp(z_l) = max_j z_l[i, j];
The spatial sub-regions are divided along the vertical direction of the feature map, i.e., the feature map is divided into several blocks of equal height whose width is consistent with the original feature map; horizontal spatial pooling by hspp(z_l) = max_j z_l[i, j] is carried out separately on each sub-map, and the resulting pooling results are concatenated to obtain the full-image descriptor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910759386.7A CN110533088A (en) | 2019-08-16 | 2019-08-16 | A kind of scene text Language Identification based on differentiated convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110533088A true CN110533088A (en) | 2019-12-03 |
Family
ID=68663497
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910759386.7A Pending CN110533088A (en) | 2019-08-16 | 2019-08-16 | A kind of scene text Language Identification based on differentiated convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110533088A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036296A (en) * | 2014-06-20 | 2014-09-10 | 深圳先进技术研究院 | Method and device for representing and processing image |
CN105917354A (en) * | 2014-10-09 | 2016-08-31 | 微软技术许可有限责任公司 | Spatial pyramid pooling networks for image processing |
CN105956517A (en) * | 2016-04-20 | 2016-09-21 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Motion identification method based on dense trajectory |
US20170011281A1 (en) * | 2015-07-09 | 2017-01-12 | Qualcomm Incorporated | Context-based priors for object detection in images |
CN109685152A (en) * | 2018-12-29 | 2019-04-26 | 北京化工大学 | A kind of image object detection method based on DC-SPP-YOLO |
Non-Patent Citations (1)
Title |
---|
SHI BAOGUANG: "Research on Natural Scene Text Detection and Recognition Methods Based on Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology Series * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105469047B (en) | Chinese detection method and system based on unsupervised learning deep learning network | |
CN108549893A (en) | A kind of end-to-end recognition methods of the scene text of arbitrary shape | |
CN105608454B (en) | Character detecting method and system based on text structure component detection neural network | |
CN107368831A (en) | English words and digit recognition method in a kind of natural scene image | |
CN107871101A (en) | A kind of method for detecting human face and device | |
CN110807422A (en) | Natural scene text detection method based on deep learning | |
Radwan et al. | Neural networks pipeline for offline machine printed Arabic OCR | |
CN108804397A (en) | A method of the Chinese character style conversion based on a small amount of target font generates | |
CN112686345B (en) | Offline English handwriting recognition method based on attention mechanism | |
CN105913053B (en) | A kind of facial expression recognizing method for singly drilling multiple features based on sparse fusion | |
CN111126404B (en) | Ancient character and font recognition method based on improved YOLO v3 | |
CN113762269B (en) | Chinese character OCR recognition method, system and medium based on neural network | |
Talukder et al. | Real-time bangla sign language detection with sentence and speech generation | |
Ahmed et al. | Bangladeshi sign language recognition using fingertip position | |
Malakar et al. | A holistic approach for handwritten Hindi word recognition | |
CN108664975A (en) | A kind of hand-written Letter Identification Method of Uighur, system and electronic equipment | |
CN108537109B (en) | OpenPose-based monocular camera sign language identification method | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
CN110517270A (en) | A kind of indoor scene semantic segmentation method based on super-pixel depth network | |
CN110348280A (en) | Water book character recognition method based on CNN artificial neural | |
Alghazo et al. | An online numeral recognition system using improved structural features–a unified method for handwritten Arabic and Persian numerals | |
CN109002771A (en) | A kind of Classifying Method in Remote Sensing Image based on recurrent neural network | |
Garg et al. | Optical character recognition using artificial intelligence | |
Rajnoha et al. | Handwriting comenia script recognition with convolutional neural network | |
Bankar et al. | Real time sign language recognition using deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191203 |