CN110135419A

CN110135419A - An end-to-end text recognition method for natural scenes

Info

Publication number: CN110135419A
Application number: CN201910371620.9A
Authority: CN
Inventors: 李武军; 陈雨
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-05-06
Filing date: 2019-05-06
Publication date: 2019-08-16
Anticipated expiration: 2039-05-06
Also published as: CN110135419B

Abstract

The invention discloses an end-to-end text recognition method for natural scenes, comprising training a framework with natural scene pictures and ground-truth annotations, and predicting the text regions and text content in natural scene pictures. In the training stage, pictures of natural scenes containing text are collected; a data set containing text positions and content is built; a standard end-to-end text recognition framework is defined; the detection part is trained with ground-truth detection annotations; the detection regions are optimized with a neighbor-correlation boundary optimization algorithm; the optimized detection regions are fed into the recognition part to train its parameters; and the trained framework parameters are saved to a data platform. In the test stage, the trained framework parameters are read, a test image is input, the detection stage detects the text regions, the detection regions are optimized with the neighbor-correlation boundary optimization algorithm, and the optimized detection regions are sent to the recognition part for text recognition.

Description

An end-to-end text recognition method for natural scenes

Technical field

The present invention relates to an end-to-end text recognition method for natural scenes based on a neighbor-correlation boundary optimization algorithm. It concerns end-to-end text recognition in natural scenes and is especially suited to cases where inaccurate detection region boundaries cause recognition to fail.

Background art

The goal of end-to-end text recognition in natural scenes is, given a natural scene picture containing text regions, to detect the positions of the text in the picture and also to recognize the text content at those positions. In this task, the accuracy of the recognition stage depends heavily on the accuracy of the detection stage: only when the detection stage accurately frames all the letters of a text can the recognition stage output an accurate result. In particular, existing end-to-end text frameworks predict boundaries inaccurately for long text or large text regions, which makes the subsequent recognition task harder.

Existing common post-processing algorithms, such as the non-maximum suppression (NMS) algorithm or the locality-aware non-maximum suppression (LANMS) algorithm, can only merge adjacent regions with a large intersection over union (IoU) and impose no requirement on boundary accuracy. As a result, the detection process may produce inaccurate boundaries, which affects the recognition result.
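The following minimal sketch illustrates why a merge criterion based only on intersection over union does not guarantee an accurate boundary. The two boxes are hypothetical, axis-aligned examples in `(x1, y1, x2, y2)` form and are not taken from the patent.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


full_line = (0, 0, 100, 10)  # hypothetical box covering a whole text line
cut_short = (0, 0, 90, 10)   # same line with the last tenth of the text cut off
overlap = iou(full_line, cut_short)
```

Here the truncated box still overlaps the full line with an IoU of 0.9, so an IoU-based criterion treats the two as nearly the same region even though the truncated box drops the final characters.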

Summary of the invention

Object of the invention: current end-to-end text recognition frameworks place no explicit requirement on the boundary accuracy of detection results. For long text and large text, the boundaries produced by existing frameworks are usually inaccurate and may not even enclose the text completely, which leads to inaccurate recognition results. To address this problem, the present invention designs a boundary optimization algorithm based on neighbor correlation and provides an end-to-end text recognition deep learning framework that uses the algorithm. The method describes the framework structure, the framework training process, and the framework test process, thereby solving the problem of inaccurate boundary prediction and improving the precision of the end-to-end task.

Technical solution: an end-to-end text recognition method for natural scenes, comprising training an end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm, and a test process in which the trained framework performs end-to-end recognition of text regions and text content in natural scenes.

The specific steps of training the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm are:

Step 100, input natural scene images, ground-truth labeled regions, and ground-truth labeled strings to the data processing platform;

Step 101, preprocess the input natural scene pictures with operations such as random rotation, sampling, and normalization;

Step 102, generate the ground-truth class map and the ground-truth geometry map from the ground-truth labeled regions, to serve as training supervision information;

Step 103, initialize the weights of the shared feature part, the detection part, and the recognition part of the whole framework;

Step 104, on the data processing platform, train the whole framework end to end using the natural scene images, ground-truth class maps, ground-truth geometry maps, and ground-truth labeled strings. The steps are: a natural scene image first passes through the shared feature part to obtain the shared feature map; the detection part generates detection results from the shared feature map; the neighbor-correlation boundary optimization algorithm optimizes the detection results; bilinear interpolation applied to the shared feature map samples the detection regions to obtain the recognition features; the recognition part obtains the recognition result from the input recognition features;

Step 105, export the weights of each part of the framework and save them to the storage system of the data processing platform.
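As a rough illustration of step 102, the sketch below builds a ground-truth class map and geometry map for one text region. The encoding is an assumption, not something the source specifies: the region is an axis-aligned rectangle, the class map is 1 inside the region and 0 elsewhere, and each pixel of the geometry map stores its distances to the top, right, bottom, and left sides of the region. The function name `make_ground_truth_maps` is hypothetical.

```python
def make_ground_truth_maps(height, width, box):
    """Build toy supervision maps for one axis-aligned text region.

    box = (x1, y1, x2, y2) with inclusive pixel bounds (assumed encoding).
    """
    x1, y1, x2, y2 = box
    class_map = [[0] * width for _ in range(height)]
    geo_map = [[(0, 0, 0, 0)] * width for _ in range(height)]
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            class_map[y][x] = 1
            # Distances from this pixel to the top, right, bottom, left sides.
            geo_map[y][x] = (y - y1, x2 - x, y2 - y, x - x1)
    return class_map, geo_map


class_map, geo_map = make_ground_truth_maps(4, 6, (1, 1, 4, 2))
```

Rotated or general quadrilateral regions would need a polygon rasterizer and a richer geometry encoding, which this sketch leaves out.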

The trained end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm is used to test end-to-end recognition of text regions and text content in natural scenes. The specific test steps are:

Step 200, input a natural scene image to the data processing platform;

Step 201, read the saved weights of each part of the trained framework, including the weights of the shared feature part, the detection part, and the recognition part;

Step 202, the natural scene image first passes through the shared feature part to obtain the shared feature map; the detection part generates detection results from the shared feature map; the neighbor-correlation boundary optimization algorithm optimizes the detection results; bilinear interpolation applied to the shared feature map samples the detection regions to obtain the recognition features; the recognition part obtains the recognition result from the input recognition features.
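The data flow of step 202 can be sketched as a composition of five stages. All callables below (`shared`, `detect`, `refine`, `sample`, `recognize`) are hypothetical stand-ins for the framework parts, and the toy lambdas only show how values are passed along; they are not real models.

```python
def run_pipeline(image, shared, detect, refine, sample, recognize):
    """Chain the framework parts in the order described in step 202."""
    feature_map = shared(image)                         # shared feature part
    class_map, geo_map, regions = detect(feature_map)   # detection part
    results = []
    for region in regions:
        refined = refine(class_map, geo_map, region)    # boundary optimization
        crop = sample(feature_map, refined)             # bilinear sampling
        results.append((refined, recognize(crop)))      # recognition part
    return results


# Toy stand-ins so the sketch runs end to end.
outputs = run_pipeline(
    image="img",
    shared=lambda img: "features(" + img + ")",
    detect=lambda fm: ("cls", "geo", [(0, 0, 9, 4)]),
    refine=lambda cls, geo, r: (r[0], r[1], r[2] + 1, r[3]),
    sample=lambda fm, r: (fm, r),
    recognize=lambda crop: "text",
)
```

The point of the ordering is that sampling uses the refined region rather than the raw detection, so the recognition part sees features cut along the optimized boundary.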

In the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm, the shared feature part uses a U-shaped architecture based on a residual neural network to extract the shared features; the U-shaped architecture obtains the shared features by cascading a first encoding module with a first decoding module;

The first encoding module comprises multiple layers of convolutional structures and a down-sampling structure between the convolutional structures of adjacent layers; the down-sampling structure down-samples the feature map output by the upper convolutional structure of two adjacent layers and feeds the down-sampled feature map into the lower convolutional structure of the two;

The first decoding module comprises multiple layers of convolutional structures and an up-sampling structure between the convolutional structures of adjacent layers; the up-sampling structure up-samples the feature map output by the lower convolutional structure of two adjacent layers and feeds the up-sampled feature map into the upper convolutional structure of the two.

The detection part applies several convolutions to the shared features to generate the predicted class map and the predicted geometry map, respectively.

The boundary optimization algorithm based on neighbor correlation takes into account that a point on the feature map predicts nearby boundaries accurately. Its inputs are the class map F_score and the geometry map F_geo predicted by the detection part, a single text region obtained from the class map and the geometry map, a score threshold s_t, and a confidence function f_c that depends on a distance threshold r_t. The steps are:

Step 501, for the single text region, obtain the set of points that belong only to this region and whose class probability on the class map F_score is greater than s_t;

Step 502, for each point p in the point set, compute the distances from the point to the top, right, bottom, and left sides of the region;

Step 503, compute the confidence from the distances and the confidence function f_c;

Step 504, for each point p in the point set, compute from the geometry map F_geo the region predicted by the point itself;

Step 505, from the confidences of all points in the point set and the regions they predict, compute the final region by weighted averaging.

In the weighted averaging that computes the final region, each region is a quadrilateral whose vertex coordinates are indexed by i = 1, 2, 3, 4, denoting the top-left, top-right, bottom-right, and bottom-left vertices, respectively; the weighting of the coordinates can then be described by the following formula:

The confidence function f_c of the algorithm may take the following form:

The recognition part obtains the predicted text string by cascading a second encoding module with a second decoding module; the second encoding module comprises multiple layers of convolutional structures and down-sampling structures between adjacent convolutional structures, and the second decoding module is based on a long short-term memory (LSTM) neural network structure.

For a detection result region, the bilinear interpolation sampling part finds the corresponding position on the shared feature map and performs bilinear interpolation sampling on it to obtain the recognition feature map.

Beneficial effects: compared with the prior art, the end-to-end text recognition method based on the neighbor-correlation boundary optimization algorithm provided by the present invention exploits the accuracy with which points on the feature map predict their neighbors to improve the precision of the detection result boundaries, thereby improving the results of the end-to-end task.

Brief description of the drawings

Fig. 1 is a flow chart of the neighbor-correlation boundary optimization algorithm implemented by the present invention;

Fig. 2 is a schematic diagram of the first decoding module and the U-shaped network in the shared feature layer of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm designed by the present invention;

Fig. 3 is a flow chart of the training process of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm designed by the present invention;

Fig. 4 is a flow chart of training the framework with the specific learning algorithm;

Fig. 5 is a flow chart of the test process of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm designed by the present invention.

Detailed description of the embodiments

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments merely illustrate the present invention and do not limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

The end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm is structurally divided into a shared feature part, a detection part, a boundary optimization algorithm part, a bilinear interpolation sampling part, and a recognition part.

The shared feature part may use a U-shaped architecture based on a residual neural network to extract the shared features; the U-shaped architecture obtains the shared features by cascading a first encoding module with a first decoding module. The first encoding module comprises multiple layers of convolutional structures and a down-sampling structure between the convolutional structures of adjacent layers; the down-sampling structure down-samples the feature map output by the upper convolutional structure of two adjacent layers and feeds the down-sampled feature map into the lower convolutional structure of the two. The first decoding module comprises multiple layers of convolutional structures and an up-sampling structure between the convolutional structures of adjacent layers; the up-sampling structure up-samples the feature map output by the lower convolutional structure of two adjacent layers and feeds the up-sampled feature map into the upper convolutional structure of the two.

The detection part applies several convolutions to the shared features to generate the predicted class map and the predicted geometry map, respectively.

The core idea of the boundary optimization algorithm based on neighbor correlation is that, when predicting a given boundary, only the points near that boundary are treated as high-confidence points in the weighted average. The flow is shown in Fig. 1. The inputs are the class map F_score and the geometry map F_geo predicted by the detection part, a single text region obtained from the class map and the geometry map, a score threshold s_t, and a confidence function f_c that depends on a distance threshold r_t. The steps are:

For the single text region, obtain the set of points that belong only to this region and whose class probability on the class map F_score is greater than s_t;

For each point p in the point set, compute the distances from the point to the top, right, bottom, and left sides of the region;

Compute the confidence from the distances and the confidence function f_c;

For each point p in the point set, compute from the geometry map F_geo the region predicted by the point itself;

From the confidences of all points in the point set and the regions they predict, compute the final region by weighted averaging.

In the weighted averaging that computes the final region, each region is a quadrilateral whose vertex coordinates are indexed by i = 1, 2, 3, 4, denoting the top-left, top-right, bottom-right, and bottom-left vertices, respectively; the weighting of the coordinates can then be described by the following formula:

The confidence function f_c may take the following form:

The threshold parameters can be chosen according to the practical problem; for example, s_t = 0.7 and r_t = 0.01.
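The five steps above can be sketched in simplified form as follows. This is an illustration under stated assumptions, not the patented formulas, which are not reproduced in this text. Assumed here: regions are axis-aligned rectangles `(x1, y1, x2, y2)`; each pixel of the geometry map F_geo stores predicted distances to the top, right, bottom, and left sides; distances to the current region sides are normalized by the region size; the confidence function f_c is a step function that is 1 when the normalized distance is at most r_t and 0 otherwise; and each side is averaged separately instead of averaging four vertex coordinates. The function name `refine_region` and the toy value `r_t=0.25` are hypothetical.

```python
def refine_region(class_map, geo_map, region, s_t=0.7, r_t=0.25):
    """Re-estimate each side of an axis-aligned region (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = region
    width, height = max(x2 - x1, 1), max(y2 - y1, 1)
    sums = [0.0] * 4      # weighted sums for the top, right, bottom, left sides
    weights = [0.0] * 4
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            if class_map[y][x] <= s_t:          # keep only confident text points
                continue
            # Normalized distances to the four sides of the current region.
            dists = ((y - y1) / height, (x2 - x) / width,
                     (y2 - y) / height, (x - x1) / width)
            top, right, bottom, left = geo_map[y][x]
            # Side positions of the region predicted by this point itself.
            predicted = (y - top, x + right, y + bottom, x - left)
            for k in range(4):
                conf = 1.0 if dists[k] <= r_t else 0.0   # assumed step-shaped f_c
                sums[k] += conf * predicted[k]
                weights[k] += conf
    current = (y1, x2, y2, x1)
    top, right, bottom, left = [
        sums[k] / weights[k] if weights[k] > 0 else current[k] for k in range(4)
    ]
    return (left, top, right, bottom)


# Toy case: the true text spans x = 2..11, but the detected region stops at x = 9.
# Points near the right side (x >= 7) predict the true right side; far points do not.
H, W = 5, 14
detected = (2, 1, 9, 3)
class_map = [[0.9 if 2 <= x <= 9 and 1 <= y <= 3 else 0.0 for x in range(W)]
             for y in range(H)]
geo_map = [[(y - 1, (11 - x) if x >= 7 else (9 - x), 3 - y, x - 2)
            for x in range(W)] for y in range(H)]
refined = refine_region(class_map, geo_map, detected)
```

In the toy case only the columns closest to the right side vote on it, so the refined right side moves from 9 to 11 while the other three sides stay in place.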

The recognition part obtains the predicted text string by cascading a second encoding module with a second decoding module; the second encoding module comprises multiple layers of convolutional structures and down-sampling structures between adjacent convolutional structures, and the second decoding module is based on a long short-term memory (LSTM) neural network structure.

For a detection result region, the bilinear interpolation sampling part finds the corresponding position on the shared feature map and performs bilinear interpolation sampling on it to obtain the recognition feature map.
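A minimal sketch of bilinear interpolation sampling is given below, assuming a single-channel feature map stored as a list of rows, an axis-aligned region given in feature-map coordinates, and a fixed output grid. The function names `bilinear` and `sample_region` and the fixed output size are illustrative assumptions.

```python
def bilinear(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a real-valued point (x, y)."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)       # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_region(feature, region, out_h, out_w):
    """Resample the region (x1, y1, x2, y2) onto an out_h x out_w grid."""
    x1, y1, x2, y2 = region
    out = []
    for i in range(out_h):
        y = y1 + (y2 - y1) * i / (out_h - 1) if out_h > 1 else (y1 + y2) / 2
        row = []
        for j in range(out_w):
            x = x1 + (x2 - x1) * j / (out_w - 1) if out_w > 1 else (x1 + x2) / 2
            row.append(bilinear(feature, x, y))
        out.append(row)
    return out


feature = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]   # value = 4 * y + x
patch = sample_region(feature, (0.5, 0.5, 2.5, 1.5), 2, 3)
```

Because the region corners may fall between feature-map cells, interpolation lets the recognition features follow the refined boundary without rounding it to the grid.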

Table 1 shows the first encoding module of the shared convolutional layers of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm. The module consists of a series of multi-layer convolutional structures and the down-sampling structures between the convolutional structures of adjacent layers. In the table, the output size is the spatial size of the feature map; [n × n, m] denotes a convolution kernel of size n × n with m channels; the residual convolution blocks of layers 2, 3, 4, and 5 may each be repeated 3 times.

Table 1

Fig. 2 shows the first decoding module and the U-shaped network of the shared convolutional layers of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm. The decoding module comprises multiple layers of convolutional structures and the up-sampling structures between the convolutional structures of adjacent layers, and the U-shaped network obtains the shared features by cascading the first encoding module with the first decoding module. In the figure, the left side of the U-shaped network is the first encoding module and the right side is the first decoding module; conv, concat, and upsampling denote convolution, channel-wise concatenation, and up-sampling, respectively.
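To make the encoder and decoder bookkeeping concrete, the sketch below traces feature-map shapes through a U-shaped network. The input size, the channel counts, the factor-of-two down-sampling and up-sampling, and the rule that each decoder convolution reduces the channels to those of the skip connection are all hypothetical placeholders chosen for illustration.

```python
def trace_u_shape(height, width, channels):
    """Trace (height, width, channels) through an assumed U-shaped network.

    channels lists the output channels of each encoder stage, top to bottom.
    """
    encoder = []
    h, w = height, width
    for c in channels:
        h, w = h // 2, w // 2               # assumed: each stage halves the size
        encoder.append((h, w, c))

    decoder = []
    h, w, c = encoder[-1]
    for skip_h, skip_w, skip_c in reversed(encoder[:-1]):
        h, w = h * 2, w * 2                 # up-sampling doubles the size
        if (h, w) != (skip_h, skip_w):
            raise ValueError("input size must be divisible by 2 ** len(channels)")
        concat_c = c + skip_c               # channel-wise concatenation (concat)
        c = skip_c                          # assumed: conv reduces to skip channels
        decoder.append((h, w, concat_c, c))
    return encoder, decoder


encoder, decoder = trace_u_shape(512, 512, [64, 128, 256, 512])
```

Each decoder entry records the spatial size, the channel count right after concatenation, and the channel count after the following convolution.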

Table 2 shows the second encoding module of the recognition part of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm. The module consists of a series of multi-layer convolutional structures and the down-sampling structures between the convolutional structures of adjacent layers. In the table, input, conv, and pool denote the input layer, the convolutional layer, and the pooling layer, respectively.

Table 2

The second decoding module of the recognition part of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm may use a structure based on a bidirectional long short-term memory (LSTM) neural network, which takes the recognition features as input and produces the predicted string.
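The source names a CTC loss for training the recognition part but does not describe how the decoder output is turned into a string. One common choice with CTC-trained models is best-path (greedy) decoding, sketched below as an assumption; the label layout, with index 0 reserved for the blank symbol, and the function name `ctc_greedy_decode` are hypothetical.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores: one list of per-label scores for each time step.
    labels: string whose character at index k is the label k (index 0 = blank).
    """
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank:
            chars.append(labels[index])
        previous = index
    return "".join(chars)


labels = "-ab"   # "-" stands in for the blank symbol
frames = [
    [0.1, 0.8, 0.1],   # a
    [0.1, 0.7, 0.2],   # a (repeat, merged)
    [0.9, 0.05, 0.05], # blank
    [0.2, 0.6, 0.2],   # a (new occurrence after a blank)
    [0.1, 0.2, 0.7],   # b
    [0.1, 0.1, 0.8],   # b (repeat, merged)
    [0.8, 0.1, 0.1],   # blank
]
decoded = ctc_greedy_decode(frames, labels)
```

The blank between the second and fourth frames is what allows the decoded string to contain two consecutive "a" characters.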

Fig. 3 is a flow chart of the training process of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm. The training process is as follows: when training starts, the framework first initializes the parameters (weights) of the shared feature part, the detection part, and the recognition part; after a series of corresponding natural scene pictures, ground-truth region positions, and ground-truth text strings are input to the data processing platform, the input natural scene pictures are preprocessed with operations such as random rotation, sampling, and normalization; the ground-truth class map and the ground-truth geometry map are generated from the ground-truth region positions; the shared feature layer produces the shared features from the input natural scene picture; the shared features pass through the detection part to give the predicted class map and the predicted geometry map, from which the detection regions are obtained; the boundary optimization algorithm is applied to the detection regions to give the boundary-optimized detection regions; according to the boundary-optimized detection regions, bilinear interpolation sampling is applied to the shared features to give the recognition features; the recognition features pass through the recognition part to give the predicted text string; losses are computed between the predicted and ground-truth class maps, between the predicted and ground-truth geometry maps, and between the predicted and ground-truth text strings, after which the gradients are back-propagated and the parameters updated; training continues in this way until a termination condition is reached (for example, the number of update rounds exceeds a threshold); the trained parameters are stored; end.

Fig. 4 is a flow chart of training the framework with the specific learning algorithm. The steps are as follows: when training starts, the parameters of each part of the framework are initialized; natural scene pictures, ground-truth region positions, and ground-truth text strings are input; the framework generates the ground-truth class map and the ground-truth geometry map from the ground-truth region positions; the framework processes the natural scene picture and generates the predicted class map, the predicted geometry map, and the predicted text string; the framework measures the loss between the ground-truth and predicted class maps with a cross-entropy loss function, the loss between the ground-truth and predicted geometry maps with an intersection-over-union loss function and a cosine loss function, and the loss between the ground-truth and predicted text strings with a CTC loss function; the framework computes the overall loss; the gradients are propagated back by the back-propagation algorithm; the framework updates the parameters of each part with the SGD algorithm; if a termination condition is reached (for example, the number of update rounds exceeds a threshold), the parameters are stored and training ends; otherwise, new natural scene pictures, ground-truth region positions, and ground-truth text strings are input and a new round of training starts.
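The detection losses and the SGD update named above can be sketched per pixel as follows. The formulas are assumptions, since the source names the loss functions without giving them: the class map uses binary cross-entropy, the geometry map stores distances to the top, right, bottom, and left sides plus a rotation angle, the IoU loss is the negative logarithm of a smoothed IoU, and the cosine loss is one minus the cosine of the angle difference. The CTC loss is omitted for brevity, and all function names are hypothetical.

```python
import math


def cross_entropy(p_pred, y_true, eps=1e-7):
    """Binary cross-entropy for one class-map pixel."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def iou_loss(pred, true):
    """-log(IoU) for boxes given as (top, right, bottom, left) distances
    from the same pixel; the +1 terms are an assumed smoothing constant."""
    pt, pr, pb, pl = pred
    tt, tr, tb, tl = true
    inter = (min(pt, tt) + min(pb, tb)) * (min(pr, tr) + min(pl, tl))
    union = (pt + pb) * (pr + pl) + (tt + tb) * (tr + tl) - inter
    return -math.log((inter + 1.0) / (union + 1.0))


def cosine_loss(theta_pred, theta_true):
    """Angle loss: zero when the predicted and true angles agree."""
    return 1.0 - math.cos(theta_pred - theta_true)


def sgd_step(params, grads, lr):
    """One plain stochastic gradient descent update."""
    return [p - lr * g for p, g in zip(params, grads)]


pixel_loss = (cross_entropy(0.9, 1)
              + iou_loss((2, 5, 2, 5), (2, 6, 2, 4))
              + cosine_loss(0.05, 0.0))
updated = sgd_step([1.0, -2.0], [0.5, -1.0], lr=0.1)
```

In a full training loop these per-pixel terms would be averaged over the maps, combined with the recognition loss into the overall loss, and differentiated by back-propagation before the SGD step.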

Fig. 5 is a flow chart of the test process of the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm. The test process is as follows: when the test starts, the data processing platform reads the trained parameters of each part and initializes the framework with them; the picture to be tested is read; the picture passes through the shared feature layer to give the shared features; the shared features pass through the detection part to give the predicted class map and the predicted geometry map, from which the detection regions are obtained; the boundary optimization algorithm is applied to the detection regions to give the boundary-optimized detection regions, that is, the predicted regions; according to the predicted regions, bilinear interpolation sampling is applied to the shared features to give the recognition features; the recognition features pass through the recognition part to give the predicted text string; finally, the predicted regions and the predicted text strings are output, and the end-to-end text recognition task ends.

Claims

1. An end-to-end text recognition method for natural scenes based on a neighbor-correlation boundary optimization algorithm, characterized in that it comprises training an end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm, and a test process in which the trained framework performs end-to-end recognition of text regions and text content in natural scenes;

The specific steps of training the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm are:

Step 101, preprocess the input natural scene pictures;

Step 104, on the data processing platform, train the whole framework end to end using the natural scene images, ground-truth class maps, ground-truth geometry maps, and ground-truth labeled strings. The steps are: a natural scene image first passes through the shared feature part to obtain the shared feature map; the detection part generates detection results from the shared feature map; the neighbor-correlation boundary optimization algorithm optimizes the detection results; bilinear interpolation applied to the shared feature map samples the detection regions to obtain the recognition features; the recognition part obtains the recognition result from the input recognition features;

Step 105, export the parameters of each part of the framework and save them to the storage system of the data processing platform.

2. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 1, characterized in that the trained end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm is used to test end-to-end recognition of text regions and text content in natural scenes, the specific test steps being:

Step 200, input a natural scene image to the data processing platform;

Step 201, read the saved weights of each part of the trained framework, including the weights of the shared feature part, the detection part, and the recognition part;

Step 202, the natural scene image first passes through the shared feature part to obtain the shared feature map; the detection part generates detection results from the shared feature map; the neighbor-correlation boundary optimization algorithm optimizes the detection results; bilinear interpolation applied to the shared feature map samples the detection regions to obtain the recognition features; the recognition part obtains the recognition result from the input recognition features.

3. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 1, characterized in that, in the end-to-end text recognition deep learning framework based on the neighbor-correlation boundary optimization algorithm, the shared feature part uses a U-shaped architecture based on a residual neural network to extract the shared features; the U-shaped architecture obtains the shared features by cascading a first encoding module with a first decoding module;

The first encoding module comprises multiple layers of convolutional structures and a down-sampling structure between the convolutional structures of adjacent layers; the down-sampling structure down-samples the feature map output by the upper convolutional structure of two adjacent layers and feeds the down-sampled feature map into the lower convolutional structure of the two;

The first decoding module comprises multiple layers of convolutional structures and an up-sampling structure between the convolutional structures of adjacent layers; the up-sampling structure up-samples the feature map output by the lower convolutional structure of two adjacent layers and feeds the up-sampled feature map into the upper convolutional structure of the two.

4. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 2, characterized in that the detection part applies several convolutions to the shared features to generate the predicted class map and the predicted geometry map, respectively.

5. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 1, characterized in that the boundary optimization algorithm based on neighbor correlation takes into account that a point on the feature map predicts nearby boundaries accurately. Its inputs are the class map F_score and the geometry map F_geo predicted by the detection part, a single text region obtained from the class map and the geometry map, a score threshold s_t, and a confidence function f_c that depends on a distance threshold r_t. The steps are:

Step 501, for the single text region, obtain the set of points that belong only to this region and whose class probability on the class map F_score is greater than s_t;

Step 505, from the confidences of all points in the point set and the regions they predict, compute the final region by weighted averaging.

6. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 5, characterized in that, in the weighted averaging of the algorithm that computes the final region, each region is a quadrilateral whose vertex coordinates are indexed by i = 1, 2, 3, 4, denoting the top-left, top-right, bottom-right, and bottom-left vertices, respectively; the weighting of the coordinates can then be described by the following formula:

7. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 5, characterized in that the confidence function f_c of the algorithm may take the following form:

8. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 2, characterized in that the recognition part obtains the predicted text string by cascading a second encoding module with a second decoding module; the second encoding module comprises multiple layers of convolutional structures and down-sampling structures between adjacent convolutional structures, and the second decoding module is based on a long short-term memory (LSTM) neural network structure.

9. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 1, characterized in that, for a detection result region, the bilinear interpolation sampling part finds the corresponding position on the shared feature map and performs bilinear interpolation sampling on it to obtain the recognition feature map.

10. The end-to-end text recognition method for natural scenes based on the neighbor-correlation boundary optimization algorithm according to claim 1, characterized in that the framework is trained via the following steps:

Step 701, perform a forward pass on the natural scene image;

Step 702, compute the error between the predicted class map and the ground-truth class map with a cross-entropy loss function; compute the error between the predicted geometry map and the ground-truth geometry map with an intersection-over-union loss function and a cosine similarity function; compute the error between the predicted string and the ground-truth string with a CTC loss function;

Step 703, obtain the parameter gradients with the back-propagation algorithm, and update the parameters with an optimization algorithm such as the stochastic gradient descent algorithm.