CN106570497A

CN106570497A - Text detection method and device for scene image

Info

Publication number: CN106570497A
Application number: CN201610878795.5A
Authority: CN
Inventors: 乔宇; 黄韡林; 田值; 贺通; 贺盼
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2016-10-08
Filing date: 2016-10-08
Publication date: 2017-04-19

Abstract

The invention provides a text detection method for a scene image. The method comprises: obtaining a scene image and extracting convolutional features of the scene image through a convolutional neural network model; feeding the convolutional features of the scene image into a recurrent neural network model to generate a text candidate box sequence; and post-processing the text candidate box sequence to obtain a text line region. Because the convolutional features are processed by the recurrent neural network model, the contextual text information carried by the convolutional features can be exploited, which improves the robustness of text detection. The method is not limited to a single-language classifier and can be applied to text detection in multiple languages. No complex prior conditions need to be set manually, which improves detection stability across different scenes. Computation shared across overlapping regions is reused, which effectively improves computational efficiency, and the end-to-end model simplifies the computation and processing steps.

Description

Text detection method and device for a scene image

Technical field

The invention belongs to the field of image detection, and more particularly relates to a text detection method and device for a scene image.

Background

Effectively recognizing text in scene images can bring great convenience to daily life. For example, by recognizing content such as a license plate in an image, information about the vehicle can be looked up automatically from the plate number. Accurately detecting and localizing the text regions in an image is the basis and prerequisite for text recognition in unconstrained natural scenes.

Current text detection methods generally fall into two groups: methods based on connected components and methods based on sliding windows.

A connected-component-based method uses a fast filter, such as MSER (Maximally Stable Extremal Regions) or SWT (Stroke Width Transform), to separate text pixels from non-text pixels. The text pixels are then greedily grouped into stroke or character candidates using low-level features, typically gray value, color or gradient. Connected-component-based methods cannot effectively detect text whose parts are not connected.

A sliding-window-based method densely slides a window over the image and applies a detection algorithm at each window position, using either hand-designed low-level features or a CNN (convolutional neural network). Although sliding-window methods do not suffer from the inability to handle non-connected text, they must deal with the multi-scale problem. Solving it usually requires sliding windows of several scales over the image separately, which increases the computational cost of text detection.

Current text detection methods are usually based on a single-character classifier applied to sliding-window candidate boxes. When the scene is complex, for example under the influence of natural conditions such as illumination, shadow and occlusion, the robustness of character classification suffers. Such methods also cannot process an image containing text in multiple languages in one pass, and their detection stability across different scenes is poor.

Summary of the invention

An object of the invention is to provide a text detection method for a scene image, so as to solve the problems of prior-art text detection methods: poor robustness of character classification, inability to process an image containing text in multiple languages in one pass, and poor detection stability across different scenes.

In a first aspect, an embodiment of the invention provides a text detection method for a scene image, the method comprising:

obtaining a scene image, and extracting convolutional features of the scene image through a convolutional neural network model;

feeding the convolutional features of the scene image into a recurrent neural network model to generate a text candidate box sequence; and

post-processing the text candidate box sequence to obtain a text line region.

With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting the convolutional features of the scene image through the convolutional neural network model comprises:

performing convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image; and

obtaining, with a predetermined sliding window, the convolutional features of the last convolutional layer for a specific region of the scene image.

With reference to the first aspect, in a second possible implementation of the first aspect, the step of feeding the convolutional features of the scene image into the recurrent neural network model to generate the text candidate box sequence comprises:

feeding the convolutional features row by row into the recurrent neural network model, and using each convolutional feature as a time-frame input of a long short-term memory (LSTM) model for training, to obtain fixed-width text candidate anchor boxes; and

regressing, detecting and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.

With reference to the first aspect, in a third possible implementation of the first aspect, the step of regressing, detecting and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence comprises:

obtaining supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating whether the text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper end of its nearest text line boundary, and a second offset distance from the text candidate anchor box to the lower end of its nearest text line boundary; and

according to the supervision information of the text candidate anchor boxes, selecting the text candidate anchor boxes whose score is greater than a predetermined value and combining them with the first offset distance and the second offset distance to generate the text candidate box sequence.

With reference to the first aspect or its first, second or third possible implementation, in a fourth possible implementation of the first aspect, the step of post-processing the text candidate box sequence to obtain the text line region comprises:

according to the height differences and horizontal distances of the text candidate box sequence, selecting the horizontal distances between text boxes at the horizontal edges and the text candidate boxes at the vertical edges, and generating the text line region.

In a second aspect, an embodiment of the invention provides a text detection device for a scene image, the device comprising:

a convolutional feature acquisition unit, configured to obtain a scene image and extract convolutional features of the scene image through a convolutional neural network model;

a text candidate box generation unit, configured to feed the convolutional features of the scene image into a recurrent neural network model to generate a text candidate box sequence; and

a text line region acquisition unit, configured to post-process the text candidate box sequence to obtain a text line region.

With reference to the second aspect, in a first possible implementation of the second aspect, the convolutional feature acquisition unit comprises:

a convolutional layer acquisition subunit, configured to perform convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image; and

a convolutional feature sliding subunit, configured to obtain, with a predetermined sliding window, the convolutional features of the last convolutional layer for a specific region of the scene image.

With reference to the second aspect, in a second possible implementation of the second aspect, the text candidate box generation unit comprises:

a text candidate anchor box training subunit, configured to feed the convolutional features into a bidirectional long short-term memory (LSTM) model and use each convolutional feature as a time-frame input of the LSTM model for training, to obtain fixed-width text candidate anchor boxes; and

a text candidate box detection subunit, configured to regress, detect and connect the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.

With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the text candidate box detection subunit comprises:

a supervision information acquisition module, configured to obtain supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating whether the text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper end of its nearest text line boundary, and a second offset distance from the text candidate anchor box to the lower end of its nearest text line boundary; and

a selection and comparison module, configured to select, according to the supervision information of the text candidate anchor boxes, the text candidate anchor boxes whose score is greater than a predetermined value, and to combine them with the first offset distance and the second offset distance to generate the text candidate box sequence.

With reference to the second aspect or its first, second or third possible implementation, in a fourth possible implementation of the second aspect, the text line region acquisition unit is specifically configured to:

In the invention, the convolutional features of the scene image are extracted by a convolutional neural network, the convolutional features are processed by a recurrent neural network to obtain a text candidate box sequence, and the text candidate box sequence is post-processed to generate a text line region. Because the method passes the convolutional features through a recurrent neural network model, it can use the contextual text information of the convolutional features, which helps improve the robustness of text detection. The method is not limited to a single-language classifier and can meet the detection requirements of multi-language text. No complex prior conditions need to be set manually, which helps improve detection stability across different scenes.

Brief description of the drawings

Fig. 1 is a flowchart of the text detection method for a scene image provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of the network structure for text detection in a scene image provided by an embodiment of the invention;

Fig. 3 is a schematic diagram of the detection process provided by an embodiment of the invention;

Fig. 4 is a schematic structural diagram of the text detection device for a scene image provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and not to limit it.

The purpose of the embodiments of the invention is to provide a text detection method and device for a scene image that address the following shortcomings of the prior art. Prior-art text detection methods for scene images are typically based on a single-character classifier applied to sliding-window candidate boxes, so they cannot make full use of context and sequence information. When the scene is complex, for example under natural conditions such as illumination, shadow and occlusion, a single-character classifier is not robust. The vast majority of current classifiers are based on a single language and cannot detect text in multiple languages in one pass. Most methods also require cumbersome post-processing with many manually set prior conditions, so the detector is not stable across different or complex scenes. In addition, most text detection methods involve a complicated pipeline and manual intervention and take a long time to run, which limits their practical value. To overcome these drawbacks, the invention proposes an end-to-end text detection method for scene images that can detect multi-language text with high detection efficiency. The method is described in detail below.

Fig. 1 shows the implementation flow of the text detection method for a scene image provided by an embodiment of the invention, detailed as follows:

In step S101, a scene image is obtained, and the convolutional features of the scene image are extracted through a convolutional neural network model.

Specifically, the scene image in the embodiments of the invention may be a dynamic video image or an acquired static photograph. For a dynamic video image, frames can be extracted from the video and processed individually.

The convolutional neural network is used to extract dense convolutional features. In a preferred embodiment of the invention, a convolutional neural network with the VGG architecture can be selected to extract the convolutional features.

In a preferred embodiment, the step of extracting the convolutional features of the scene image through the convolutional neural network model comprises:

The specific region may be the image region in which text has been preliminarily identified by the VGG model.

The VGG convolutional neural network may use the VGG16 architecture, and the predetermined sliding window may be a 3*3 sliding window. As shown in Fig. 2, VGG16 [10] produces multiple convolutional layers, and sliding-window processing can be applied to the feature map of the last convolutional layer (for example Conv5). With a 3*3 sliding window, each sliding-window operation yields a corresponding convolutional feature.
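
The sliding-window step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the feature map is a small nested list rather than real VGG16 output, the function name `sliding_windows` is hypothetical, and zero padding at the borders is an assumption that the text does not specify.

```python
def sliding_windows(feature_map, k=3):
    """Yield (row, col, window) for the k x k window centered on each cell.

    feature_map is a list of rows; each cell is a list of channel values.
    Positions outside the map are zero-padded (an assumption of this sketch).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pad = k // 2
    zero = [0.0] * channels
    for r in range(height):
        for c in range(width):
            window = []
            for dr in range(-pad, pad + 1):
                for dc in range(-pad, pad + 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    # Flatten the k*k*channels values into one feature vector.
                    window.extend(feature_map[rr][cc] if inside else zero)
            yield r, c, window


# Toy 2 x 4 feature map with 2 channels per cell (values are made up).
toy_map = [[[float(r), float(c)] for c in range(4)] for r in range(2)]
features = list(sliding_windows(toy_map))
```

Each entry of `features` holds the position of the window center and a flattened vector of 3*3*C values, which is the per-position convolutional feature that the later steps consume.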

The VGG architecture and GoogLeNet are two typical families of image classification models, and both are deep learning models. Unlike GoogLeNet, VGG inherits parts of the LeNet and AlexNet frameworks and is especially similar to AlexNet. VGG has five or more convolutional layers, two fully connected (fc) layers for image features and one fc layer for classification features, so like AlexNet it can be viewed as having eight parts in total. The VGG variants differ in how each of the first five convolutional parts is configured.

In step S102, the convolutional features of the scene image are fed into a recurrent neural network model to generate a text candidate box sequence.

The convolutional features of the scene image are fed into the recurrent neural network model for training one after another, in the order in which they were obtained from the scene image. The recurrent neural network model may be a bidirectional long short-term memory (LSTM, Long Short-Term Memory) model, which is a kind of temporal recurrent neural network.

Specifically, the step of feeding the convolutional features of the scene image into the recurrent neural network model to generate the text candidate box sequence may comprise:

feeding the convolutional features row by row into the recurrent neural network model, which may be a bidirectional LSTM model, and using each convolutional feature as a time-frame input of the LSTM model for training, to obtain fixed-width text candidate anchor boxes;

Specifically, a text candidate anchor box in the embodiments of the invention is a box that delimits a region of fixed width; the region inside a text candidate anchor box does not necessarily belong to a text region. The text candidate box sequence is the region formed by the one or more text candidate anchor boxes that, after detection, are found to meet the text requirement. The text line region mentioned later is the final text region obtained after the text candidate boxes are refined by post-processing.

After sliding-window extraction of convolutional features from the last convolutional layer of the convolutional neural network, the convolutional features are fed row by row into the bidirectional LSTM of the recurrent neural network, that is, they are input to the bidirectional LSTM one after another in the order in which the sliding window extracted them. Thus, as shown in Fig. 2, each row of the feature map of the last convolutional layer serves as one time sequence for the bidirectional LSTM and is input to it in order, and each point on the feature map corresponds to one time frame of the bidirectional LSTM. With this design, when the network makes a decision for the current frame it can use the information of the preceding and following frames, which makes the decision more accurate.
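
The idea that each row of the feature map is one sequence, and that each position sees both its left and right context, can be illustrated with a toy sketch. Note the assumptions: a plain tanh recurrence stands in for the LSTM cell, the features are scalars rather than vectors, the weights are fixed made-up constants instead of trained parameters, and the names `recurrent_pass` and `bidirectional_rows` are hypothetical.

```python
import math


def recurrent_pass(sequence, w_in=0.5, w_state=0.5):
    """Run a toy recurrence h_t = tanh(w_in * x_t + w_state * h_{t-1})."""
    state, states = 0.0, []
    for x in sequence:
        state = math.tanh(w_in * x + w_state * state)
        states.append(state)
    return states


def bidirectional_rows(feature_rows):
    """Treat each row as one sequence; return (forward, backward) per position."""
    outputs = []
    for row in feature_rows:
        forward = recurrent_pass(row)
        # Process the row right to left, then restore left-to-right order.
        backward = recurrent_pass(row[::-1])[::-1]
        outputs.append(list(zip(forward, backward)))
    return outputs


# Two rows of scalar features (made-up values); real features are vectors.
rows = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
context = bidirectional_rows(rows)
```

In the first row the middle position receives a non-zero forward state from the activation on its left, and in the second row it receives a non-zero backward state from the activation on its right. This is the sense in which the current frame can use information from the preceding and following frames.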

As shown in Fig. 2, the bidirectional LSTM layer has 256 outputs and can be connected to a fully connected layer with 512 outputs. Three output layers can follow this fully connected layer. For each text candidate anchor box, these three output layers respectively predict the probability that the anchor box contains text (represented by the score, Score), the vertical coordinates (Vertical coordinate) and the horizontal position offset (Side-refinement), and together they output a sequence of fixed-width text candidate boxes. The color of a text candidate box can indicate the probability (Score) that the box is text; Fig. 2 and the second image of Fig. 3 show only the candidate boxes whose score exceeds a certain threshold.

Specifically, the step of regressing, detecting and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence may comprise:

obtaining supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating whether the text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper end of its nearest text line boundary, and a second offset distance from the text candidate anchor box to the lower end of its nearest text line boundary;

During training, we add supervision information to the text candidate boxes so that the network converges to the desired result. The supervision information may include:

(1) Classification supervision indicating whether each text candidate anchor box is text.

(2) The offset of each text candidate anchor box relative to the upper end of its nearest text line boundary.

(3) The second offset distance from each text candidate anchor box to the lower end of its nearest text line boundary.

We can use the Softmax function as the classification loss function. For regression, we can use the SmoothL1Loss [12] function as the loss function.
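
The two loss functions named above can be written out as a short sketch. The text names the functions but does not give their formulas, so the definitions below are the commonly used ones and should be read as assumptions: softmax followed by negative log-likelihood for classification, and a smooth L1 loss that switches from quadratic to linear at an absolute error of 1.0.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Classification loss: negative log of the softmax probability of the target."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    return -math.log(exps[target_index] / sum(exps))


def smooth_l1(prediction, target):
    """Regression loss: quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5
```

In this setting the classification loss would apply to the text/non-text score of each anchor box, and the regression loss to its upper and lower boundary offsets.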

At test time, based on the supervision information above, we can obtain:

(1) The probability (score) that each text candidate anchor box is text.

(2) The offset of each text candidate anchor box relative to the upper end of its nearest text line boundary.

(3) The offset of each text candidate anchor box relative to the lower end of its nearest text line boundary.

From this information we can select the text candidate anchor boxes whose score exceeds a preset value, for example 0.7, and then add the corresponding offsets to obtain the text candidate box sequence.
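
The selection step can be sketched as follows. Only the 0.7 threshold comes from the text. The `Anchor` record, the 16-pixel anchor width and the sign convention (offsets are added to the anchor's own top and bottom edges) are hypothetical choices made for illustration.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    x_left: float         # left edge of the fixed-width anchor box
    top: float            # anchor top edge
    bottom: float         # anchor bottom edge
    score: float          # predicted probability that the anchor is text
    top_offset: float     # predicted offset to the text line's upper boundary
    bottom_offset: float  # predicted offset to the text line's lower boundary


ANCHOR_WIDTH = 16.0  # hypothetical fixed width in pixels


def select_candidates(anchors, threshold=0.7):
    """Keep anchors scoring above the threshold and shift their edges."""
    boxes = []
    for a in anchors:
        if a.score > threshold:
            boxes.append((a.x_left, a.top + a.top_offset,
                          a.x_left + ANCHOR_WIDTH, a.bottom + a.bottom_offset))
    # Order left to right so the result reads as a sequence.
    return sorted(boxes)


anchors = [
    Anchor(32.0, 10.0, 30.0, 0.9, -2.0, 1.0),
    Anchor(16.0, 10.0, 30.0, 0.8, 0.0, 0.0),
    Anchor(48.0, 10.0, 30.0, 0.3, 0.0, 0.0),
]
candidates = select_candidates(anchors)
```

The third anchor is dropped because its score is below the threshold; the remaining two become `(x1, y1, x2, y2)` candidate boxes ordered from left to right.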

In step S103, according to the height differences and horizontal distances of the text candidate box sequence, the horizontal distances between text boxes at the horizontal edges and the text candidate boxes at the vertical edges are selected, and a text line region is generated.

We can construct a reachability graph from a few simple cues and then connect the text candidate boxes by searching for connected components in this graph, which gives the final text line regions. The simple cues include the horizontal distance between text boxes and the height difference between text boxes. After the text candidate boxes have been connected into text line regions, we take the text candidate boxes at the edges and add the corresponding offsets, which effectively refines the horizontal coordinates of the text line. As shown in Fig. 3, after an image is input, a text candidate box sequence is generated, and further refinement yields the final detection result, that is, the text line region.
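
The graph-based connection step can be sketched with a small union-find structure. This is one possible reading, not a definitive implementation: the thresholds `max_gap` and `max_height_diff` are hypothetical, the height cue is implemented as a comparison of top and bottom edges, and the final side-refinement of the edge boxes is omitted.

```python
def group_into_lines(boxes, max_gap=20.0, max_height_diff=5.0):
    """Group (x1, y1, x2, y2) boxes into text lines via connected components."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def linked(a, b):
        # Horizontal gap between the two boxes (negative if they overlap).
        gap = max(a[0], b[0]) - min(a[2], b[2])
        edges_close = (abs(a[1] - b[1]) <= max_height_diff
                       and abs(a[3] - b[3]) <= max_height_diff)
        return gap <= max_gap and edges_close

    # Add an edge between every pair of boxes that satisfies both cues.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if linked(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    # Each text line is the bounding rectangle of one connected component.
    lines = [(min(b[0] for b in g), min(b[1] for b in g),
              max(b[2] for b in g), max(b[3] for b in g))
             for g in groups.values()]
    return sorted(lines)


boxes = [(0, 10, 16, 30), (16, 11, 32, 31), (32, 10, 48, 29), (200, 10, 216, 30)]
text_lines = group_into_lines(boxes)
```

The first three boxes are adjacent and vertically aligned, so they merge into one text line; the fourth is too far to the right and forms a line of its own.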

Because the method passes the convolutional features through a recurrent neural network model, it can use the contextual text information of the convolutional features, which helps improve the robustness of text detection. The method is not limited to a single-language classifier and can meet the detection requirements of multi-language text. No complex prior conditions need to be set manually, which helps improve detection stability across different scenes. In addition, the method reuses the computation of overlapping regions, which effectively improves computational efficiency, and the end-to-end model simplifies the computation and processing steps.

Fig. 4 is a schematic structural diagram of the text detection device for a scene image provided by an embodiment of the invention, detailed as follows:

The text detection device for a scene image according to the embodiment of the invention comprises:

a convolutional feature acquisition unit 401, configured to obtain a scene image and extract convolutional features of the scene image through a convolutional neural network model;

a text candidate box generation unit 402, configured to feed the convolutional features of the scene image into a recurrent neural network model to generate a text candidate box sequence; and

a text line region acquisition unit 403, configured to post-process the text candidate box sequence to obtain a text line region.

Preferably, the convolutional feature acquisition unit comprises:

Preferably, the text candidate box generation unit comprises:

Preferably, the text candidate box detection subunit comprises:

Preferably, the text line region acquisition unit is specifically configured to:

The text detection device for a scene image according to the embodiment of the invention corresponds to the text detection method for a scene image described above, and is not described again here.

In the several embodiments provided by the invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the methods described in the embodiments of the invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.

The above are only preferred embodiments of the invention and are not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A text detection method for a scene image, characterized in that the method comprises:

2. The method according to claim 1, characterized in that the step of extracting the convolutional features of the scene image through the convolutional neural network model comprises:

performing convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;

3. The method according to claim 1, characterized in that the step of feeding the convolutional features of the scene image into the recurrent neural network model to generate the text candidate box sequence comprises:

feeding the convolutional features row by row into the recurrent neural network model, and using each convolutional feature as a time-frame input of a long short-term memory (LSTM) model for training, to obtain fixed-width text candidate anchor boxes;

regressing, detecting and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.

4. The method according to claim 3, characterized in that the step of regressing, detecting and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence comprises:

obtaining supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating whether the text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper end of its nearest text line boundary, and a second offset distance from the text candidate anchor box to the lower end of its nearest text line boundary;

according to the supervision information of the text candidate anchor boxes, selecting the text candidate anchor boxes whose score is greater than a predetermined value and combining them with the first offset distance and the second offset distance to generate the text candidate box sequence.

5. The method according to any one of claims 1 to 4, characterized in that the step of post-processing the text candidate box sequence to obtain the text line region comprises:

according to the height differences and horizontal distances of the text candidate box sequence, selecting the horizontal distances between text boxes at the horizontal edges and the text candidate boxes at the vertical edges, and generating the text line region.

6. A text detection device for a scene image, characterized in that the device comprises:

a convolutional feature acquisition unit, configured to obtain a scene image and extract convolutional features of the scene image through a convolutional neural network model;

a text candidate box generation unit, configured to feed the convolutional features of the scene image into a recurrent neural network model to generate a text candidate box sequence;

a text line region acquisition unit, configured to post-process the text candidate box sequence to obtain a text line region.

7. The device according to claim 6, characterized in that the convolutional feature acquisition unit comprises:

a convolutional layer acquisition subunit, configured to perform convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;

a convolutional feature sliding subunit, configured to obtain, with a predetermined sliding window, the convolutional features of the last convolutional layer for a specific region of the scene image.

8. The device according to claim 6, characterized in that the text candidate box generation unit comprises:

a text candidate anchor box training subunit, configured to feed the convolutional features row by row into the recurrent neural network model and use each convolutional feature as a time-frame input of a long short-term memory (LSTM) model for training, to obtain fixed-width text candidate anchor boxes;

a text candidate box detection subunit, configured to regress, detect and connect the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.

9. The device according to claim 8, characterized in that the text candidate box detection subunit comprises:

a supervision information acquisition module, configured to obtain supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating whether the text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper end of its nearest text line boundary, and a second offset distance from the text candidate anchor box to the lower end of its nearest text line boundary;

a selection and comparison module, configured to select, according to the supervision information of the text candidate anchor boxes, the text candidate anchor boxes whose score is greater than a predetermined value, and to combine them with the first offset distance and the second offset distance to generate the text candidate box sequence.

10. The device according to any one of claims 6 to 9, characterized in that the text line region acquisition unit is specifically configured to: