CN106570497A - Text detection method and device for scene image - Google Patents


Info

Publication number
CN106570497A
Authority
CN
China
Prior art keywords
text
frame
text candidates
scene image
candidates
Legal status
Pending
Application number
CN201610878795.5A
Other languages
Chinese (zh)
Inventor
乔宇
黄韡林
田值
贺通
贺盼
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201610878795.5A
Publication of CN106570497A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text detection method for scene images. The method comprises: obtaining a scene image and extracting convolution features of the scene image through a convolutional neural network model; feeding the convolution features into a recurrent neural network model to generate a text candidate box sequence; and post-processing the text candidate box sequence to obtain a text line region. Because the convolution features are processed by the recurrent neural network model, the contextual text information of the convolution features is used in training, which improves the robustness of text detection. The method is not limited to a single-language classifier and can be applied to text detection in multiple languages. Complex prior conditions need not be preset manually, which improves detection stability across different scenes. Making efficient use of the computation shared by overlapping regions effectively improves computational efficiency, and the end-to-end model simplifies the computation and processing pipeline.

Description

Text detection method and device for scene images
Technical field
The invention belongs to the field of image detection, and in particular relates to a text detection method and device for scene images.
Background
Effectively recognizing the text in scene images can bring great convenience to people's lives. For example, by recognizing content such as the license plate in an image, related vehicle information can be searched automatically. Accurately detecting and locating the text regions in an image is the basis and prerequisite for text recognition in unconstrained natural scenes.
Current text detection methods generally include connected-component-based methods and sliding-window-based methods.
Connected-component-based methods use a fast method (such as MSER (Maximally Stable Extremal Regions) or SWT (Stroke Width Transform)) to separate text pixels from non-text pixels. The text pixels are then greedily grouped into stroke or character candidates using low-level features, typically gray value, color, or gradient. Connected-component-based methods cannot effectively detect text whose characters are not connected.
Sliding-window-based methods densely slide a window over the image and apply a detection algorithm (using hand-designed low-level features or a CNN (convolutional neural network)) at each window position. Although sliding-window methods do not share the problem of failing on non-connected text, they must handle multiple scales. To do so, windows of several scales are usually slid over the image separately, which increases the computation required for text detection.
Because current text detection methods are usually based on single-character classifiers applied to sliding-window candidate boxes, the robustness of character classification suffers in complex scenes affected by natural conditions such as illumination, shadow, and occlusion; images containing text in multiple languages cannot be processed in a single pass; and detection stability is poor across different scenes.
Summary of the invention
The object of the present invention is to provide a text detection method for scene images that solves the following problems of existing text detection methods: character classification is poorly robust, images containing multilingual text cannot be processed in a single pass, and detection stability is poor across different scenes.
In a first aspect, embodiments of the present invention provide a text detection method for scene images, the method comprising:
obtaining a scene image, and extracting convolution features of the scene image through a convolutional neural network model;
feeding the convolution features of the scene image into a recurrent neural network model to generate a text candidate box sequence;
post-processing the text candidate box sequence to obtain a text line region.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting the convolution features of the scene image through a convolutional neural network model comprises:
performing convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;
obtaining, with a predetermined sliding window, the convolution features of the last convolutional layer for a specific region of the scene image.
With reference to the first aspect, in a second possible implementation of the first aspect, the step of feeding the convolution features of the scene image into a recurrent neural network model to generate a text candidate box sequence comprises:
feeding the convolution features row by row into the recurrent neural network model, and training with each convolution feature as a time-frame input of a long short-term memory (LSTM) model, to obtain fixed-width text candidate anchor boxes;
regressing, detecting, and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the step of regressing, detecting, and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence comprises:
obtaining supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating that a text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper boundary of the text line nearest to it, and a second offset distance from the text candidate anchor box to the lower boundary of the text line nearest to it;
selecting, according to the supervision information of the text candidate anchor boxes, the text candidate anchor boxes whose score is greater than a predetermined value, and generating the text candidate box sequence in combination with the first offset distance and the second offset distance.
With reference to the first aspect or to the first, second, or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the step of post-processing the text candidate box sequence to obtain the text line region comprises:
according to the height differences and horizontal distances of the text candidate box sequence, selecting the horizontal distances between the text boxes at the horizontal edges and the text candidate boxes at the vertical edges, to generate the text line region.
In a second aspect, embodiments of the present invention provide a text detection device for scene images, the device comprising:
a convolution feature acquisition unit, configured to obtain a scene image and extract convolution features of the scene image through a convolutional neural network model;
a text candidate box generation unit, configured to feed the convolution features of the scene image into a recurrent neural network model to generate a text candidate box sequence;
a text line region acquisition unit, configured to post-process the text candidate box sequence to obtain a text line region.
With reference to the second aspect, in a first possible implementation of the second aspect, the convolution feature acquisition unit comprises:
a convolutional layer acquisition subunit, configured to perform convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;
a convolution feature sliding-window acquisition subunit, configured to obtain, with a predetermined sliding window, the convolution features of the last convolutional layer for a specific region of the scene image.
With reference to the second aspect, in a second possible implementation of the second aspect, the text candidate box generation unit comprises:
a text candidate anchor box training subunit, configured to feed the convolution features into a bidirectional long short-term memory model and train with each convolution feature as a time-frame input of the LSTM model, to obtain fixed-width text candidate anchor boxes;
a text candidate box detection subunit, configured to regress, detect, and connect the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the text candidate box detection subunit comprises:
a supervision information acquisition module, configured to obtain the supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating that a text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper boundary of the text line nearest to it, and a second offset distance from the text candidate anchor box to the lower boundary of the text line nearest to it;
a selection and comparison module, configured to select, according to the supervision information of the text candidate anchor boxes, the text candidate anchor boxes whose score is greater than a predetermined value, and to generate the text candidate box sequence in combination with the first offset distance and the second offset distance.
With reference to the second aspect or to the first, second, or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the text line region acquisition unit is specifically configured to:
according to the height differences and horizontal distances of the text candidate box sequence, select the horizontal distances between the text boxes at the horizontal edges and the text candidate boxes at the vertical edges, to generate the text line region.
In the present invention, convolution features of the scene image are extracted by a convolutional neural network, the convolution features are processed by a recurrent neural network to obtain a text candidate box sequence, and the text candidate box sequence is post-processed to generate a text line region. Because the convolution features are processed by a recurrent neural network model, the contextual text information of the convolution features can be used in training, which helps improve the robustness of text detection. The method is not limited to a single-language classifier, meets the requirements of multilingual text detection, and does not require complex manually set prior conditions, which helps improve detection stability across different scenes.
Brief description of the drawings
Fig. 1 is a flowchart of the text detection method for scene images provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the network structure for text detection in scene images provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the detection process provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the text detection device for scene images provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
The embodiments of the present invention aim to provide a text detection method and device for scene images that address the following shortcomings of existing scene-image text detection methods. They are typically based on single-character classifiers applied to sliding-window candidate boxes and cannot make full use of context and sequence information. In complex scenes affected by natural conditions such as illumination, shadow, and occlusion, single-character classifiers are not robust. Most current classifiers are single-language classifiers and cannot process multilingual text in one detection pass. Most methods require cumbersome post-processing with many manually set prior conditions, so the detector is not stable across different or complex scenes. In addition, most text detection methods involve complicated pipelines and manual intervention, have long run times, and have limited practical value. To address these drawbacks, the present invention proposes an end-to-end text detection method for scene images that can detect multilingual text with high detection efficiency. The method is described in detail below:
Fig. 1 shows the implementation flow of the text detection method for scene images provided by an embodiment of the present invention, detailed as follows:
In step S101, a scene image is obtained, and convolution features of the scene image are extracted through a convolutional neural network model.
Specifically, the scene image in the embodiments of the present invention may be a dynamic video image or a captured still photo. For a dynamic video image, frames can be extracted from the video and processed.
The convolutional neural network is used to extract dense convolution features. In a preferred embodiment of the present invention, a convolutional neural network with the VGG architecture may be selected to extract the convolution features.
In a preferred embodiment, the step of extracting the convolution features of the scene image through a convolutional neural network model comprises:
performing convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;
obtaining, with a predetermined sliding window, the convolution features of the last convolutional layer for a specific region of the scene image.
The specific region may be the image region containing the text preliminarily identified by the VGG model.
The VGG convolutional neural network may use the VGG16 architecture, and the predetermined sliding window may be 3×3 in size. As shown in Fig. 2, VGG16 [10] produces multiple convolutional layers, and sliding-window processing is applied to the feature map of the last convolutional layer (for example, conv5). With a 3×3 sliding window, the convolution feature corresponding to each sliding-window operation is obtained.
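The 3×3 sliding window over the last convolutional layer can be sketched in Python as follows. This is a minimal, non-authoritative illustration, assuming the feature map is a nested list indexed as [row][column][channel], that positions outside the map are zero-padded, and that the window contents are simply concatenated into one vector per position; the function name `sliding_window_features` is hypothetical. A concatenation like this, followed by a linear layer, is equivalent to a 3×3 convolution.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def sliding_window_features(fmap: FeatureMap, k: int = 3) -> FeatureMap:
    """Concatenate the k x k neighbourhood of every position into one vector.

    Assumption: positions outside the feature map are zero-padded, so the
    output keeps the same number of rows and columns as the input.
    """
    h, w = len(fmap), len(fmap[0])
    c = len(fmap[0][0])
    r = k // 2
    out: FeatureMap = []
    for i in range(h):
        row = []
        for j in range(w):
            vec: List[float] = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        vec.extend(fmap[y][x])
                    else:
                        vec.extend([0.0] * c)  # zero padding outside the map
            row.append(vec)
        out.append(row)
    return out
```

Each output vector has 3 × 3 × C entries, so for a 512-channel conv5 map every position yields a 4,608-dimensional feature.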
The VGG architecture and GoogLeNet are two typical image classification models, and both are deep learning models. Unlike GoogLeNet, VGG inherits parts of the LeNet and AlexNet architectures and is especially similar to AlexNet. VGG has five groups of convolutional layers, two fully connected (fc) layers that produce image features, and one fc classification layer, so, like AlexNet, it can be viewed as eight parts in total. The first five parts are the convolutional groups, each with a different configuration.
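To make the VGG16 layout concrete, the sketch below lists the standard VGG16 convolutional configuration (13 convolutional layers in five groups) and computes the size of the last convolutional feature map. It assumes 3×3 convolutions with padding 1, which preserve spatial size, and 2×2 max pooling with stride 2 and floor rounding; the names `VGG16_CFG` and `conv5_feature_size` are illustrative and do not come from the patent.

```python
from typing import List, Tuple, Union

# VGG16 configuration: integers are 3x3 conv output channels,
# "M" is 2x2 max pooling with stride 2.
VGG16_CFG: List[Union[int, str]] = [
    64, 64, "M",
    128, 128, "M",
    256, 256, 256, "M",
    512, 512, 512, "M",
    512, 512, 512, "M",
]


def conv5_feature_size(height: int, width: int) -> Tuple[int, int, int]:
    """Return (rows, cols, channels) of the last conv layer (conv5_3).

    The final pooling layer is excluded because conv5_3 comes before it.
    """
    h, w, c = height, width, 3
    last_conv = max(i for i, v in enumerate(VGG16_CFG) if v != "M")
    for layer in VGG16_CFG[: last_conv + 1]:
        if layer == "M":
            h, w = h // 2, w // 2  # pooling halves the spatial size (floor)
        else:
            c = layer  # a padded 3x3 conv keeps h, w and sets the channel count
    return h, w, c
```

For a 600 × 900 input this gives a 37 × 56 × 512 map: four pooling layers precede conv5_3, so each conv5 position corresponds to a stride of 16 pixels in the input.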
In step S102, the convolution features of the scene image are fed into a recurrent neural network model to generate a text candidate box sequence.
The convolution features of the scene image are fed one after another into the recurrent neural network model for training, in the order in which they were obtained from the scene image. The recurrent neural network model may be a bidirectional long short-term memory (LSTM) model, a type of recurrent neural network over time.
Specifically, the step of feeding the convolution features of the scene image into the recurrent neural network model to generate the text candidate box sequence may comprise:
feeding the convolution features row by row into the recurrent neural network model, which may be a bidirectional LSTM model, and training with each convolution feature as a time-frame input of the LSTM model, to obtain fixed-width text candidate anchor boxes;
regressing, detecting, and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.
Specifically, a text candidate anchor box in the embodiments of the present invention is a box that delimits a region of fixed width; the region inside a text candidate anchor box does not necessarily belong to a text region. The text candidate box sequence denotes the region formed by the one or more text candidate anchor boxes that, after detection, are selected as meeting the text requirement. The text line region mentioned later is the standard text region obtained after the text candidate boxes are finely adjusted in post-processing.
After the convolution features are extracted from the last convolutional layer of the convolutional neural network with the sliding window, they are input row by row into the bidirectional LSTM of the recurrent neural network, in the order in which the sliding window extracted them. Thus, as shown in Fig. 2, each row of the feature map of the last convolutional layer is treated as one time sequence of the bidirectional LSTM and input into it in order, and each point on the feature map corresponds to one time frame of the bidirectional LSTM. With this design, when the network makes a judgment about the current frame, it can use information from both the preceding and the following frames, which makes the judgment more accurate.
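The row-wise sequencing described above can be illustrated with the following simplified Python sketch. To keep it short and self-contained, it substitutes a single-unit Elman (tanh) recurrence with scalar weights for the LSTM cell and uses one scalar feature per position; these simplifications, the weight values, and all function names are assumptions. What it shows is the data flow: each feature-map row is one sequence, each position is one time frame, and the output at each frame pairs a left-to-right state with a right-to-left state, so every frame receives context from both sides.

```python
import math
from typing import List, Tuple


def run_rnn(seq: List[float], w_in: float, w_rec: float) -> List[float]:
    """Toy single-unit Elman RNN (stand-in for an LSTM): h_t = tanh(w_in*x_t + w_rec*h_{t-1})."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_row(seq: List[float], w_in: float = 1.0,
                      w_rec: float = 0.5) -> List[Tuple[float, float]]:
    """Pair the forward and backward states for every time frame of one row."""
    fwd = run_rnn(seq, w_in, w_rec)
    bwd = run_rnn(seq[::-1], w_in, w_rec)[::-1]  # reverse back so index t lines up
    return list(zip(fwd, bwd))


def encode_feature_map(rows: List[List[float]]) -> List[List[Tuple[float, float]]]:
    """Treat every feature-map row as an independent sequence."""
    return [bidirectional_row(row) for row in rows]
```

With the input row `[1.0, 0.0, 0.0]`, the forward state at the last position is still positive because it carries the left context, while the backward state there is 0.0 because nothing lies to its right.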
As shown in Fig. 2 there are 256 outputs in the two-way shot and long term memory modelses LSTM layers, and may be coupled to one It is individual to have 512 full connections for exporting.Can connect three output layers behind this connection.These three output layers can respectively predict text This candidate anchor frame is the probability (can be represented by score value Score) of the text candidates anchor frame with text, on vertical direction Vertical coordinate (Vertical coordinate) and horizontal level skew (Side-refinement), and export it is fixed wide The sequence of the text candidates frame of degree.Can by the color of text candidates frame represent text candidate frame be text probability (Score) candidate frame of the fraction more than certain threshold value, is only depicted in second figure of Fig. 2 and Fig. 3.
Specifically, the lower edges of the text candidates anchor frame to the fixed width of the present invention are returned, detected With connect, generate text candidates frame sequence step can include:
The supervision message of the text candidates anchor frame is obtained, the supervision message includes:Text candidates anchor frame is text Score value, text candidates anchor apart from oneself nearest text row bound upper end the first offset distance, text candidates anchor frame distance from Second offset distance of oneself nearest text row bound lower end;
According to the supervision message of the text candidates anchor frame, score value is selected more than the text candidates anchor frame of predetermined value, knot First offset distance and the second offset distance are closed, text candidates frame sequence is generated.
In the training process, we add supervision message to text candidates frame so that network can converge to us and wish The result of prestige.Supervision message can include:
(1) each text candidates anchor frame is the classification supervision message of text.
(2) each text candidates anchor frame is relative to the side-play amount apart from oneself nearest text row bound upper end.
(3) second offset distance of each text candidates anchor frame apart from oneself nearest text row bound lower end.
We can use Softmax functions as our Classification Loss function.For recurrence, we can use SmoothL1Loss [12] functions are used as our loss function.
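The two loss types named above can be written out as a minimal Python sketch. It assumes two classes (non-text = 0, text = 1), the standard Smooth L1 form with its transition point at 1.0, and that the regression loss is counted only for text anchors with an illustrative weight `reg_weight`; the positives-only rule, the weight, and the function names are assumptions, not details stated in the patent.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_cross_entropy(logits: List[float], label: int) -> float:
    """Classification loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def anchor_loss(logits: List[float], label: int, pred_offsets: List[float],
                target_offsets: List[float], reg_weight: float = 1.0) -> float:
    """Per-anchor loss; offsets are (upper, lower) distances to the nearest text line.

    Assumption: offsets are only regressed for text anchors (label == 1).
    """
    cls = softmax_cross_entropy(logits, label)
    if label != 1:
        return cls
    reg = sum(smooth_l1(p, t) for p, t in zip(pred_offsets, target_offsets))
    return cls + reg_weight * reg
```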
During testing, the network outputs the following, corresponding to the above supervision information:
(1) the probability (score) that each text candidate anchor box is text;
(2) the offset of each text candidate anchor box relative to the upper boundary of the nearest text line;
(3) the offset of each text candidate anchor box relative to the lower boundary of the nearest text line.
From these outputs, the text candidate anchor boxes whose score exceeds a preset value (for example, 0.7) are selected, and the corresponding offsets are added to obtain the text candidate box sequence.
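Selecting anchors and applying the predicted offsets could look like the Python sketch below. It assumes image coordinates with y increasing downward, signed offsets that are added to the anchor's upper and lower edges, and a hypothetical anchor width of 16 pixels (the patent states only that the width is fixed); the 0.7 threshold is the example value from the text, and the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

ANCHOR_WIDTH = 16  # hypothetical fixed anchor width in pixels

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class AnchorPrediction:
    x: float         # left edge of the fixed-width anchor
    top: float       # anchor's upper edge (y grows downward)
    bottom: float    # anchor's lower edge
    score: float     # predicted probability that the anchor is text
    d_top: float     # predicted first offset distance (to nearest text-line upper boundary)
    d_bottom: float  # predicted second offset distance (to nearest text-line lower boundary)


def decode_candidates(preds: List[AnchorPrediction], threshold: float = 0.7) -> List[Box]:
    """Keep anchors scoring above the threshold and shift their upper/lower edges."""
    boxes: List[Box] = []
    for p in preds:
        if p.score > threshold:
            boxes.append((p.x, p.top + p.d_top, p.x + ANCHOR_WIDTH, p.bottom + p.d_bottom))
    return boxes
```

Note the strict comparison: an anchor scoring exactly 0.7 is discarded, matching "more than a preset value" in the text.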
In step S103, the text candidate box sequence is post-processed: according to the height differences and horizontal distances of the text candidate box sequence, the horizontal distances between the text boxes at the horizontal edges and the text candidate boxes at the vertical edges are selected, to generate the text line region.
A reachable graph can be constructed from a few simple cues, and connected components are then searched in this graph to connect the text candidate boxes into the final text line regions. These simple cues include the horizontal distance between text boxes and the height difference between text boxes. After the text candidate boxes are connected into a text line region, the text candidate boxes at the edges are selected and the corresponding offsets are added, which effectively refines the horizontal coordinates of the text line. As shown in Fig. 3, after an image is input, a text candidate box sequence is generated and then further refined to obtain the final detection result, that is, the text line region.
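The graph construction and connected-component search can be sketched with a union-find structure in Python, as below. This is a simplified illustration under stated assumptions: two candidate boxes are linked when their horizontal gap and their height difference are both within hypothetical thresholds (`max_gap`, `max_height_diff`), each connected component becomes one text line bounding box, and the horizontal offsets predicted for the leftmost and rightmost boxes (passed in as a hypothetical `side_offsets` mapping from box index to offset) refine the line's left and right coordinates.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def _find(parent: List[int], i: int) -> int:
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def group_text_lines(boxes: List[Box], side_offsets: Dict[int, float],
                     max_gap: float = 50.0, max_height_diff: float = 10.0) -> List[Box]:
    """Connect candidate boxes into text lines and refine each line's horizontal ends."""
    n = len(boxes)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = boxes[i], boxes[j]
            gap = max(a[0], b[0]) - min(a[2], b[2])  # horizontal distance (<= 0 if overlapping)
            dh = abs((a[3] - a[1]) - (b[3] - b[1]))  # height difference
            if gap <= max_gap and dh <= max_height_diff:
                parent[_find(parent, i)] = _find(parent, j)  # add a graph edge
    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(_find(parent, i), []).append(i)
    lines: List[Box] = []
    for members in groups.values():
        left = min(members, key=lambda k: boxes[k][0])   # box at the left edge
        right = max(members, key=lambda k: boxes[k][2])  # box at the right edge
        x1 = boxes[left][0] + side_offsets.get(left, 0.0)
        x2 = boxes[right][2] + side_offsets.get(right, 0.0)
        y1 = min(boxes[k][1] for k in members)
        y2 = max(boxes[k][3] for k in members)
        lines.append((x1, y1, x2, y2))
    return sorted(lines)
```

Because union-find merges transitively, a chain of nearby boxes becomes one line even when its two ends are far apart, which is the behavior a connected-component search gives.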
Because this method processes the convolution features with a recurrent neural network model, the contextual text information of the convolution features can be used in training, which helps improve the robustness of text detection. The method is not limited to a single-language classifier, meets the requirements of multilingual text detection, and does not require complex manually set prior conditions, which helps improve detection stability across different scenes. In addition, the method makes efficient use of the computation over overlapping regions, which effectively improves computational efficiency, and the end-to-end model simplifies the computation and processing steps.
Fig. 4 shows the structure of the text detection device for scene images provided by an embodiment of the present invention, detailed as follows:
The text detection device for scene images in the embodiment of the present invention comprises:
a convolution feature acquisition unit 401, configured to obtain a scene image and extract convolution features of the scene image through a convolutional neural network model;
a text candidate box generation unit 402, configured to feed the convolution features of the scene image into a recurrent neural network model to generate a text candidate box sequence;
a text line region acquisition unit 403, configured to post-process the text candidate box sequence to obtain a text line region.
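The division of the device into units 401 to 403 can be mirrored by a small composition sketch in Python. The class and field names are hypothetical, and each unit is modeled as a plain callable rather than any particular framework component. Because the units are injected separately, the same pipeline works whether the stages live in one module or several, which matches the later remark that units may be combined or kept separate.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SceneTextDetector:
    extract_features: Callable[[Any], Any]            # convolution feature acquisition unit 401
    generate_candidates: Callable[[Any], List[Any]]   # text candidate box generation unit 402
    post_process: Callable[[List[Any]], List[Any]]    # text line region acquisition unit 403

    def detect(self, image: Any) -> List[Any]:
        """Run the three units in order: features -> candidate boxes -> text line regions."""
        features = self.extract_features(image)
        candidates = self.generate_candidates(features)
        return self.post_process(candidates)
```

In a test harness, each stage can be replaced by a stub, for example a function that doubles its inputs, to check the wiring without a trained network.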
Preferably, the convolution feature acquisition unit comprises:
a convolutional layer acquisition subunit, configured to perform convolutional neural network operations on the scene image through a VGG convolutional neural network to obtain the convolutional layers of the scene image;
a convolution feature sliding-window acquisition subunit, configured to obtain, with a predetermined sliding window, the convolution features of the last convolutional layer for a specific region of the scene image.
Preferably, the text candidate box generation unit comprises:
a text candidate anchor box training subunit, configured to feed the convolution features into a bidirectional long short-term memory model and train with each convolution feature as a time-frame input of the LSTM model, to obtain fixed-width text candidate anchor boxes;
a text candidate box detection subunit, configured to regress, detect, and connect the upper and lower edges of the fixed-width text candidate anchor boxes to generate the text candidate box sequence.
Preferably, the text candidate box detection subunit comprises:
a supervision information acquisition module, configured to obtain the supervision information of the text candidate anchor boxes, the supervision information comprising: a score indicating that a text candidate anchor box is text, a first offset distance from the text candidate anchor box to the upper boundary of the text line nearest to it, and a second offset distance from the text candidate anchor box to the lower boundary of the text line nearest to it;
a selection and comparison module, configured to select, according to the supervision information of the text candidate anchor boxes, the text candidate anchor boxes whose score is greater than a predetermined value, and to generate the text candidate box sequence in combination with the first offset distance and the second offset distance.
Preferably, the text line region acquisition unit is specifically configured to:
according to the height differences and horizontal distances of the text candidate box sequence, select the horizontal distances between the text boxes at the horizontal edges and the text candidate boxes at the vertical edges, to generate the text line region.
The text detection device for scene images in the embodiment of the present invention corresponds to the text detection method for scene images described above and is not described again here.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiment described above is only illustrative: the division into units is only a logical functional division, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the methods described in the embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A text detection method for a scene image, characterized in that the method comprises:
acquiring a scene image, and extracting convolutional features of the scene image by means of a convolutional neural network model;
feeding the convolutional features of the scene image into a recurrent neural network model to generate a sequence of text candidate boxes; and
post-processing the sequence of text candidate boxes to obtain text line regions.
2. The method according to claim 1, characterized in that the step of extracting the convolutional features of the scene image by means of the convolutional neural network model comprises:
performing convolutional neural network operations on the scene image with a VGG convolutional neural network to obtain the convolutional layers of the scene image; and
obtaining, with a predetermined sliding window, the convolutional features of the last convolutional layer for a specific region of the scene image.
3. The method according to claim 1, characterized in that the step of feeding the convolutional features of the scene image into the recurrent neural network model to generate the sequence of text candidate boxes comprises:
feeding the convolutional features row by row into the recurrent neural network model, and training with each convolutional feature as one time-step input of a long short-term memory (LSTM) model, to obtain fixed-width text candidate anchor boxes; and
regressing, detecting, and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the sequence of text candidate boxes.
4. The method according to claim 3, characterized in that the step of regressing, detecting, and connecting the upper and lower edges of the fixed-width text candidate anchor boxes to generate the sequence of text candidate boxes comprises:
obtaining supervision information for the text candidate anchor boxes, the supervision information comprising: a score indicating that a text candidate anchor box is text; a first offset distance from the text candidate anchor box to the upper boundary of the nearest text line; and a second offset distance from the text candidate anchor box to the lower boundary of the nearest text line; and
according to the supervision information, selecting the text candidate anchor boxes whose score exceeds a predetermined value and, in combination with the first offset distance and the second offset distance, generating the sequence of text candidate boxes.
5. The method according to any one of claims 1 to 4, characterized in that the step of post-processing the sequence of text candidate boxes to obtain the text line regions comprises:
selecting, according to the height differences and horizontal distances of the sequence of text candidate boxes, the horizontal distances between text boxes at the horizontal edges and the text candidate boxes at the vertical edges, and generating the text line regions.
6. A text detection device for a scene image, characterized in that the device comprises:
a convolutional feature acquisition unit, configured to acquire a scene image and extract convolutional features of the scene image by means of a convolutional neural network model;
a text candidate box generation unit, configured to feed the convolutional features of the scene image into a recurrent neural network model to generate a sequence of text candidate boxes; and
a text line region acquisition unit, configured to post-process the sequence of text candidate boxes to obtain text line regions.
7. The device according to claim 6, characterized in that the convolutional feature acquisition unit comprises:
a convolutional layer acquisition subunit, configured to perform convolutional neural network operations on the scene image with a VGG convolutional neural network to obtain the convolutional layers of the scene image; and
a sliding-window feature acquisition subunit, configured to obtain, with a predetermined sliding window, the convolutional features of the last convolutional layer for a specific region of the scene image.
8. The device according to claim 6, characterized in that the text candidate box generation unit comprises:
a text candidate anchor box training subunit, configured to feed the convolutional features row by row into the recurrent neural network model and to train with each convolutional feature as one time-step input of a long short-term memory model, to obtain fixed-width text candidate anchor boxes; and
a text candidate box detection subunit, configured to regress, detect, and connect the upper and lower edges of the fixed-width text candidate anchor boxes to generate the sequence of text candidate boxes.
9. The device according to claim 8, characterized in that the text candidate box detection subunit comprises:
a supervision information acquisition module, configured to obtain supervision information for the text candidate anchor boxes, the supervision information comprising: a score indicating that a text candidate anchor box is text; a first offset distance from the text candidate anchor box to the upper boundary of the nearest text line; and a second offset distance from the text candidate anchor box to the lower boundary of the nearest text line; and
a selection and comparison module, configured to select, according to the supervision information, the text candidate anchor boxes whose score exceeds a predetermined value and, in combination with the first offset distance and the second offset distance, to generate the sequence of text candidate boxes.
10. The device according to any one of claims 6 to 9, characterized in that the text line region acquisition unit is specifically configured to:
select, according to the height differences and horizontal distances of the sequence of text candidate boxes, the horizontal distances between text boxes at the horizontal edges and the text candidate boxes at the vertical edges, and generate the text line regions.
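Claims 2 and 7 take features from the last convolutional layer of a VGG network using a predetermined sliding window. The following is a minimal pure-Python sketch of that windowing step only, assuming a 3 x 3 window, stride 1 and zero padding (the claims specify none of these). The feature map is a nested list indexed `[row][col][channel]`, the function name is hypothetical, and the VGG backbone itself is out of scope.

```python
from typing import List

# Nested-list feature map indexed as fmap[row][col][channel].
FeatureMap = List[List[List[float]]]


def sliding_window_features(fmap: FeatureMap, k: int = 3) -> FeatureMap:
    """For every spatial position, return the flattened k x k x C window centred on it.

    Illustrative sketch: the window size k, stride 1 and zero padding are
    assumptions; the claims only say "a predetermined sliding window".
    """
    if k % 2 == 0:
        raise ValueError("window size k must be odd")
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    r = k // 2
    zeros = [0.0] * c
    out: FeatureMap = []
    for i in range(h):
        row = []
        for j in range(w):
            vec: List[float] = []
            # Scan the window row-major; positions outside the map contribute zeros.
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    vec.extend(fmap[y][x] if 0 <= y < h and 0 <= x < w else zeros)
            row.append(vec)
        out.append(row)
    return out
```

Flattening each window into one vector lets every spatial position be handled as an independent feature vector by the recurrent stage that follows.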
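Claims 3 and 8 feed the convolutional features into an LSTM row by row, with each position along a row treated as one time step. The sketch below shows only the forward pass of a standard LSTM cell over each row, in plain Python so the recurrence is explicit. The gate layout (input, forget, candidate, output), the weight shapes and the function names are assumptions for illustration; training is omitted, and the pass runs left to right only (a bidirectional variant would add a right-to-left pass and concatenate the two hidden states).

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(w: Matrix, v: Sequence[float], b: Sequence[float]) -> Vector:
    # Computes w @ v + b for a row-major weight matrix w.
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def lstm_step(x: Sequence[float], h: Sequence[float], c: Sequence[float],
              W: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """One standard LSTM step. W has 4*H rows (gates i, f, g, o) and len(x)+H columns."""
    n = len(h)
    z = _affine(W, list(x) + list(h), b)
    i = [_sigmoid(v) for v in z[0:n]]             # input gate
    f = [_sigmoid(v) for v in z[n:2 * n]]         # forget gate
    g = [math.tanh(v) for v in z[2 * n:3 * n]]    # candidate cell values
    o = [_sigmoid(v) for v in z[3 * n:4 * n]]     # output gate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode_rows(features: List[List[Vector]], W: Matrix, b: Vector,
                hidden: int) -> List[List[Vector]]:
    """Run the LSTM left to right over each row; each column is one time step."""
    out = []
    for row in features:
        h, c = [0.0] * hidden, [0.0] * hidden  # state is reset at the start of every row
        states = []
        for x in row:
            h, c = lstm_step(x, h, c, W, b)
            states.append(h)
        out.append(states)
    return out
```

Because the hidden and cell states are reset for every row, context flows only horizontally along a row, which matches the row-by-row feeding described in the claims.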
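Claims 4 and 9 keep the anchors whose text score exceeds a predetermined value and use two offsets, toward the upper and lower boundaries of the nearest text line, to produce the candidate boxes. A minimal sketch, assuming the offsets are signed pixel distances added to the anchor's own upper and lower edges, a 16-pixel anchor width, and a 0.7 threshold; all three are illustrative assumptions, and the `Anchor`/`Box` classes and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Anchor:
    x: float       # left edge of the fixed-width anchor, in pixels
    top: float     # anchor's upper edge
    bottom: float  # anchor's lower edge


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def anchors_to_candidates(anchors: Sequence[Anchor], scores: Sequence[float],
                          top_offsets: Sequence[float], bottom_offsets: Sequence[float],
                          threshold: float = 0.7, width: float = 16) -> List[Box]:
    """Turn scored anchors plus upper/lower offsets into text candidate boxes.

    Assumptions for illustration: offsets are signed pixel distances added to the
    anchor's own upper/lower edges; threshold and width are example values.
    """
    if not (len(anchors) == len(scores) == len(top_offsets) == len(bottom_offsets)):
        raise ValueError("all per-anchor inputs must have the same length")
    boxes = []
    for a, s, d_top, d_bottom in zip(anchors, scores, top_offsets, bottom_offsets):
        if s <= threshold:
            continue  # keep only anchors scoring above the predetermined value
        # The horizontal extent stays at the fixed anchor width; only the y-edges move.
        boxes.append(Box(a.x, a.top + d_top, a.x + width, a.bottom + d_bottom, s))
    boxes.sort(key=lambda b: b.x1)  # left-to-right order gives the candidate sequence
    return boxes
```

Only the upper and lower edges are adjusted; keeping the width fixed is what makes each candidate a narrow vertical slice of a text line.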
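Claims 5 and 10 merge the candidate boxes into text line regions using their height differences and horizontal distances, but do not state the exact rule. One plausible reading, sketched below, links two boxes when their horizontal gap is small and their heights are similar, groups linked boxes with union-find, and returns each group's bounding rectangle. The thresholds, the function name, and the extra vertical-overlap test (added so that boxes on different lines are not merged) are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y grows downward


def group_into_lines(boxes: List[Box], max_gap: float = 50.0,
                     max_height_ratio: float = 0.3,
                     min_v_overlap: float = 0.7) -> List[Box]:
    """Link nearby, similar-height boxes and return one bounding box per text line.

    max_gap, max_height_ratio and min_v_overlap are hypothetical thresholds.
    """
    n = len(boxes)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = boxes[i], boxes[j]
            ha, hb = a[3] - a[1], b[3] - b[1]
            if min(ha, hb) <= 0:
                continue  # ignore degenerate boxes
            gap = max(a[0], b[0]) - min(a[2], b[2])  # <= 0 when the boxes touch or overlap
            height_diff = abs(ha - hb) / max(ha, hb)
            v_overlap = (min(a[3], b[3]) - max(a[1], b[1])) / min(ha, hb)
            if gap <= max_gap and height_diff <= max_height_ratio and v_overlap >= min_v_overlap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    lines = [(min(b[0] for b in g), min(b[1] for b in g),
              max(b[2] for b in g), max(b[3] for b in g)) for g in groups.values()]
    return sorted(lines, key=lambda b: (b[1], b[0]))  # top-to-bottom, then left-to-right
```

The pairwise loop is simple but quadratic in the number of candidates; sorting the boxes by x and comparing only nearby neighbours would be the usual optimisation.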
CN201610878795.5A 2016-10-08 2016-10-08 Text detection method and device for scene image Pending CN106570497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610878795.5A CN106570497A (en) 2016-10-08 2016-10-08 Text detection method and device for scene image

Publications (1)

Publication Number Publication Date
CN106570497A (en) 2017-04-19

Family

ID=58532561

Country Status (1)

Country Link
CN (1) CN106570497A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631426A (en) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Image text detection method and device
CN105654127A (en) * 2015-12-30 2016-06-08 成都数联铭品科技有限公司 End-to-end-based picture character sequence continuous recognition method
CN105678292A (en) * 2015-12-30 2016-06-15 成都数联铭品科技有限公司 Complex optical text sequence identification system based on convolution and recurrent neural network

Non-Patent Citations (1)

Title
ZHI TIAN et al.: "Detecting Text in Natural Image with Connectionist Text Proposal Network", COMPUTER VISION - ECCV 2016 *

Cited By (42)

Publication number Priority date Publication date Assignee Title
CN107423756A (en) * 2017-07-05 2017-12-01 武汉科恩斯医疗科技有限公司 Magnetic resonance image sequence classification method based on a deep convolutional neural network combined with a long short-term memory model
CN107545262A (en) * 2017-07-31 2018-01-05 华为技术有限公司 Method and device for detecting text in a natural scene image
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 Scene semantic segmentation method based on a fully convolutional network and long short-term memory units
US11030471B2 (en) 2017-09-25 2021-06-08 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN109871843B (en) * 2017-12-01 2022-04-08 北京搜狗科技发展有限公司 Character recognition method and device for character recognition
CN109871843A (en) * 2017-12-01 2019-06-11 北京搜狗科技发展有限公司 Character identifying method and device, the device for character recognition
CN109993040B (en) * 2018-01-03 2021-07-30 北京世纪好未来教育科技有限公司 Text recognition method and device
CN109993040A (en) * 2018-01-03 2019-07-09 北京世纪好未来教育科技有限公司 Text recognition method and device
CN108154145A (en) * 2018-01-24 2018-06-12 北京地平线机器人技术研发有限公司 The method and apparatus for detecting the position of the text in natural scene image
CN108154145B (en) * 2018-01-24 2020-05-19 北京地平线机器人技术研发有限公司 Method and device for detecting position of text in natural scene image
CN109740585A (en) * 2018-03-28 2019-05-10 北京字节跳动网络技术有限公司 Text localization method and device
CN108564084A (en) * 2018-05-08 2018-09-21 北京市商汤科技开发有限公司 Character detection method, device, terminal and storage medium
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 Scene text extraction method based on an improved fully convolutional network
CN108764228A (en) * 2018-05-28 2018-11-06 嘉兴善索智能科技有限公司 Method for detecting text objects in an image
CN108694393A (en) * 2018-05-30 2018-10-23 深圳市思迪信息技术股份有限公司 Method for extracting text regions from certificate images based on deep convolution
CN108846379A (en) * 2018-07-03 2018-11-20 南京览笛信息科技有限公司 Face list recognition methods, system, terminal device and storage medium
CN109344822A (en) * 2018-09-03 2019-02-15 电子科技大学 Scene text detection method based on a long short-term memory network
CN109344822B (en) * 2018-09-03 2022-06-03 电子科技大学 Scene text detection method based on long-term and short-term memory network
CN110245545A (en) * 2018-09-26 2019-09-17 浙江大华技术股份有限公司 Character recognition method and device
CN111127593B (en) * 2018-10-30 2023-10-31 珠海金山办公软件有限公司 Document content erasing method and device, electronic equipment and readable storage medium
CN111127593A (en) * 2018-10-30 2020-05-08 珠海金山办公软件有限公司 Document content erasing method and device, electronic equipment and readable storage medium
CN111222589B (en) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN109685055B (en) * 2018-12-26 2021-11-12 北京金山数字娱乐科技有限公司 Method and device for detecting text area in image
CN109685055A (en) * 2018-12-26 2019-04-26 北京金山数字娱乐科技有限公司 Method and device for detecting text regions in an image
CN111612003B (en) * 2019-02-22 2024-08-20 北京京东尚科信息技术有限公司 Method and device for extracting text in picture
CN111612003A (en) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 Method and device for extracting text in picture
CN111695377A (en) * 2019-03-13 2020-09-22 杭州海康威视数字技术股份有限公司 Text detection method and device and computer equipment
CN111695377B (en) * 2019-03-13 2023-09-29 杭州海康威视数字技术股份有限公司 Text detection method and device and computer equipment
CN109948615A (en) * 2019-03-26 2019-06-28 中国科学技术大学 Multi-language text detection and recognition system
CN109948615B (en) * 2019-03-26 2021-01-26 中国科学技术大学 Multi-language text detection and recognition system
CN110008900A (en) * 2019-04-02 2019-07-12 北京市遥感信息研究所 Region-to-target method for extracting candidate targets from visible-light remote sensing images
CN110008900B (en) * 2019-04-02 2023-12-12 北京市遥感信息研究所 Method for extracting candidate target from visible light remote sensing image from region to target
CN110084240A (en) * 2019-04-24 2019-08-02 网易(杭州)网络有限公司 Text input system, method, medium and computing device
CN110110777A (en) * 2019-04-28 2019-08-09 网易有道信息技术(北京)有限公司 Image processing method, training method and device, medium and computing device
CN110276351A (en) * 2019-06-28 2019-09-24 中国科学技术大学 Multilingual scene text detection and recognition methods
CN110276351B (en) * 2019-06-28 2022-09-06 中国科学技术大学 Multi-language scene text detection and identification method
CN112070784B (en) * 2020-09-15 2022-07-01 桂林电子科技大学 Perception edge detection method based on context enhancement network
CN112070784A (en) * 2020-09-15 2020-12-11 桂林电子科技大学 Perception edge detection method based on context enhancement network
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106570497A (en) Text detection method and device for scene image
Nakamura et al. Scene text eraser
Paris et al. A topological approach to hierarchical segmentation using mean shift
US20180114071A1 (en) Method for analysing media content
Tang et al. Deeply-supervised recurrent convolutional neural network for saliency detection
CN110222686B (en) Object detection method, object detection device, computer equipment and storage medium
US11853892B2 (en) Learning to segment via cut-and-paste
WO2021115345A1 (en) Image processing method and apparatus, computer device, and storage medium
CN107945153A (en) A kind of road surface crack detection method based on deep learning
CN107871125A (en) Architecture against regulations recognition methods, device and electronic equipment
CN112488229B (en) Domain self-adaptive unsupervised target detection method based on feature separation and alignment
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
WO2020077940A1 (en) Method and device for automatic identification of labels of image
CN103810503A (en) Depth study based method for detecting salient regions in natural image
CN104008401A (en) Method and device for image character recognition
Li et al. Fast and effective text detection
CN110516676A (en) A kind of bank's card number identifying system based on image procossing
Zhang et al. Deep salient object detection by integrating multi-level cues
Karaoglu et al. Con-text: text detection using background connectivity for fine-grained object classification
Zhang et al. Multi-scale salient object detection with pyramid spatial pooling
CN105404682B (en) Book retrieval method based on digital image content
CN108280388A Method and device for training a face detection model, and face detection method and device
CN109409224A Fire detection method for natural scenes
CN111652288B (en) Improved SSD small target detection method based on dense feature pyramid
Gui et al. A fast caption detection method for low quality video images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419