CN107545262A

CN107545262A - Method and device for detecting text in a natural scene image

Info

Publication number: CN107545262A
Application number: CN201710642311.1A
Authority: CN
Inventors: 王凯; 陈院林; 乔宇; 贺通
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-07-31
Filing date: 2017-07-31
Publication date: 2018-01-05
Anticipated expiration: 2037-07-31
Also published as: CN107545262B

Abstract

A method and device for detecting text in a natural scene image, intended to solve the prior-art problem of low precision when detecting text in natural scene images of varying complexity. The method includes: obtaining a natural scene image; performing a convolution operation on the obtained natural scene image by means of an FCN model to obtain convolution features of the natural scene image; determining, according to the convolution features of the natural scene image, a sequence of text candidate regions contained in the natural scene image; and, for each text candidate region in the sequence, performing: extracting the convolution features of the text candidate region through a region-of-interest pooling layer, converting those convolution features into a feature vector of fixed dimension k through feature transformation, and determining, according to a recurrent network model and the feature vector of fixed dimension k, the positions of the text lines contained in the text candidate region, where k is a positive integer.

Description

Method and device for detecting text in a natural scene image

Technical field

The present application relates to the technical field of text detection, and in particular to a method and device for detecting text in a natural scene image.

Background

A natural scene image is an image captured directly of a real-life scene by any capture device (for example, a camera or a mobile phone with a camera function) without specific constraints. Text in natural scene images can provide rich semantic information; for example, recognizing the text on streets, license plates, menus and the like in a natural scene image helps people understand the scene easily. It is therefore necessary to detect text in natural scene images accurately. However, because of factors such as the varying fonts, colors and layouts of text in natural scene images and highly cluttered backgrounds, detecting text in natural scene images is a challenging task.

At present, methods for detecting text in natural scene images fall into two broad categories: detection methods based on sliding windows and detection methods based on connected components. Specifically:

A sliding-window detection method works as follows: the original natural scene image is scanned with sliding windows of different scales to obtain a series of sub-regions that may contain text; texture features are extracted from these sub-regions; a classifier is trained on the extracted texture features; and the classifier verifies whether each sub-region contains text. Because the multi-scale sliding windows are moved across the natural scene image at a fixed stride to extract sub-regions, extraction is extremely time-consuming, and because low-level texture features are used to verify whether a sub-region contains text, the detection results are poor.

A connected-component detection method works as follows: connected regions are extracted from the natural scene image using features such as the color of character pixels and the stroke width of characters; the features of the connected regions are analyzed; text strings are obtained by applying character-merging rules; and the strings are verified and non-text is removed to obtain the final detection result. This method is suitable only for natural scene images with relatively simple backgrounds.

Both methods distinguish text from background in a natural scene image using low-level features, such as character stroke width and image texture, so their detection precision is low. How to detect text accurately in natural scene images of varying complexity is therefore a problem that urgently needs to be solved.

Summary of the invention

The present application provides a method and device for detecting text in a natural scene image, to solve the prior-art problem of low precision when detecting text in natural scene images of varying complexity.

In a first aspect, the present application provides a method for detecting text in a natural scene image. In the method, a natural scene image is first obtained; a convolution operation is performed on the obtained natural scene image by means of a fully convolutional network (FCN) model to obtain the convolution features of the natural scene image; a sequence of text candidate regions contained in the natural scene image is determined according to those convolution features; and, for each text candidate region in the sequence, the following is performed: the convolution features of the text candidate region are extracted through a region-of-interest pooling layer (roi-pooling) and converted, through feature transformation, into a feature vector of fixed dimension k, and the positions of the text lines contained in the text candidate region are determined according to a recurrent network model and the feature vector of fixed dimension k, where k is a positive integer. Each text candidate region in the sequence contains at least one text line, a text line being a single line of text contained in the text candidate region.

In the embodiments of the present application, the sequence of text candidate regions in the natural scene image is coarsely detected by the FCN model, and for each text candidate region in the sequence the recurrent network model determines the positions of the text lines it contains, thereby achieving precise detection of text in the natural scene image. Compared with prior-art methods that distinguish text from background using low-level features, the method of detecting text with an FCN model fused with a recurrent network model no longer relies on low-level features such as character stroke width and image texture. Instead, it uses the deep learning capability of the FCN model and the recurrent network model to make full use of the contextual information in the natural scene image and the semantic information of the text, so that the positions of the text lines contained in the natural scene image can be determined accurately.

In one possible design, determining the sequence of text candidate regions contained in the natural scene image according to the convolution features of the natural scene image includes: fusing the convolution features of the natural scene image to determine the convolution features that characterize text positions in the natural scene image; mapping the natural scene image using the convolution features that characterize text positions, and labeling the text positions and the non-text positions in the natural scene image; and determining at least one region labeled as a text position in the natural scene image as the sequence of text candidate regions.

In the above design, the sequence of text candidate regions contained in the natural scene image is determined from the convolution features extracted by the FCN. Relying on the pixel-level learning capability of the FCN, the design makes full use of the contextual information in the natural scene image and the semantic information of the text to separate text from background and thereby determine the sequence of text candidate regions. Compared with prior-art methods that distinguish text from background using low-level features such as character stroke width and image texture, the method of determining the sequence of text candidate regions provided by the embodiments of the present application can determine the positions of the text lines contained in the natural scene image more accurately.

In the embodiments of the present application, the text candidate regions in the sequence determined from the convolution features of the natural scene image differ in size. So that the recurrent network model can subsequently process the sequence of text candidate regions in a uniform way, the sequence is normalized through roi-pooling, which converts the convolution features of each text candidate region into a feature vector of fixed dimension k. After this conversion, the positions of the text lines contained in the text candidate region are determined according to the recurrent network model and the feature vector of fixed dimension k.

In one possible design, the recurrent network model includes N layers of long short-term memory (LSTM), where N is a positive integer greater than or equal to the maximum number of text lines, and the maximum number of text lines is the number of text lines in the text candidate region that contains the most text lines among the sequence of text candidate regions.

Where the recurrent network model includes N LSTM layers, determining the positions of the text lines contained in the text candidate region according to the recurrent network model and the feature vector of fixed dimension k specifically includes: using the feature vector of fixed dimension k as the time-frame input of the N LSTM layers and feeding it layer by layer into the LSTMs of the recurrent network model, where the feature vector alone is first input into the first LSTM layer, and thereafter the output of the preceding LSTM layer together with the feature vector of fixed dimension k is input into the next LSTM layer each time; training the recurrent network model with the feature vector of fixed dimension k and pre-annotated text positions to obtain text line candidate boxes; regressing, detecting and connecting the top, bottom, left and right edges of each text line candidate box to determine its tilt angle; and determining the positions of the text lines contained in the text candidate region according to the text line candidate boxes and their tilt angles.

In the above design, the feature vector of fixed dimension k is fed layer by layer into the N LSTM layers of the recurrent network model; every LSTM layer after the first also receives the detection result of the preceding LSTM layer. With this N-layer recurrent LSTM design, when the recurrent network model determines the current text line candidate box it uses the information of the text line candidate box determined by the preceding LSTM, which makes the current candidate box more accurate. Furthermore, because the N-layer LSTM recurrent network model determines the tilt angle of each text line candidate box, tilted text can be detected.

In one possible design, after the positions of the text lines contained in the text candidate region are determined, the method further includes:

matching, by means of a matching algorithm, the determined positions of the text lines contained in the text candidate region against pre-annotated text positions, and determining the text line that best matches each pre-annotated text position; and determining, by means of an error algorithm, the error between the best-matching text line and the annotated text position, and updating the network parameters of the FCN model and the recurrent network model according to that error.

In the above design, the matching algorithm determines the text line that best matches each pre-annotated text position, and in doing so ensures that only one best-matching position is retained for any given text line. The text lines detected in the natural scene image are therefore more accurate.

In one possible design, N may be set to 5; it may of course be set to other values.

In a second aspect, the present application provides a device for detecting text in a natural scene image. The device has the function of implementing the method of the first aspect; the function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function, and each module may be software and/or hardware.

In a third aspect, the present application provides an apparatus that may include a memory and a processor. The memory is configured to store a program, and the processor is configured to execute the program in the memory, so as to perform the method for detecting text in a natural scene image according to the first aspect or any possible design of the first aspect.

In a fourth aspect, the present application further provides a computer-readable storage medium storing instructions which, when invoked and executed by a computer, cause the computer to perform the method for detecting text in a natural scene image according to the first aspect or any possible design of the first aspect.

In a fifth aspect, the present application provides a computer program product which, when invoked and executed by a computer, performs the method for detecting text in a natural scene image according to the first aspect or any possible design of the first aspect.

Brief description of the drawings

Fig. 1 is a schematic diagram of a network structure for detecting text in a natural scene image provided by the present application;

Fig. 2 is a flowchart of a method for detecting text in a natural scene image provided by the present application;

Fig. 3 is a flowchart of a method for determining the sequence of text candidate regions contained in a natural scene image provided by the present application;

Fig. 4 is a flowchart of a method for determining the positions of the text lines contained in a text candidate region provided by the present application;

Fig. 5 is a schematic diagram of another network structure for detecting text in a natural scene image provided by the present application;

Fig. 6 is a schematic diagram of text line matching provided by the present application;

Fig. 7 is a schematic diagram of a device for detecting text in a natural scene image provided by the present application;

Fig. 8 is a schematic diagram of an apparatus for detecting text in a natural scene image provided by the present application.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present application clearer, the embodiments of the present application are described below with reference to the accompanying drawings.

The embodiments of the present application provide a method and device for detecting text in a natural scene image, to solve the prior-art problem of low precision when detecting text in natural scene images of varying complexity. The method and the device are based on the same inventive concept. Because they solve the problem on similar principles, the implementations of the device and the method may refer to each other, and repeated content is not described again.

The method and device for detecting text in a natural scene image provided by the embodiments of the present application can be applied to any equipment that detects text in natural scene images, for example a computer, a tablet computer, a smartphone or a server.

The fields of application of the embodiments of the present application include, but are not limited to, detecting text in natural scene images, detecting small text-like objects in natural scene images, and detecting other types of objects.

Fig. 1 shows a schematic diagram of a network structure for detecting text in a natural scene image provided by the embodiments of the present application. Referring to Fig. 1, the network structure includes an FCN model, a recurrent network model and roi-pooling. The FCN model obtains a natural scene image and processes it to obtain the sequence of text candidate regions in the natural scene image; the sequence of text candidate regions is processed through roi-pooling to obtain feature vectors of fixed dimension; and the positions of the text lines contained in each text candidate region are determined according to the recurrent network model and the feature vector of fixed dimension.

It should be noted that the network structure used by the embodiments of the present application to detect text in a natural scene image includes, but is not limited to, the network structure shown in Fig. 1.

In the embodiments of the present application, the FCN model may be reconstructed from an existing convolutional neural network structure, and the convolutional neural network structure used to construct the FCN model is not limited. For example, the FCN model may be constructed from the ResNet-101 network structure of deep residual networks (ResNet): specifically, the fully connected layer in the ResNet-101 architecture is replaced with a deconvolution layer, and a suitable number of convolutional layers and pooling layers is chosen according to the practical application. An FCN model reconstructed from a convolutional neural network structure consists of convolutional layers and pooling layers and no longer contains a fully connected layer, so the input image may be of any size, low-resolution spatial position information is retained, and end-to-end pixel-level prediction can be achieved.
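The point that a network without a fully connected layer accepts inputs of any size can be illustrated with a small sketch. The code below is not the ResNet-101-based FCN described above; it is a toy stand-in, written in plain Python, with one hand-picked 3x3 kernel and one 2x2 pooling step. The function names (`conv2d_valid`, `max_pool2x2`, `tiny_fcn`) and the kernel weights are assumptions made purely for illustration.

```python
def conv2d_valid(image, kernel):
    """Plain 2-D 'valid' convolution (cross-correlation) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling; an odd trailing row/column is dropped."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


def tiny_fcn(image):
    """Toy conv + pool stack with no fully connected layer (illustrative only)."""
    edge_kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # hypothetical weights
    return max_pool2x2(conv2d_valid(image, edge_kernel))


def shape(fmap):
    return (len(fmap), len(fmap[0]))


# Two inputs of different sizes pass through the same layers unchanged.
small = [[(r * c) % 7 for c in range(10)] for r in range(8)]
large = [[(r + c) % 5 for c in range(33)] for r in range(21)]
print(shape(tiny_fcn(small)), shape(tiny_fcn(large)))
```

The output feature map simply scales with the input, which is what allows pixel-level prediction over the whole image; a fully connected layer would instead fix the input size.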

Fig. 2 shows a flowchart of a method for detecting text in a natural scene image provided by the embodiments of the present application. Referring to Fig. 2, the method includes:

S101: Obtain a natural scene image. In the embodiments of the present application, a natural scene image is an image captured directly of a real-life scene by any capture device (for example, a camera or a mobile phone with a camera function) without specific constraints.

It should be noted that the ways of obtaining a natural scene image include, but are not limited to, capturing the natural scene image with a sensing device and retrieving it from a database in which natural scene images have been stored in advance. The sensing device includes, but is not limited to, an optical fiber sensing device, a camera device and an acquisition device. The database includes, but is not limited to, a local database, a cloud database, a USB flash drive and a hard disk.

S102: Perform a convolution operation on the natural scene image by means of the FCN model to obtain the convolution features of the natural scene image. In the embodiments of the present application, the constructed FCN model performs the convolution operation on the obtained natural scene image to produce its convolutional layers, and the deconvolution layer of the FCN model yields the convolution features of the last convolutional layer, which are taken as the convolution features of the natural scene image.

S103: Determine, according to the convolution features of the natural scene image, the sequence of text candidate regions contained in the natural scene image. Each text candidate region in the sequence contains at least one text line, a text line being a single line of text contained in the text candidate region.

In the embodiments of the present application, each text candidate region determined from the convolution features of the natural scene image contains at least one text line, whereas the ultimate goal of detecting text in a natural scene image is to output every individual text line. To determine accurately the at least one text line contained in a text candidate region, operations S104 and S105 below may be performed for each text candidate region in the sequence.

S104: Extract the convolution features of the text candidate region through roi-pooling, and convert them into a feature vector of fixed dimension k through feature transformation, where k is a positive integer. In the embodiments of the present application, every later occurrence of "feature vector of fixed dimension k" has the same meaning as here.

In the embodiments of the present application, the text candidate regions in the sequence determined from the convolution features of the natural scene image differ in size. So that the recurrent network model can subsequently process the sequence of text candidate regions in a uniform way, the sequence is normalized through roi-pooling, which converts the convolution features of each text candidate region into a feature vector of fixed dimension k.
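The normalization performed in S104 can be sketched as follows. This is a minimal illustration, assuming a single-channel feature map, max pooling inside each bin, and a "feature transformation" that is nothing more than flattening the pooled grid; the patent does not specify these details, and the function name `roi_pool` and the box format are hypothetical.

```python
def roi_pool(feature_map, box, bins_h, bins_w):
    """Max-pool the region `box` of a 2-D feature map into bins_h x bins_w cells.

    box = (top, left, bottom, right), with bottom/right exclusive.
    Returns a flat list of length k = bins_h * bins_w, whatever the box size.
    """
    top, left, bottom, right = box
    height, width = bottom - top, right - left
    pooled = []
    for i in range(bins_h):
        # Row range of bin i; every bin covers at least one row.
        r0 = top + (i * height) // bins_h
        r1 = max(r0 + 1, top + ((i + 1) * height) // bins_h)
        for j in range(bins_w):
            # Column range of bin j; every bin covers at least one column.
            c0 = left + (j * width) // bins_w
            c1 = max(c0 + 1, left + ((j + 1) * width) // bins_w)
            pooled.append(max(feature_map[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled


# A 6 x 8 feature map whose value encodes its own position (row * 10 + column).
feature_map = [[r * 10 + c for c in range(8)] for r in range(6)]

region_a = (0, 0, 4, 6)   # 4 x 6 candidate region
region_b = (1, 2, 6, 8)   # 5 x 6 candidate region

vec_a = roi_pool(feature_map, region_a, bins_h=2, bins_w=3)
vec_b = roi_pool(feature_map, region_b, bins_h=2, bins_w=3)
print(len(vec_a), len(vec_b))  # both have k = 6 entries
```

Both regions yield a vector of the same length k even though their sizes differ, which is the property the recurrent network model relies on.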

S105: Determine, according to the recurrent network model and the feature vector of fixed dimension k, the positions of the text lines contained in the text candidate region.

The embodiments of the present application are based on FCN and recurrent network techniques. A network structure in which an FCN model is fused with a recurrent network model is designed to detect text in natural scene images. The FCN model and the recurrent network model learn effective feature representations from a large number of natural scene image training samples, and a fused network capable of detecting text lines in natural scene images is trained. Specifically, the FCN model coarsely detects the sequence of text candidate regions in the natural scene image, and for each text candidate region in the sequence the recurrent network model determines the positions of the text lines it contains, thereby achieving precise detection of text in the natural scene image. Compared with prior-art methods that distinguish text from background using low-level features, this method no longer relies on low-level features such as character stroke width and image texture. Instead, it uses the deep learning capability of the FCN model and the recurrent network model to make full use of the contextual information in the natural scene image and the semantic information of the text, so that the positions of the text lines contained in the natural scene image can be determined precisely.

Referring to Fig. 3, the process of determining the sequence of text candidate regions contained in the natural scene image according to its convolution features is described in detail below:

S201: Fuse the convolution features of the natural scene image to determine the convolution features that characterize text positions in the natural scene image.

In the embodiments of the present application, the convolution features of the natural scene image extracted by the FCN model may include features of the natural scene image in multiple dimensions (for example, 1024 dimensions). To determine the sequence of text candidate regions, the convolution features of the natural scene image are fused to determine the convolution features that characterize text positions in the natural scene image.

S202: Map the natural scene image using the convolution features that characterize text positions, and label the text positions and non-text positions in the natural scene image by means of a classification function. The embodiments of the present application do not limit the classification function used for this labeling; for example, it may be a logistic function or a softmax function.
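The labeling in S202 can be sketched with a softmax classifier applied independently at each pixel. The sketch assumes each pixel carries two scores (non-text, text) and that a pixel is labeled as text when its text probability exceeds 0.5; the score layout, the threshold and the function names are illustrative assumptions, not details given in the patent.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map, threshold=0.5):
    """score_map[r][c] = (non_text_score, text_score); returns 1 for text, 0 otherwise."""
    return [
        [1 if softmax(list(pixel))[1] > threshold else 0 for pixel in row]
        for row in score_map
    ]


# Hypothetical per-pixel scores for a 2 x 2 image.
scores = [
    [(2.0, -1.0), (0.1, 1.5)],
    [(-0.5, 3.0), (1.0, 0.2)],
]
print(label_pixels(scores))  # prints [[0, 1], [1, 0]]
```

With only two classes, a logistic function applied to the difference of the two scores gives the same labels, which is consistent with the text naming both options.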

S203: Determine at least one region labeled as a text position in the natural scene image as the sequence of text candidate regions.
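One way to turn the pixel labels into a sequence of candidate regions is to group adjacent text pixels and take the bounding box of each group. The patent does not say how labeled pixels are grouped; the 4-connected flood fill below is an assumption chosen for illustration, as are the function name and the inclusive box format.

```python
from collections import deque


def candidate_regions(mask):
    """Bounding boxes (top, left, bottom, right), inclusive, of 4-connected text pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited text pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# 1 = labeled as text position, 0 = labeled as non-text position.
mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
]
print(candidate_regions(mask))  # prints [(0, 1, 1, 2), (2, 4, 4, 5)]
```

Each returned box may still cover several text lines, which is why the later recurrent stage is needed to separate them.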

In the embodiments of the present application, a natural scene image may contain multiple pieces of text, so the sequence of text candidate regions determined in the natural scene image may include multiple text candidate regions, each containing at least one text line. The ultimate goal of detecting text in a natural scene image, however, is to output text regions consisting of individual text lines. Therefore, after the sequence of text candidate regions is determined by the FCN model, the convolution features of each text candidate region are extracted through roi-pooling and converted into a feature vector of fixed dimension k through feature transformation (see S104), and the positions of the text lines contained in the text candidate region are then determined according to the recurrent network model and the feature vector of fixed dimension k.

In the embodiments of the present application, the recurrent network model may include N LSTM layers, where N is a positive integer greater than or equal to the maximum number of text lines, and the maximum number of text lines is the number of text lines in the text candidate region that contains the most text lines among the sequence of text candidate regions. For example, suppose four text candidate regions are determined in the natural scene image, denoted text candidate region A, text candidate region B, text candidate region C and text candidate region D, and counting the text lines in these four regions shows that they contain 2, 3, 1 and 2 text lines respectively. N is then set to a positive integer greater than or equal to 3.
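The worked example above reduces to taking a maximum. The short sketch below reproduces it; the helper name `minimum_lstm_layers` and the dictionary of per-region counts are illustrative assumptions.

```python
def minimum_lstm_layers(line_counts):
    """Smallest admissible N: the largest text-line count over all candidate regions."""
    return max(line_counts.values())


# Text-line counts of candidate regions A to D from the example above.
line_counts = {"A": 2, "B": 3, "C": 1, "D": 2}
n_min = minimum_lstm_layers(line_counts)
print(n_min)  # prints 3, so any N >= 3 is admissible
```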

Taking a recurrent network model that includes N LSTM layers as an example, the process of determining the positions of the text lines contained in a text candidate region according to the recurrent network model and the feature vector of fixed dimension k is described in detail below with reference to Fig. 4:

S301: Use the feature vector of fixed dimension k as the time-frame input of the N LSTM layers, and feed it layer by layer into the LSTMs of the recurrent network model.

The feature vector of fixed dimension k alone is first input into the first LSTM layer of the recurrent network model; thereafter, the output of the preceding LSTM layer together with the feature vector of fixed dimension k is input into the next LSTM layer each time. The recurrent network model is trained with the feature vector of fixed dimension k and pre-annotated text positions to obtain text line candidate boxes.

In the embodiments of the present application, the feature vector of fixed dimension k is fed layer by layer into the N LSTM layers of the recurrent network model; every LSTM layer after the first also receives the detection result of the preceding LSTM layer. With this N-layer LSTM design, when the recurrent network model determines the current text line candidate box it uses the information of the previously determined text line candidate box, which makes the current candidate box more accurate.
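The data flow of S301 can be sketched as an LSTM cell applied N times, where every application receives the same feature vector of fixed dimension k and the state left by the previous application. The sketch uses a standard LSTM cell in plain Python with random, untrained weights that are shared across the N applications; weight sharing, the hidden size, the zero initial state and all function names are assumptions for illustration, and the output head that would map each hidden state to a text line candidate box is omitted.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_weights(input_dim, hidden_dim, seed=0):
    """Random illustrative weights for the four LSTM gates (i, f, o, g)."""
    rng = random.Random(seed)
    cols = input_dim + hidden_dim + 1  # +1 for the bias term
    return {gate: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_dim)]
            for gate in "ifog"}


def lstm_step(x, h, c, weights):
    """One standard LSTM step on plain Python lists."""
    z = x + h + [1.0]  # concatenate input, previous output and bias input

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) for row in weights[gate]]

    i = [sigmoid(v) for v in affine("i")]     # input gate
    f = [sigmoid(v) for v in affine("f")]     # forget gate
    o = [sigmoid(v) for v in affine("o")]     # output gate
    g = [math.tanh(v) for v in affine("g")]   # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def run_n_steps(feature, n_steps, hidden_dim=4, seed=0):
    """Feed the same feature vector at every step; carry the state forward."""
    weights = make_weights(len(feature), hidden_dim, seed)
    h = [0.0] * hidden_dim  # the first step sees only the feature vector
    c = [0.0] * hidden_dim
    outputs = []
    for _ in range(n_steps):
        h, c = lstm_step(feature, h, c, weights)
        outputs.append(h)   # one output per potential text line
    return outputs


feature_k = [0.2, -0.1, 0.7]          # hypothetical feature vector, k = 3
outputs = run_n_steps(feature_k, n_steps=5)
print(len(outputs), len(outputs[0]))  # 5 outputs, each of hidden size 4
```

Because each step sees the state produced by the previous one, the representation behind the second candidate box depends on the first, which is the dependency the paragraph above describes.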

S302: Regress, detect and connect the top, bottom, left and right edges of each text line candidate box to determine the tilt angle of the text line candidate box.

In the embodiments of the present application, the recurrent network model regresses, detects and connects the top, bottom, left and right edges of each text line candidate box, from which the tilt angle of the candidate box can be determined, so the method provided by the embodiments of the present application supports the detection of tilted text. Compared with prior-art detection methods that locate tilted text with rectangular text candidate boxes, the method provided by the embodiments of the present application improves the positioning accuracy for tilted text.
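Once the edges of a candidate box have been regressed, a tilt angle can be read off geometrically. The patent does not give a formula; the sketch below assumes each of the left and right edges is available as a pair of end points and takes the angle of the line joining their midpoints. The function name and the edge format are hypothetical.

```python
import math


def tilt_angle_degrees(left_edge, right_edge):
    """Angle of the line joining the midpoints of the left and right edges.

    Each edge is ((x_top, y_top), (x_bottom, y_bottom)) in image coordinates,
    where y grows downwards, so a positive angle means the line of text
    descends from left to right on screen.
    """
    (lx0, ly0), (lx1, ly1) = left_edge
    (rx0, ry0), (rx1, ry1) = right_edge
    left_mid = ((lx0 + lx1) / 2.0, (ly0 + ly1) / 2.0)
    right_mid = ((rx0 + rx1) / 2.0, (ry0 + ry1) / 2.0)
    return math.degrees(math.atan2(right_mid[1] - left_mid[1],
                                   right_mid[0] - left_mid[0]))


# A horizontal box and a box whose right end sits 100 pixels lower.
flat = tilt_angle_degrees(((0, 0), (0, 10)), ((100, 0), (100, 10)))
tilted = tilt_angle_degrees(((0, 0), (0, 10)), ((100, 100), (100, 110)))
print(flat, round(tilted, 6))
```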

S303: Determine the positions of the text lines contained in the text candidate region according to the text line candidate boxes and their tilt angles.

In the embodiments of the present application, the N LSTMs of the recurrent network model determine the positions of the individual text lines in a text candidate region one by one and, combined with the features of the text candidate region extracted by the FCN network, achieve precise detection of the text lines and their tilt angles.

In practical tests, a text candidate region determined in a natural scene image usually contains no more than 4 text lines. Therefore, in one possible design of the embodiments of the present application, the number of LSTM layers in the recurrent network model is set to 5, that is, N is set to 5, which ensures that the recurrent network model designed in the embodiments of the present application can determine the text line positions in all text candidate regions. The network structure with the number of LSTM layers N set to 5 is shown in Fig. 5.

It should be noted that if the number of text lines included in a text candidate region is less than N, then after the first M LSTM layers (M being a positive integer less than N) have determined the positions of all text lines included in the text candidate region, the remaining N minus M LSTM layers output null values.
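The null-output behavior can be pictured as a fixed-length list of N slots, of which only the first M are filled. The sketch below is illustrative only: `None` stands in for the null value, and the function names are hypothetical.

```python
def pad_layer_outputs(detected_positions, num_layers=5):
    """Return a list of num_layers slots: the detected positions followed
    by None for each layer that has no text line left to report."""
    if len(detected_positions) > num_layers:
        raise ValueError("more text lines than LSTM layers")
    missing = num_layers - len(detected_positions)
    return list(detected_positions) + [None] * missing


def drop_null_outputs(layer_outputs):
    """Keep only the slots that actually hold a text-line position."""
    return [p for p in layer_outputs if p is not None]
```

For example, a region with three text lines and N set to 5 yields three positions followed by two null slots.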

In the embodiments of the present application, the text-line positions determined by the N LSTM layers are not necessarily output in order. For example, if the text candidate region includes 3 single lines of text, the positions determined by the N LSTM layers may be output in the order second line, first line, third line, whereas the desired output order is first line, second line, third line. The text-line positions determined by the N LSTM layers may also contain false detections. For example, the text candidate region actually includes three text lines, but the N LSTM layers determine four text-line positions. To address these problems, after the positions of the text lines included in the text candidate region are determined, the embodiments use a matching algorithm to match the determined positions against the pre-labeled text positions and determine the text lines that best match the pre-labeled text positions. An error algorithm then determines the error between each best-matching text line and the labeled text position, and the network parameters of the entire fused network are updated according to that error.

In the embodiments of the present application, a matching algorithm matches the positions of the text lines included in the determined text candidate region against the pre-labeled text positions and determines the text lines that best match the pre-labeled text positions. During this process, only the single best-matching position is retained for any one text line. Specifically, a matching score can be set to represent the degree of match between the position of a text line included in the text candidate region and a pre-labeled text position: the higher the matching score, the better the match. Positions whose matching score is below a preset threshold are filtered out, which yields the text lines that best match the pre-labeled text positions.
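The source does not define the matching score. A minimal sketch, assuming intersection-over-union (IoU) of axis-aligned boxes as the score and 0.5 as the preset threshold (both assumptions), shows how low-scoring positions are filtered out and a single best position is kept per labeled text line. The names `iou` and `best_match_per_label` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def best_match_per_label(predicted, labeled, threshold=0.5):
    """For each labeled box, return the index of the best-scoring predicted
    box, or None when no predicted box reaches the threshold."""
    result = []
    for gt in labeled:
        best_idx, best_score = None, threshold
        for i, box in enumerate(predicted):
            score = iou(box, gt)
            if score >= best_score:  # scores below the threshold never win
                best_idx, best_score = i, score
        result.append(best_idx)
    return result
```

This greedy selection does not by itself prevent two labeled lines from choosing the same predicted box. A one-to-one assignment needs a bipartite matching step.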

The process of matching the positions of the text lines included in the determined text candidate region against the pre-labeled text positions is illustrated below with an example. Referring to Fig. 6, assume that the text candidate region currently being detected includes two text lines, represented by solid boxes in Fig. 6. The text-line positions that the N LSTM layers currently determine for the text candidate region are represented by dashed boxes: dashed boxes 1, 2, 3, and 4 in Fig. 6. The matching algorithm matches dashed boxes 1, 2, 3, and 4 against the solid boxes and determines the dashed boxes that best match the solid boxes. In Fig. 6, dashed boxes 2 and 4 are finally determined to match the solid boxes best, so the text lines in the natural scene image can be determined from the text-line positions corresponding to dashed boxes 2 and 4.

In the embodiments of the present application, the matching algorithm used to match the positions of the text lines included in the determined text candidate region against the pre-labeled text positions is not limited. For example, it may be the Hungarian algorithm (hungary-loss). The Hungarian algorithm finds a maximum matching of a bipartite graph by means of augmenting paths and can effectively determine the text-line positions that best match the pre-labeled text positions.
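The augmenting-path idea can be sketched as follows. This is the unweighted form of the algorithm: it assumes the acceptable predicted boxes for each labeled text line have already been chosen (for instance by a score threshold), and it only finds a maximum one-to-one matching. Taking the scores themselves into account would need a weighted assignment method instead. The function name and input layout are hypothetical.

```python
def max_bipartite_matching(candidates, num_predicted):
    """Maximum bipartite matching by augmenting paths.

    candidates[g] lists the predicted-box indices acceptable for labeled
    text line g. Returns a dict mapping labeled index -> predicted index.
    """
    match_of_pred = [None] * num_predicted  # predicted index -> labeled index

    def try_assign(g, visited):
        for p in candidates[g]:
            if p in visited:
                continue
            visited.add(p)
            # Take p if it is free, or if its current owner can move elsewhere.
            if match_of_pred[p] is None or try_assign(match_of_pred[p], visited):
                match_of_pred[p] = g
                return True
        return False

    for g in range(len(candidates)):
        try_assign(g, set())
    return {g: p for p, g in enumerate(match_of_pred) if g is not None}
```

With four predicted boxes and two labeled lines, as in the Fig. 6 example, calling `max_bipartite_matching([[3, 1], [3]], 4)` first gives box index 3 to line 0, then reassigns line 0 to box index 1 so that line 1 can take box index 3.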

In the embodiments of the present application, the error between each best-matching text line and the labeled text position is determined and used to adjust the network parameters of the entire fused network, which improves the performance of the fused FCN and recurrent network designed in the embodiments.

In the embodiments of the present application, the error algorithm used to determine the error between the best-matching text line and the labeled text position is not limited. For example, it may be a cross-entropy error algorithm.
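For reference, the generic cross-entropy between a target distribution and a predicted distribution can be computed as below. How the text positions are encoded as such distributions is not stated in the source, so this sketch shows only the formula itself. The `eps` clamp is an added numerical safeguard.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum(t * log(p)) between two discrete distributions.

    Predicted probabilities are clamped to eps to avoid log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))
```

The value is 0 when the prediction puts all probability on the target class and grows as the prediction moves away from the target.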

Based on the same concept as the above method embodiments, the embodiments of the present application further provide a device for detecting text in a natural scene image. Where integrated units are used, Fig. 7 shows a schematic logical structure of the device for detecting text in a natural scene image. The device can be applied to equipment that detects text in a natural scene image. Referring to Fig. 7, the device 100 for detecting text in a natural scene image includes an acquiring unit 101 and a processing unit 102. The acquiring unit 101 is configured to obtain a natural scene image. The acquiring unit 101 may be a communication interface or transceiver of the device itself, for example when a remote device transmits the natural scene image to the transceiver or communication interface of the device in a wireless or wired manner. It may also be an input interface of the device itself (such as a keyboard, a USB interface, or a touch screen), through which a user can input the natural scene image into the device. The processing unit 102 is configured to: perform, by means of an FCN model, a convolution operation on the natural scene image obtained by the acquiring unit 101 to obtain convolution features of the natural scene image; determine, according to the convolution features of the natural scene image, a text candidate region sequence included in the natural scene image; and, for each text candidate region in the text candidate region sequence, extract the convolution features of the text candidate region through a region-of-interest pooling layer (roi-pooling), convert the convolution features of the text candidate region into a feature vector of fixed dimension k through feature transformation, where k is a positive integer, and determine, according to a recurrent network model and the feature vector of fixed dimension k, the position of the text line included in the text candidate region.
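The role of the region-of-interest pooling step is to turn a region of arbitrary size into a fixed number of values. A minimal single-channel sketch, assuming max pooling over an `out_h` by `out_w` grid so that k equals `out_h * out_w`, is given below. The pooling type, the grid shape, and the function name are assumptions. The source only states that the result has fixed dimension k.

```python
def roi_max_pool(feature_map, region, out_h, out_w):
    """Max-pool a rectangular region of a 2D feature map into a flat list
    of out_h * out_w values.

    region is (top, left, bottom, right) with bottom and right exclusive.
    Bins are spread evenly over the region and always cover at least one
    cell, so the output length is fixed regardless of the region size.
    """
    top, left, bottom, right = region
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        r0 = top + (i * height) // out_h
        r1 = max(r0 + 1, top + ((i + 1) * height + out_h - 1) // out_h)
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = max(c0 + 1, left + ((j + 1) * width + out_w - 1) // out_w)
            pooled.append(max(feature_map[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled
```

Regions of different sizes pooled with the same grid produce vectors of the same length, which is what allows them to be fed to the same recurrent network.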

Each text candidate region in the text candidate region sequence includes at least one text line, a text line being a single line of text included in the text candidate region.

In one possible design, the processing unit 102 may specifically: fuse the convolution features of the natural scene image to determine convolution features that characterize the text positions in the natural scene image; map the natural scene image using the convolution features that characterize the text positions in the natural scene image, and label the text positions and non-text positions in the natural scene image by means of a classification function; and determine at least one region of the natural scene image labeled as a text position as the text candidate region sequence.
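The labeling and region-extraction steps can be sketched on a small grid. The sketch assumes the classification function has already produced a per-position text probability, that 0.5 separates text from non-text, and that 4-connected text positions form one candidate region. None of these details is given in the source, and `text_candidate_regions` is a hypothetical name.

```python
from collections import deque


def text_candidate_regions(text_prob, threshold=0.5):
    """Group positions labeled as text into bounding boxes.

    text_prob is a 2D list of text probabilities. Returns a list of
    (top, left, bottom, right) boxes with inclusive bounds, in scan order.
    """
    rows, cols = len(text_prob), len(text_prob[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or text_prob[r][c] < threshold:
                continue
            # Breadth-first search over 4-connected text positions.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and text_prob[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((top, left, bottom, right))
    return regions
```

The returned list plays the role of the text candidate region sequence: each box is then processed independently by the later steps.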

In another possible design, the recurrent network model includes N layers of long short-term memory (LSTM), where N is a positive integer greater than or equal to the maximum number of text lines, and the maximum number of text lines is the number of text lines in the text candidate region that contains the most text lines in the text candidate region sequence. The processing unit 102 may specifically: take the feature vector of fixed dimension k as the time-frame input of the N LSTM layers and input it step by step into the LSTMs included in the recurrent network model, where at first only the feature vector of fixed dimension k is input into the first LSTM layer of the recurrent network model, and thereafter, each time, the output of the preceding LSTM layer and the feature vector of fixed dimension k are input into the next LSTM layer; train the recurrent network model using the feature vector of fixed dimension k and the pre-labeled text positions to obtain text-line candidate boxes; regress, detect, and connect the top, bottom, left, and right edges of each text-line candidate box to determine the tilt angle of the text-line candidate box; and determine, according to the text-line candidate box and the tilt angle of the text-line candidate box, the position of the text line included in the text candidate region.
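The input pattern described above, in which the first layer sees only the feature vector and every later layer sees the previous output together with the same feature vector, can be sketched as a simple data-flow loop. Each layer is modeled here as an arbitrary callable. A real LSTM layer would additionally carry hidden and cell state, which this illustrative sketch omits. The name `run_layer_chain` is hypothetical.

```python
def run_layer_chain(feature, layers):
    """Feed the same feature vector to every layer in turn.

    layers is a list of callables taking (previous_output, feature). The
    first layer receives None as previous_output because it is given only
    the feature vector. Returns the list of per-layer outputs.
    """
    outputs = []
    previous = None
    for layer in layers:
        previous = layer(previous, feature)
        outputs.append(previous)
    return outputs


def toy_layer(previous_output, feature):
    """Stand-in for an LSTM layer: adds the feature sum to the previous output."""
    base = 0 if previous_output is None else previous_output
    return base + sum(feature)
```

With `run_layer_chain([1, 2], [toy_layer] * 3)` the three outputs are 3, 6, and 9, which shows each layer building on the previous one while reusing the same feature vector.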

In yet another possible design, after determining the position of the text line included in the text candidate region, the processing unit 102 may further: match, by means of a matching algorithm, the positions of the text lines included in the determined text candidate region against the pre-labeled text positions and determine the text lines that best match the pre-labeled text positions; and determine, by means of an error algorithm, the error between each best-matching text line and the labeled text position, and update the network parameters according to the error.

The value of N mentioned in the above embodiments may be, but is not limited to, 5.

The division into modules in the embodiments of the present application is schematic and is only a division by logical function. Other divisions are possible in an actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated in one processor or may exist physically on their own, and two or more modules may also be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.

When the integrated module is implemented in the form of hardware, it is as shown in Fig. 8. Fig. 8 is a schematic diagram of equipment 1000 for detecting text in a natural scene image provided by the embodiments of the present application. The equipment 1000 can be used to perform the methods of Fig. 2 to Fig. 4. As shown in Fig. 8, the equipment 1000 includes a processor 1001 and a memory 1002. The memory 1002 stores a computer program, instructions, or code. The processor 1001 can call and execute the program, instructions, or code stored in the memory 1002 to implement the steps and functions of the above embodiments, which are not repeated here. For the specific implementation of the processor 1001, refer to the descriptions of the acquiring unit 101 and the processing unit 102 in the above Fig. 7 embodiment, which are not repeated here.

It can be understood that Fig. 8 shows only a simplified design of the equipment for detecting text in a natural scene image. In practical applications, the equipment is not limited to the above structure and may include any number of interfaces, processors, memories, and so on. All equipment for detecting text in a natural scene image that can implement the embodiments of the present application falls within the scope of protection of the embodiments.

It can be further understood that the device 100 for detecting text in a natural scene image and the equipment 1000 for detecting text in a natural scene image involved in the embodiments of the present application can be used to implement the corresponding functions of the above method embodiments. For anything not described in sufficient detail in the embodiments, refer to the description of the related method embodiments, which is not repeated here.

It can be further understood that the processor involved in the embodiments of the present application may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A part of the memory may also include non-volatile random access memory. For example, the memory may also store information on the device type.

In addition to a data bus, the bus system may include a power bus, a control bus, a status signal bus, and so on. For clarity of description, however, the various buses are all labeled as the bus system in the figure.

In the course of implementation, each step of the above method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The steps of the method for detecting text in a natural scene image disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory. The processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware. To avoid repetition, details are not described here.

Based on the same concept as the above method embodiments, the embodiments of the present application further provide a computer-readable storage medium storing instructions which, when called and executed by a computer, cause the computer to perform the method involved in the above method embodiments or in any possible design of the method embodiments.

Based on the same concept as the above method embodiments, the present application further provides a computer program product which, when called and executed by a computer, causes the computer to perform the method involved in the above method embodiments or in any possible design of the method embodiments.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, the equipment (system), and the computer program product according to the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, such that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.

Claims

1. A method for detecting text in a natural scene image, characterized by comprising:

obtaining a natural scene image;

performing, by means of a fully convolutional network (FCN) model, a convolution operation on the natural scene image to obtain convolution features of the natural scene image;

determining, according to the convolution features of the natural scene image, a text candidate region sequence included in the natural scene image, wherein each text candidate region in the text candidate region sequence includes at least one text line, the text line being a single line of text included in the text candidate region;

for each text candidate region in the text candidate region sequence, performing:

extracting the convolution features of the text candidate region through a region-of-interest pooling layer (roi-pooling), and converting, through feature transformation, the convolution features of the text candidate region into a feature vector of fixed dimension k, where k is a positive integer; and

determining, according to a recurrent network model and the feature vector of fixed dimension k, the position of the text line included in the text candidate region.
2. The method according to claim 1, characterized in that determining, according to the convolution features of the natural scene image, the text candidate region sequence included in the natural scene image comprises:

fusing the convolution features of the natural scene image to determine convolution features that characterize the text positions in the natural scene image;

mapping the natural scene image using the convolution features that characterize the text positions in the natural scene image, and labeling the text positions and non-text positions in the natural scene image by means of a classification function; and

determining at least one region of the natural scene image labeled as a text position as the text candidate region sequence.
3. The method according to claim 2, characterized in that the recurrent network model includes N layers of long short-term memory (LSTM), wherein N is a positive integer greater than or equal to the maximum number of text lines, and the maximum number of text lines is the number of text lines in the text candidate region that contains the most text lines in the text candidate region sequence;

determining, according to the recurrent network model and the feature vector of fixed dimension k, the position of the text line included in the text candidate region comprises:

taking the feature vector of fixed dimension k as the time-frame input of the N LSTM layers and inputting it step by step into the LSTMs included in the recurrent network model, wherein at first only the feature vector of fixed dimension k is input into the first LSTM layer of the recurrent network model, and thereafter, each time, the output of the preceding LSTM layer and the feature vector of fixed dimension k are input into the next LSTM layer, and training the recurrent network model using the feature vector of fixed dimension k and pre-labeled text positions to obtain a text-line candidate box;

regressing, detecting, and connecting the top, bottom, left, and right edges of the text-line candidate box to determine the tilt angle of the text-line candidate box; and

determining, according to the text-line candidate box and the tilt angle of the text-line candidate box, the position of the text line included in the text candidate region.
4. The method according to claim 3, characterized in that, after the position of the text line included in the text candidate region is determined, the method further comprises:

matching, by means of a matching algorithm, the positions of the text lines included in the determined text candidate region against the pre-labeled text positions, and determining the text line that best matches the pre-labeled text position; and

determining, by means of an error algorithm, the error between the best-matching text line and the labeled text position, and updating network parameters according to the error.
5. The method according to claim 3 or 4, characterized in that N is set to 5.
6. A device for detecting text in a natural scene image, characterized by comprising:

an acquiring unit, configured to obtain a natural scene image; and

a processing unit, configured to: perform, by means of a fully convolutional network (FCN) model, a convolution operation on the natural scene image to obtain convolution features of the natural scene image; determine, according to the convolution features of the natural scene image, a text candidate region sequence included in the natural scene image, wherein each text candidate region in the text candidate region sequence includes at least one text line, the text line being a single line of text included in the text candidate region; and, for each text candidate region in the text candidate region sequence, perform: extracting the convolution features of the text candidate region through a region-of-interest pooling layer (roi-pooling), converting, through feature transformation, the convolution features of the text candidate region into a feature vector of fixed dimension k, where k is a positive integer, and determining, according to a recurrent network model and the feature vector of fixed dimension k, the position of the text line included in the text candidate region.
7. The device according to claim 6, characterized in that, when determining, according to the convolution features of the natural scene image, the text candidate region sequence included in the natural scene image, the processing unit is specifically configured to:

fuse the convolution features of the natural scene image to determine convolution features that characterize the text positions in the natural scene image;

map the natural scene image using the convolution features that characterize the text positions in the natural scene image, and label the text positions and non-text positions in the natural scene image by means of a classification function; and

determine at least one region of the natural scene image labeled as a text position as the text candidate region sequence.
8. The device according to claim 7, characterized in that the recurrent network model includes N layers of long short-term memory (LSTM), wherein N is a positive integer greater than or equal to the maximum number of text lines, and the maximum number of text lines is the number of text lines in the text candidate region that contains the most text lines in the text candidate region sequence;

when determining, according to the recurrent network model and the feature vector of fixed dimension k, the position of the text line included in the text candidate region, the processing unit is specifically configured to:

take the feature vector of fixed dimension k as the time-frame input of the N LSTM layers and input it step by step into the LSTMs included in the recurrent network model, wherein at first only the feature vector of fixed dimension k is input into the first LSTM layer of the recurrent network model, and thereafter, each time, the output of the preceding LSTM layer and the feature vector of fixed dimension k are input into the next LSTM layer, and train the recurrent network model using the feature vector of fixed dimension k and pre-labeled text positions to obtain a text-line candidate box;

regress, detect, and connect the top, bottom, left, and right edges of the text-line candidate box to determine the tilt angle of the text-line candidate box; and

determine, according to the text-line candidate box and the tilt angle of the text-line candidate box, the position of the text line included in the text candidate region.
9. The device according to claim 8, characterized in that the processing unit is further configured to:

after the position of the text line included in the text candidate region is determined, match, by means of a matching algorithm, the positions of the text lines included in the determined text candidate region against the pre-labeled text positions, and determine the text line that best matches the pre-labeled text position; and

determine, by means of an error algorithm, the error between the best-matching text line and the labeled text position, and update network parameters according to the error.
10. The device according to claim 8 or 9, characterized in that N is set to 5.
11. Equipment, characterized by comprising the device for detecting text in a natural scene image according to any one of claims 6 to 10.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 5.
13. A computer program product, characterized in that the computer program product, when called by a computer, causes the computer to perform the method according to any one of claims 1 to 5.