CN109271967A - Method and apparatus for recognizing text in an image, electronic device, and storage medium - Google Patents
- Publication number
- CN109271967A CN109271967A CN201811202558.2A CN201811202558A CN109271967A CN 109271967 A CN109271967 A CN 109271967A CN 201811202558 A CN201811202558 A CN 201811202558A CN 109271967 A CN109271967 A CN 109271967A
- Authority
- CN
- China
- Prior art keywords
- text
- identification
- layer
- image
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Character Discrimination (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a method and apparatus for recognizing text in an image, an electronic device, and a computer-readable storage medium. The scheme performs end-to-end recognition of text in an image through a multi-layer stacked network model, and includes: performing spatially separable convolution operations on the image layer by layer, and fusing the convolution features extracted by each spatially separable convolution operation into the lower layer to which that layer maps, the lower layer mapping to the higher layer that outputs the convolution features; obtaining global features from the bottom layer that performs the spatially separable convolution operations; performing candidate-region detection and region screening-parameter prediction for text in the image on the global features, to obtain pooled features corresponding to the detected text regions; and propagating the pooled features to the recognition branch layers that perform character recognition, the recognition branch layers outputting the character sequence labelling each text region. The scheme saves model-training time and improves recognition accuracy.
Description
Technical field
The present invention relates to the technical field of image processing, and in particular to a method and apparatus for recognizing text in an image, an electronic device, and a computer-readable storage medium.
Background art
In the field of computer image processing, text recognition refers to having a computer automatically determine which word in a character library a character in an image belongs to. The character library is established in advance and generally contains the characters most common in everyday life.
Text in an image is usually recognized by building two models. One model is used to find text positions in a natural-scene image containing text, after which the text regions are cropped from the image. The other model is used to recognize the specific character content of the text regions. Specifically, a large number of sample images containing various characters are first obtained as a training set, and these sample images are used to train a character classifier and a text locator separately. After training is complete, the text locator first locates text regions in the test image; the text regions are then cropped out, and the character classifier recognizes their character content.
The above scheme requires training the character classifier and the text locator separately on the sample images, so the model-training workload is large. Moreover, the final character-recognition accuracy is affected by the accuracy of both models, which limits the improvement of text-recognition accuracy in images.
Summary of the invention
In order to solve the problems in the related art that a character classifier and a text locator must be trained separately, that the model-training workload is large, and that recognition accuracy is not high, the present invention provides a method for recognizing text in an image.
The present invention provides a method for recognizing text in an image. The method performs end-to-end recognition of text in an image through a multi-layer stacked network model, and comprises:
performing spatially separable convolution operations on the image layer by layer, and fusing the convolution features extracted by each spatially separable convolution operation into the lower layer to which that layer maps, the lower layer mapping to the higher layer that outputs the convolution features;
obtaining global features from the bottom layer that performs the spatially separable convolution operations;
performing candidate-region detection and region screening-parameter prediction for text in the image on the global features, to obtain pooled features corresponding to the detected text regions;
propagating the pooled features to the recognition branch layers that perform character recognition, and outputting, through the recognition branch layers, the character sequence labelling the text regions.
In another aspect, the present invention provides an apparatus for recognizing text in an image. The apparatus performs end-to-end recognition of text in an image through a multi-layer stacked network model, and comprises:
a spatial convolution module, configured to perform spatially separable convolution operations on the image layer by layer, fusing the convolution features extracted by each spatially separable convolution operation into the lower layer to which that layer maps, the lower layer mapping to the higher layer that outputs the convolution features;
a global-feature extraction module, configured to obtain global features from the bottom layer that performs the spatially separable convolution operations;
a pooled-feature acquisition module, configured to perform candidate-region detection and region screening-parameter prediction for text in the image on the global features, to obtain pooled features corresponding to the detected text regions;
a character-sequence output module, configured to propagate the pooled features to the recognition branch layers that perform character recognition, and to output, through the recognition branch layers, the character sequence labelling the text regions.
In another aspect, the present invention also provides an electronic device, comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above method for recognizing text in an image.
In addition, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above method for recognizing text in an image.
The technical solutions provided by the embodiments of the present invention may include the following beneficial effects:
The technical solution provided by the present invention performs end-to-end recognition of text in an image through a multi-layer stacked network model. Only one network model needs to be trained to recognize text in an image, with no need to train a text locator and a character classifier separately. This reduces the model-training workload, and the final recognition accuracy is affected only by the accuracy of the one network model, so the mutual limitation of two models on accuracy improvement is avoided, which is conducive to improving recognition accuracy.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of an implementation environment according to the present invention;
Fig. 2 is a block diagram of a device according to an exemplary embodiment;
Fig. 3 is a flowchart of a method for recognizing text in an image according to an exemplary embodiment;
Fig. 4 is a schematic diagram of the architecture of the spatially separable convolutional network layers;
Fig. 5 is a schematic diagram of the architecture of an image text-recognition scheme compared against the present invention;
Fig. 6 is a detailed flowchart of step 350 in the embodiment of Fig. 3;
Fig. 7 is a schematic diagram of the pooling layer extracting pixel-level region screening parameters from the global features;
Fig. 8 is a detailed flowchart of step 353 in the embodiment of Fig. 6;
Fig. 9 is a detailed flowchart of step 370 in the embodiment of Fig. 3;
Fig. 10 is a schematic diagram of the architecture of the recognition branch layers;
Fig. 11 is a schematic diagram of the network architecture of the method for recognizing text in an image provided by the present invention;
Fig. 12 is a flowchart of a method for recognizing text in an image provided by another embodiment on the basis of the embodiment of Fig. 3;
Fig. 13 is a detailed flowchart of step 1230 in the embodiment of Fig. 12;
Fig. 14 is a detailed flowchart of step 1231 in the embodiment of Fig. 13;
Fig. 15 is a schematic diagram of a practical application effect of the present invention;
Fig. 16 is a block diagram of an apparatus for recognizing text in an image according to an exemplary embodiment;
Fig. 17 is a detailed block diagram of the pooled-feature acquisition module in the embodiment of Fig. 16;
Fig. 18 is a detailed block diagram of the screening and rotating unit in the embodiment of Fig. 17.
Detailed description of the embodiments
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment according to the present invention. The implementation environment includes a user equipment 110, which can recognize text in an image by running an application. The user equipment may be a server, a desktop computer, a mobile terminal, a smart appliance, or the like.
The user equipment 110 may have an image acquisition device 111 such as a camera, and text recognition is then performed on the images acquired by the image acquisition device 111 using the method provided by the present invention.
As needed, the implementation environment may also include, in addition to the user equipment 110, a server 130 connected to the user equipment 110 through a wired or wireless network. The server 130 sends images to be recognized to the user equipment 110, which then recognizes the text in the images.
In practical applications, the text content recognized from an image can be further used for text translation, text-content editing, storage, and so on. The method for recognizing text in an image provided by the present invention can be applied to text-recognition tasks in any scene to understand the text content in an image, for example text recognition in natural-scene text pictures, advertising pictures, videos, identity cards, driver's licenses, business cards, and license plates.
Fig. 2 is a block diagram of a device 200 according to an exemplary embodiment. For example, the device 200 may be the user equipment 110 in the implementation environment shown in Fig. 1.
Referring to Fig. 2, the device 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, a sensor component 214, and a communication component 216.
The processing component 202 generally controls the overall operation of the device 200, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 202 may include one or more processors 218 to execute instructions to complete all or part of the steps of the methods below. In addition, the processing component 202 may include one or more modules to facilitate interaction between the processing component 202 and other components; for example, the processing component 202 may include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operation on the device 200. Examples of such data include instructions for any application or method operated on the device 200. The memory 204 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc. The memory 204 also stores one or more modules configured to be executed by the one or more processors 218 to complete all or part of the steps of any of the methods shown in Fig. 3, Fig. 6, Fig. 8, Fig. 9, and Figs. 12-14.
The power component 206 provides power to the various components of the device 200. The power component 206 may include a power-management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 200.
The multimedia component 208 includes a screen providing an output interface between the device 200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action but also detect the duration and pressure associated with the touch or swipe operation. The screen may also include an organic light-emitting display (OLED).
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a microphone (MIC), which is configured to receive external audio signals when the device 200 is in an operating mode, such as a call mode, a recording mode, or a speech-recognition mode. The received audio signal may be further stored in the memory 204 or transmitted via the communication component 216. In some embodiments, the audio component 210 further includes a speaker for outputting audio signals.
The sensor component 214 includes one or more sensors for providing state assessments of various aspects of the device 200. For example, the sensor component 214 can detect the open/closed state of the device 200 and the relative positioning of components; the sensor component 214 can also detect a change in position of the device 200 or one of its components, and a change in temperature of the device 200. In some embodiments, the sensor component 214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the device 200 and other devices. The device 200 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In one exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 216 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth technology, or other technologies.
In an exemplary embodiment, the device 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods below.
Fig. 3 is a flowchart of a method for recognizing text in an image according to an exemplary embodiment. The method may be executed by a user equipment, which may be the user equipment 110 of the implementation environment shown in Fig. 1. The method performs end-to-end recognition of text in an image through a multi-layer stacked network model, where end-to-end recognition means that the input of the network model is the raw image data and the output is the final character sequence. As shown in Fig. 3, the method specifically includes the following steps.
In step 310, spatially separable convolution operations are performed on the image layer by layer, and the convolution features extracted by each spatially separable convolution operation are fused into the lower layer to which that layer maps, the lower layer mapping to the higher layer that outputs the convolution features.
It should be noted that the multi-layer stacked network model may include spatially separable convolutional network layers, a region regression network layer, a pooling layer, temporal convolutional network layers, and a character classification layer. The spatially separable convolutional network layers, the region regression network layer, and the pooling layer serve as the detection branch, which extracts the pooled features of the text regions in the image from the raw image data; the temporal convolutional network layers and the character classification layer serve as the recognition branch, which outputs the character sequence of a text region from its pooled features.
Specifically, the spatially separable convolution operations refer to layer-by-layer convolution calculations performed on the image to be recognized by spatially separable convolution (EffNet) layers. The spatially separable convolutional layers include mutually mapped higher and lower layers; "higher" and "lower" are relative concepts, where a layer whose result is calculated first is called a higher layer and one calculated later is called a lower layer. Fusing the convolution features extracted by a higher layer into the lower layer to which it maps means that the lower layer's convolution result is combined with the higher layer's convolution result. Because more details are lost as the number of convolution layers grows, fusing the features extracted by higher layers into the lower layers retains more detail and avoids information loss.
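By way of illustration only (not part of the disclosed embodiments), a spatially separable convolution replaces one KxK kernel with a Kx1 pass followed by a 1xK pass. The sketch below, in plain Python, checks the basic identity that for a rank-1 kernel (outer product of a column and a row vector) the factorized result equals the full 2-D convolution; the image values and kernel weights are arbitrary illustrative choices.

```python
def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D convolution (cross-correlation) on nested lists."""
    ih, iw = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            s = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    s += img[y + dy][x + dx] * kernel[dy][dx]
            row.append(s)
        out.append(row)
    return out

def separable_conv(img, col, row):
    """Apply a Kx1 column kernel, then a 1xK row kernel."""
    vert = conv2d_valid(img, [[c] for c in col])  # Kx1 pass
    return conv2d_valid(vert, [row])              # 1xK pass

if __name__ == "__main__":
    img = [[float((x + y) % 5) for x in range(6)] for y in range(6)]
    col, row = [1.0, 2.0, 1.0], [0.5, 1.0, 0.5]
    full = [[c * r for r in row] for c in col]    # rank-1 3x3 kernel
    a = conv2d_valid(img, full)
    b = separable_conv(img, col, row)
    assert all(abs(a[i][j] - b[i][j]) < 1e-9
               for i in range(len(a)) for j in range(len(a[0])))
```

The factorization trades one KxK pass for two cheaper one-dimensional passes, which is the source of the acceleration discussed later in the description.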
In step 330, global features are obtained from the bottom layer that performs the spatially separable convolution operations.
Here, the bottom layer refers to the last output layer of the spatially separable convolutional layers. The spatially separable convolutional layers perform spatially separable convolution operations on the original image to be recognized layer by layer, and the feature matrix finally output is called the global features. The global features characterize the feature information of the original input image.
Fig. 4 is a schematic diagram of the architecture of the spatially separable convolutional layers. As shown in Fig. 4, the original image to be recognized is the input of the spatially separable convolutional layers, after which convolution calculations are performed layer by layer, the features extracted by each higher layer are fused into the lower layer to which it maps, and the global features are output at the bottom layer. Each parallelogram represents the convolution features extracted by one layer.
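For concreteness, one way the higher-to-lower feature fusion could be realized is nearest-neighbour upsampling of the higher-level map followed by elementwise addition into the lower-level map. The patent does not specify the fusion operator, so both the upsampling method and the addition are assumptions in this sketch.

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse(low, high):
    """Add the upsampled higher-level map into the lower-level map."""
    up = upsample2x(high)
    return [[low[y][x] + up[y][x] for x in range(len(low[0]))]
            for y in range(len(low))]

if __name__ == "__main__":
    high = [[1.0, 2.0], [3.0, 4.0]]          # coarse higher-level feature
    low = [[0.0] * 4 for _ in range(4)]      # finer lower-level feature
    fused = fuse(low, high)                  # detail from 'high' is retained
```

Whatever the exact operator, the effect described in the text is the same: detail extracted early in the stack is carried down so it is not lost by repeated convolution.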
In step 350, candidate-region detection and region screening-parameter prediction for text in the image are performed on the global features, to obtain pooled features corresponding to the detected text regions.
Candidate-region detection means detecting, from the global features, candidate regions in the image where text may be located; there can be multiple candidate regions. Region screening-parameter prediction means obtaining predicted values of the region screening parameters from the global features; with these predicted values the candidate regions can be screened, improving the detection accuracy of text regions in the image. The pooled features of a text region refer to the feature data of that region output by the pooling layer. In one embodiment, the pooled features can be the image data of the text region after flattening, where flattening means rotating an inclined text region to the horizontal position.
Specifically, the global features output by the spatially separable convolutional network layers can be input into the region regression network layer and the pooling layer respectively. The region regression network layer performs candidate-region detection of text in the image and outputs candidate regions of the text borders, called border candidate regions for short. The pooling layer performs a convolution transform on the global features to predict the region screening parameters, then screens the border candidate regions according to those parameters, so that the text regions in the image can be detected; inclined text regions are then rotated, yielding the flattened image data of the text regions as their pooled features.
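The flattening step above amounts to rotating an inclined quadrilateral about its centre until its top edge is horizontal. The patent gives no formula for this, so the following is a generic geometric sketch: the rotation angle is taken from the first edge of the box, an assumption about how the corners are ordered.

```python
import math

def rotate_box_to_horizontal(corners):
    """Rotate an inclined quadrilateral about its centroid so that the
    edge from corners[0] to corners[1] becomes horizontal."""
    cx = sum(x for x, _ in corners) / 4.0
    cy = sum(y for _, y in corners) / 4.0
    (x0, y0), (x1, y1) = corners[0], corners[1]
    theta = math.atan2(y1 - y0, x1 - x0)   # current inclination of top edge
    c, s = math.cos(-theta), math.sin(-theta)
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c) for x, y in corners]
```

Applying this to a 4x2 box tilted by 30 degrees returns an axis-aligned box of the same size, i.e. the two corners of the top edge end up with equal y coordinates.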
In step 370, the pooled features are propagated to the recognition branch layers that perform character recognition, and the recognition branch layers output the character sequence labelling the text regions.
The recognition branch layers are the last several layers of the multi-layer stacked network model, used to recognize the characters contained in a text region from its pooled features. Specifically, the recognition branch layers include the temporal convolutional network layers and the character classification layer of the network model. The pooling layer propagates the pooled features of a text region to the temporal convolutional network layers, which perform convolution calculations on the pooled features to extract character-sequence features; the character-sequence features are then passed to the character classification layer, which outputs, for each character, the probability that it is each character in the dictionary.
As an example, assume the dictionary contains 7439 characters. The character classification layer can then output, for each character in a text region, the probability that it is each of the characters in the dictionary; the dictionary character with the highest probability is the recognition result for that character. The recognition result of each character can be output accordingly, and for the multiple characters in a text region this yields the character sequence labelling the text region.
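The per-character decoding just described — a softmax over the dictionary followed by an argmax at each position — can be sketched as follows. The three-entry toy dictionary stands in for the 7439-character one mentioned above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one character position."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decode(per_char_logits, dictionary):
    """Pick the highest-probability dictionary entry at each character
    position and join the picks into the labelled character sequence."""
    out = []
    for logits in per_char_logits:
        probs = softmax(logits)
        out.append(dictionary[probs.index(max(probs))])
    return "".join(out)

if __name__ == "__main__":
    dictionary = ["a", "b", "c"]                    # toy 3-entry dictionary
    logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]     # two character positions
    assert decode(logits, dictionary) == "ba"
```

This greedy argmax is the simplest decoding consistent with the description; sequence-level decoders would be a refinement the patent does not discuss at this point.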
The technical solution provided by the above exemplary embodiments of the present invention performs end-to-end recognition of text in an image through a multi-layer stacked network model. Only one network model needs to be trained to recognize text in an image, with no need to train a text locator and a character classifier separately. This reduces the model-training workload, and the final recognition accuracy is affected only by the accuracy of one network model, avoiding the mutual limitation of two models on accuracy improvement and thus benefiting recognition accuracy.
In contrast to the technical solution provided by the above exemplary embodiments, Fig. 5 is a flowchart of another text-recognition scheme. As shown in Fig. 5, that scheme divides text detection and recognition into two tasks, and the recognition task can only proceed after the detection task is complete. Specifically, during detection the original image is first input into a feature-extraction convolutional network, and the extracted features are passed to a region regression network, which outputs detected border candidate regions. These regions are still relatively rough and require a further border regression to improve border accuracy so as to fit closer to the text edges. The secondary border regression and classification give the coordinates of the text borders in the image and the corresponding confidence, i.e., the possibility that text is contained. These two predictions can be compared with the labelled text positions in the image, the prediction loss is calculated by a loss function, and the model parameters are adjusted and updated according to this loss.
When detecting inclined text, the border candidate regions detected by the region regression network contain large blank areas, which reduces the precision of the detection boxes. Therefore, the border candidate regions output by the region regression network and the global features extracted by the feature-extraction convolutional network are input together into a rotated region-of-interest pooling layer to obtain the detected inclined text regions. As shown in Fig. 5, the inclined text regions are marked in the original image as text boxes, and the corresponding regions are then cropped from the original image according to the text-box coordinates, thereby first completing the localization of the text regions. It should be noted that at this stage there is already a text-region localization error.
Afterwards, the cropped text-region images are input into the recognition network. The recognition network first performs convolution feature extraction on each input region image, and the extracted convolution features are then supplied to a character classification layer, which recognizes the character sequence expressed by the input. After all character regions in the original image have been recognized, the text-recognition task for that image is complete. It should be noted that at this stage another loss function is also required to calculate the difference between the character sequence output by the character classification layer and the actual character sequence, and the parameters of the recognition network and the character classification layer are adjusted and updated according to this difference. That is, there is also a character-recognition error at this stage, so the final overall recognition error includes both the text-region localization error and the character-recognition error.
It should further be noted that if the character-region localization error is large, the improvement of overall recognition accuracy is limited even if character-recognition accuracy improves. Training region detection and text recognition separately is unfavorable to performance: the error produced in the recognition stage cannot be propagated back to the detection part to correct the parameters of the detection model, so performance on some training sets is bottlenecked by either detection or recognition. Moreover, training the detection model and the recognition model separately increases the model-training workload. The feature-extraction convolutional network is also slow at extracting features, which limits the number of tasks the whole system can process per unit time and makes the model hard to deploy on mobile terminals.
The present invention, by contrast, realizes end-to-end recognition of text in an image through a multi-layer stacked network model: the input is the original image, the output is the character sequence, and the accuracy of the final character recognition is determined by the error of a single model. Merging the two tasks into one model avoids the performance bottleneck brought by separate training, which is conducive to improving recognition accuracy. Training only one network model suffices to recognize text, greatly saving model-training time: compared with training two models separately, at least half the time is saved, and in practice, because the parameter settings of two models differ, four to five times the hyper-parameter tuning time can be saved. In addition, the present invention uses the EffNet architecture for the spatially separable convolution operations on the image. Fusing the convolution features extracted by the spatially separable convolution operations into the mapped lower layers both accelerates the global-feature extraction stage and makes up for the defect of existing acceleration network structures that sacrifice model accuracy while accelerating, while also reducing the storage space required to run the model, facilitating deployment on mobile terminals.
In an exemplary embodiment, as shown in Fig. 6, step 350 above specifically includes:
In step 351, the global features are input to a region regression network layer that performs candidate-region detection, and the frame candidate regions of the text in the image are output by the region regression network layer;
It is to be understood that the present invention performs end-to-end identification of text in an image through a network model of multiple stacked layers, and the region regression network layer comprises several layers of that network model, used to detect the regions where text may be located, that is, to perform candidate-region detection.
Specifically, the spatially separable convolutional network layer of the network model extracts global features from the original image, and the global features are input to the region regression network layer, which outputs the frame candidate regions of the text in the image. A frame candidate region is a region that text edges may enclose. In the training stage, the region regression network layer can output the frame candidate regions of text; a secondary frame regression and classification is performed on the frame candidate regions to obtain the detected candidate frames and their confidence (the possibility of containing text), a multi-task loss is calculated according to the position coordinates of the actual text frames, and the parameters of the region regression network layer are adjusted to minimize the loss. The region regression network layer may be Faster R-CNN (a fast object-detection convolutional neural network); the main contribution of Faster R-CNN is a network architecture for extracting candidate regions that replaces the time-consuming selective search, so that the detection speed is greatly improved.
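As a rough illustration of what a learned candidate-region stage replaces selective search with, the sketch below enumerates anchor boxes over a feature map, in the style of a region-proposal network; the function name, sizes, and aspect ratios are illustrative assumptions, not parameters from this patent:

```python
def make_anchors(feat_h, feat_w, stride, sizes, ratios):
    # One anchor box (x1, y1, x2, y2) per (cell, size, ratio) combination,
    # centered on each feature-map cell projected back to image coordinates.
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# 4 x 4 feature map, 2 sizes x 2 ratios -> 64 candidate boxes to score and regress
anchors = make_anchors(feat_h=4, feat_w=4, stride=16, sizes=(32, 64), ratios=(0.5, 1.0))
```

The regression and classification heads described above would then score each such box and refine its coordinates.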
In step 352, the frame candidate regions are input to a pooling layer that performs region screening and region rotation;
The pooling layer is connected to the spatially separable convolutional network layer, and performs region screening and region rotation on the frame candidate regions according to the global features output by the spatially separable convolutional network layer. Region screening refers to filtering out the accurate text regions from the multiple frame candidate regions, and region rotation refers to rotating inclined text regions to a horizontal position. Thus, the frame candidate regions output by the region regression network layer and the global features output by the spatially separable convolutional network layer are input to the pooling layer together.
In step 353, the text regions are filtered out from the frame candidate regions and rotated to a horizontal position according to the pixel-level region screening parameters obtained by the pooling layer through region-screening-parameter prediction on the global features, and the pooled features of the text regions are obtained.
The pixel-level region screening parameters are the parameters, predicted from the global features, by which the frame candidate regions are screened and rotated. The pixel-level region screening parameters may include pixel-level classification confidence, pixel-level rotation angle, and pixel-level frame distance. A text region refers to a region where text is located. The pooling layer can perform convolution transforms on the global features with a variety of convolution kernels to obtain the pixel-level region screening parameters, then filter out the text regions from the multiple frame candidate regions according to the pixel-level region screening parameters, and then rotate the inclined text regions to a horizontal position, obtaining the pooled features of the text regions.
As shown in Fig. 7, the global features are transformed by a first convolution kernel to output the pixel-level classification confidence, that is, the probability that each pixel in the original image belongs to text. The global features are transformed by a second convolution kernel to output the pixel-level frame distance, that is, each pixel's predicted distance to the top, bottom, left, and right borders of the text where it is located. The global features are transformed by a third convolution kernel to output the pixel-level rotation angle, that is, the angle by which each pixel needs to be rotated to reach a horizontal position.
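A minimal numpy sketch of these three per-pixel prediction heads, assuming a 1 × 1 convolution is implemented as a channel-wise matrix multiply; the channel counts and random weights are illustrative, not the patent's:

```python
import numpy as np

def conv1x1(feat, weight, bias):
    # feat: (C, H, W); weight: (C_out, C); bias: (C_out,)
    c, h, w = feat.shape
    out = weight @ feat.reshape(c, h * w) + bias[:, None]
    return out.reshape(-1, h, w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
feat = rng.standard_normal((32, 8, 8))   # hypothetical global feature map

# first kernel: per-pixel text confidence in (0, 1)
score = sigmoid(conv1x1(feat, rng.standard_normal((1, 32)), np.zeros(1)))
# second kernel: distances to the top, bottom, left, right text borders
dists = conv1x1(feat, rng.standard_normal((4, 32)), np.zeros(4))
# third kernel: per-pixel rotation angle
angle = conv1x1(feat, rng.standard_normal((1, 32)), np.zeros(1))
```

Each head preserves the spatial resolution of the feature map, so every pixel receives its own confidence, border distances, and rotation angle.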
In an exemplary embodiment, as shown in Fig. 8, step 353 above specifically includes:
In step 3531, the pixel-level classification confidence generated by the pooling layer through convolutional calculation on the global features is obtained; the pixel-level classification confidence refers to the probability that each pixel in the image belongs to a text region;
Specifically, the pooling layer can perform a convolutional calculation on the global features (the feature image) with a convolution kernel of size 1 × 1 and stride 1, and output, for each pixel, the predicted confidence that the pixel belongs to text, obtaining the pixel-level classification confidence. A pixel with high confidence has a larger probability of belonging to a text region; likewise, a pixel with low confidence has a smaller probability of belonging to a text region.
In step 3532, the text regions are filtered out from the frame candidate regions according to the pixel-level classification confidence and the intersection-over-union of the frame candidate regions;
The intersection-over-union of frame candidate regions refers to the overlap proportion between different frame candidate regions. Since noise frames exist among the frame candidate regions, the present invention performs non-maximum suppression on the detection results of the frame candidate regions according to the pixel-level classification confidence and the intersection-over-union of the frame candidate regions, so as to filter out the text regions from the frame candidate regions and improve the accuracy of text-region detection.
Specifically, through a non-maximum suppression algorithm, the frame candidate regions with high confidence are retained according to the pixel-level classification confidence, non-overlapping frame candidate regions are retained, and frame candidate regions whose intersection-over-union with the retained regions is low are retained, so that the text regions are obtained by screening from all the frame candidate regions.
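The screening described above can be sketched as confidence-guided non-maximum suppression over axis-aligned candidate boxes; the overlap threshold of 0.5 is an illustrative assumption:

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # visit boxes in descending confidence; keep a box only if it does not
    # overlap an already-kept box by more than iou_thresh
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # kept == [0, 2]: the second box overlaps the first
```

The noise frames that heavily overlap a higher-confidence region are suppressed, while non-overlapping candidates survive.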
In step 3533, the text regions are rotated to a horizontal position through an interpolation algorithm according to the pixel-level rotation angle and pixel-level frame distance generated by the pooling layer through convolutional calculation on the global features, obtaining the pooled features of the text regions.
It should be noted that when obtaining the pixel-level classification confidence, the pooling layer can simultaneously perform convolutional calculations on the global features to obtain the pixel-level rotation angle and the pixel-level frame distance. As explained above, the pixel-level rotation angle refers to the angle by which a pixel needs to be rotated to reach a horizontal position, and the pixel-level frame distance refers to each pixel's predicted distance to the top, bottom, left, and right borders of the text where it is located. Specifically, the pooling layer can perform a convolutional calculation on the global features with a convolution kernel of size 1 × 1 and stride 4, outputting each pixel's distance to the top, bottom, left, and right borders of the text where it is located; and with another convolution kernel of size 1 × 1 and stride 4, outputting the angle by which each pixel needs to be rotated to reach a horizontal position.
Thus, according to the pixel-level rotation angle and the pixel-level frame distance, the pooling layer can rotate the inclined text regions to the horizontal direction; the pooled feature of a text region can be the image data of the text region after rotation to the horizontal direction.
Specifically, the pooling layer rotates the detected text regions to a horizontal position, which requires interpolation: the originally angled text regions are transformed to a horizontal position for the identification model to recognize. The interpolation needs a transformation matrix T to determine the correspondence between original points and target points. The calculation formula of the transformation matrix T is as follows:

T = v_ratio · [[cos π_i, −sin π_i, dx], [sin π_i, cos π_i, dy], [0, 0, 1/v_ratio]]

where v_ratio = roi_h / (t + b) denotes the ratio of the height roi_h of the transformed text-region mapping to the sum of the current point's distances to the top and bottom borders of the predicted text region; roi_h is a preset known quantity. Further, roi_w = v_ratio × (l + r), where roi_w denotes the width of the transformed text-region mapping, and

dx = l × cos π_i − t × sin π_i − x,
dy = t × cos π_i + l × sin π_i − y,

where r, l, t, b are respectively the current pixel's distances to the right, left, top, and bottom borders of the text frame predicted by the detection branch (i.e. the pixel-level frame distance), π_i denotes the tilt angle of the current pixel predicted by the detection branch (i.e. the pixel-level rotation angle), and (x, y) is the coordinate position of the current pixel in the original image. Assuming a point Psrc(xs, ys) before transformation and Pdst(xd, yd) after transformation, then Pdst = T · Psrc: the position of the feature mapping before transformation, multiplied by the transformation matrix T, gives the position of the transformed feature mapping, completing the coordinate interpolation and realizing the horizontal rotation of the text regions.
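The transform can be sketched numerically as follows. This follows the commonly used RoIRotate-style formulation (scaled rotation plus translation); the exact matrix layout and the dy sign convention are reconstructions and should be read as assumptions, and roi_h = 8 is an arbitrary preset:

```python
import numpy as np

def roi_transform(x, y, t, b, l, r, angle, roi_h=8):
    # v scales the region so its height (t + b) becomes roi_h
    v = roi_h / (t + b)
    dx = l * np.cos(angle) - t * np.sin(angle) - x
    dy = t * np.cos(angle) + l * np.sin(angle) - y
    T = v * np.array([[np.cos(angle), -np.sin(angle), dx],
                      [np.sin(angle),  np.cos(angle), dy],
                      [0.0,            0.0,           1.0 / v]])
    return T

# axis-aligned case (angle = 0): the pixel at (50, 40), sitting 16 px from the
# left border and 4 px from the top border, should land at (l, t) in the ROI
T = roi_transform(x=50, y=40, t=4, b=4, l=16, r=16, angle=0.0)
p = T @ np.array([50.0, 40.0, 1.0])   # p ≈ (16, 4, 1)
```

Applying T to every source coordinate and sampling (with interpolation for non-integer targets) yields the horizontal, fixed-height text-region feature map of width roi_w = v_ratio × (l + r).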
It should be emphasized that, unlike existing character recognition methods, which pass the detection results output by a detection model to an identification model to complete the identification of text, the present invention treats detection as one learning branch of the model, responsible for optimizing the feature map (i.e. the pooled features) that is ultimately input to the identification branch. Within the same model, the detection results (the detected text regions) are converted from numerical samples into a feature map directly usable by the identification branch, realizing joint learning and training of the detection and identification tasks.
In an exemplary embodiment, the identification branch network layers in step 370 above include a temporal convolutional network layer and a character classification layer, and as shown in Fig. 9, step 370 specifically includes:
In step 371, the pooled features are propagated backward to the temporal convolutional network layer for character-feature extraction;
Here, backward propagation refers to passing the pooled features output by the pooling layer to the temporal convolutional network layer, which performs convolution transforms on the pooled features to extract character-sequence features. Unlike existing CTC (connectionist temporal classification, a neural-network-based sequence classification) or attention network structures, the present invention uses a TCN (temporal convolutional network) as part of the identification branch network layers, which has the following advantages: because a TCN can be massively parallelized, both the training and the testing time of the network are greatly reduced; because a TCN can flexibly adjust its receptive-field size by choosing how many convolutional layers to stack, the model's long- and short-term memory length can be better controlled explicitly, whereas CTC or attention identification models cannot predict the number of recurrence cycles inside the model and thus have no way to control the memory length explicitly; the propagation direction of a TCN differs from the time direction of the input sequence, which avoids the gradient explosion or vanishing problems that often occur when training RNN models; and a TCN consumes less memory, which is especially evident on long input sequences, reducing the deployment overhead of the model.
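The dilated causal convolution at the heart of a TCN can be sketched in a few lines of numpy; this is a minimal single-channel illustration with made-up unit weights, not the patent's implementation:

```python
import numpy as np

def causal_dilated_conv1d(x, w, d):
    # x: (T,) input sequence; w: (k,) kernel; d: dilation factor
    # output[t] depends only on x[t], x[t-d], ..., x[t-(k-1)*d]  (causal)
    k = len(w)
    pad = (k - 1) * d
    xp = np.concatenate([np.zeros(pad), x])   # left-pad so length is preserved
    return np.array([sum(w[j] * xp[t + pad - j * d] for j in range(k))
                     for t in range(len(x))])

x = np.arange(1.0, 6.0)                              # [1, 2, 3, 4, 5]
y = causal_dilated_conv1d(x, np.array([1.0, 1.0]), d=2)
# y[t] = x[t] + x[t-2], e.g. y[4] = 5 + 3 = 8
```

Because every output position is computed independently, all positions can be evaluated in parallel, which is the source of the training-speed advantage claimed above.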
In step 372, the extracted character features are input to the character classification layer, and the character sequence labeled in the text region is output by the character classification layer.
The character features are the character-sequence features; inputting the extracted character-sequence features into the character classification layer can output, for each character in the text region, the probability of belonging to each character in the dictionary. The character with the highest probability in the dictionary is taken as the recognition result for that character in the text region, thereby obtaining the character sequence labeled in the text region.
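Taking the maximum-probability dictionary character at each step can be sketched as a CTC-style greedy decode; the tiny dictionary, the probabilities, and the convention that index 0 is the blank are made up for illustration:

```python
import numpy as np

def greedy_decode(probs, dictionary, blank=0):
    # probs: (T, V) per-step probabilities over the dictionary (plus a blank)
    ids = probs.argmax(axis=1)
    out, prev = [], blank
    for i in ids:
        # collapse repeats and drop blanks, as in CTC decoding
        if i != blank and i != prev:
            out.append(dictionary[i])
        prev = i
    return "".join(out)

dictionary = ["-", "a", "b"]            # index 0 is the blank symbol
probs = np.array([[0.1, 0.8,  0.1 ],
                  [0.1, 0.8,  0.1 ],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.2,  0.7 ]])
text = greedy_decode(probs, dictionary)  # -> "ab"
```

With a 7439-character dictionary as described below, V would simply be 7439 (plus the blank) instead of 3.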
Fig. 10 is an architecture diagram of the identification branch network layers. As shown in Fig. 10, the pooled features output by the pooling layer go through four temporal convolution operations; the input of each convolutional layer is transformed in turn by dilated causal convolution, weight normalization, an activation function, and random dropout to obtain the output of the current convolutional layer. The filter size k of the first convolution operation is 3 and the dilation factor d of its convolution kernel is 1; the filter size k of the second convolution operation is 3 and the dilation factor d is 1; the filter size k of the third convolution operation is 3 and the dilation factor d is 2; and the filter size k of the fourth convolution operation is 1 and the dilation factor d is 4. Afterwards, character-sequence features, i.e. the features of each character, are extracted by a bidirectional LSTM (long short-term memory network). A bidirectional LSTM is superior to a unidirectional LSTM in that it can simultaneously use information from both the past and the future, making the final prediction more accurate. The output of the bidirectional LSTM can be a 512-dimensional feature vector, after which the CTC decoder of the character classification layer classifies the output features into 7439 categories. The 7439 categories indicate that 7439 characters exist in the dictionary, so the output features can be classified to one of those 7439 characters.
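Under the simplifying assumption that each of the four operations above is a single stride-1 convolution (a full TCN block may contain more than one convolution, so the real receptive field can differ), the receptive field grows by (k − 1)·d per layer, which can be checked numerically:

```python
def receptive_field(layers):
    # layers: list of (kernel_size, dilation); stride 1 throughout
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# the four temporal convolutions described for the recognition branch:
# k=3 d=1, k=3 d=1, k=3 d=2, k=1 d=4
rf = receptive_field([(3, 1), (3, 1), (3, 2), (1, 4)])   # rf == 9
```

This is the sense in which stacking more (or more dilated) convolutional layers lets the model explicitly control how much context each output position sees.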
Fig. 11 is an architecture diagram of the network model for identifying text in an image provided by the present invention. As shown in Fig. 11, the original image is first input to the spatially separable convolutional network layer, which extracts global features from the original image; the global features are then input to the region regression network layer and the pooling layer respectively, and the region regression network layer detects the frame candidate regions according to the global features. In the training stage, secondary frame regression and frame classification can be performed to obtain the detected candidate frames and their confidence; a multi-task loss is calculated according to the positions of the text frames, and the parameters of the region regression network layer are adjusted to minimize the multi-task loss. The frame candidate regions output by the region regression network layer are input to the pooling layer, which, according to the global features input from the spatially separable convolutional network layer and the frame candidate regions input from the region regression network layer, performs screening and leveling of the frame candidate regions, obtaining the leveled text-region features, i.e. the pooled features. The leveled text-region features are in turn input to the temporal convolutional network layer to extract character-sequence features, and the character-sequence features are input to the character classifier, which outputs the character recognition result of the text in the image.
In an exemplary embodiment, as shown in Fig. 12, the method provided by the present invention further includes:
In step 1210, a sample image set in which text information is recorded on the images is obtained, the content of the text information being known;
The sample image set includes a large number of image samples on which text information is labeled, and the specific content of this text information is known. The sample image set can be stored in the local storage medium of the user equipment 110, or stored in the server 130.
In step 1230, the network model is trained using the sample image set, and the parameters of the network model are adjusted so that the difference between the character sequence output by the network model for each sample image and the corresponding text information is minimized.
Specifically, the sample image set can be used as the training set to train the network model required by the present invention for identifying text in images. Specifically, the sample image set can be used as the input of the network model, and according to the output of the network model, the parameters of the network model are adjusted so that the difference between the character-sequence recognition results output by the network model for the sample image set and the known text information is minimized. For example, the similarity between the character-sequence recognition results and the known text information can be calculated and maximized.
In an exemplary embodiment, as shown in Fig. 13, step 1230 above specifically includes:
In step 1231, the text identification error of the network model is obtained according to the error generated by the network model in performing text-region detection and the error generated in performing the character recognition operation;
The network model is divided into two tasks: text-region detection and the character recognition operation. The text identification error of the network model refers to the identification error of the overall framework of the network model. This error can be the sum of the error generated by text-region detection and the error generated by character recognition. The error generated by text-region detection can be the error in detecting the text regions before the pooled features are output, and the error generated by the character recognition operation can be the error in the classification and identification of the characters in the text regions after the pooled features are output.
In step 1232, according to the text identification error, the network layer parameters with which the network model performs the text-region detection and the network layer parameters with which it performs the character recognition operation are adjusted by back-propagation, minimizing the text identification error.
Back-propagation refers to adjusting the parameters of the preceding network layers according to the subsequent recognition results. Specifically, according to the identification error of the overall framework of the network model, i.e. the error of the finally output character sequence, the network layer parameters of the preceding text-region detection task and the network layer parameters of the character recognition operation are adjusted so that the error between the finally output character sequence and the true character sequence is minimized. Thus, the error generated in the recognition stage can be propagated to the detection part to correct the parameters of the detection stage.
In an exemplary embodiment, as shown in Fig. 14, step 1231 above specifically includes:
In step 1401, the error generated by the network model in performing text-region detection is determined according to the error generated by pixel-level classification prediction, the error generated by pixel-level frame-distance prediction, and the error generated by pixel-level rotation-angle prediction;
The error generated by pixel-level classification prediction refers to the error between the pixel-level classification confidence and the actual classification result of whether a pixel belongs to a text region. The error generated by pixel-level frame-distance prediction refers to the error between each pixel's predicted distance and actual distance to the top, bottom, left, and right borders of the text where it is located, and the error of pixel-level rotation-angle prediction is the error between the predicted rotation angle by which a pixel is rotated to a horizontal position and the actual rotation angle.
Specifically, the error generated by the network model in performing text-region detection is expressed as L_Detection:

L_Detection = L_cls + α·L_geo_reg

L_Detection is the total loss function of the detection branch (text-region detection); L_cls is the loss function of the pixel-level classification confidence in the detection branch, that is, the error generated by pixel-level classification prediction; L_geo_reg is the loss function of the pixel-level frame distance (the distances to the top, bottom, left, and right of the frame where the pixel is located), that is, the error between each pixel's predicted distance and actual distance to the top, bottom, left, and right borders of the text where it is located; α is the proportion of L_geo_reg in the total detection-branch loss.

L_cls = (1/N) Σ_i |u_i − u*_i|

where N is the number of positive elements in the confidence-map prediction matrix, u*_i is the label of whether the current pixel is text (value 0 or 1), and u_i is the predicted value of whether the current pixel is text (value 0 or 1).

L_geo_reg = (1/N) Σ_i [ L_IOU(B_i, B*_i) + β·(1 − cos(π_i − π*_i)) ]

where N is the number of positive elements in the confidence-map prediction matrix, π_i denotes the predicted pixel-level rotation angle, π*_i denotes the labeled pixel-level rotation angle, and β denotes the proportion of the angle loss within L_geo_reg. L_IOU(B_i, B*_i) denotes the IOU loss between the four geometric quantities B_i of the predicted frame (the distances to the top, bottom, left, and right borders of the text box where the pixel is located) and the four geometric quantities B*_i of the label (the distances to the top, bottom, left, and right borders of the text box where the pixel is located), and the IOU loss function is defined as follows:

L_IOU(B_i, B*_i) = −log( |B_i ∩ B*_i| / |B_i ∪ B*_i| )

where |B_i ∩ B*_i| denotes the intersection of the two text boxes and |B_i ∪ B*_i| denotes the union of the two text boxes.
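With boxes parameterized by a pixel's distances (t, b, l, r) to the borders, the intersection and union have the closed form sketched below; the numeric example is made up:

```python
import math

def iou_loss(pred, gt):
    # pred, gt: (t, b, l, r) distances from the same pixel to the box borders
    area_p = (pred[0] + pred[1]) * (pred[2] + pred[3])
    area_g = (gt[0] + gt[1]) * (gt[2] + gt[3])
    ih = min(pred[0], gt[0]) + min(pred[1], gt[1])   # intersection height
    iw = min(pred[2], gt[2]) + min(pred[3], gt[3])   # intersection width
    inter = ih * iw
    union = area_p + area_g - inter
    return -math.log(inter / union)

perfect = iou_loss((1, 1, 2, 2), (1, 1, 2, 2))   # identical boxes -> loss 0
halved  = iou_loss((1, 1, 2, 2), (1, 1, 1, 1))   # IOU = 0.5 -> loss = log 2
```

Because all four distances enter the loss jointly through the area terms, the box is optimized as a whole rather than coordinate by coordinate.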
In step 1402, the error generated by the network model in performing text-region detection and the error generated in performing the character recognition operation are added with weighting, obtaining the text identification error of the network model.
Specifically, the loss function of the whole network model, i.e. the text identification error of the network model, is expressed as follows:

L_total = L_Detection + ε_recognition·L_recognition

L_Detection is the loss generated by the detection branch, and L_recognition is the loss generated by the identification branch, that is, the error generated by performing the character recognition operation; ε_recognition is the proportion of the identification-branch loss in the total model loss, which controls the contribution of the identification branch to the optimization of the whole model. The loss generated by the detection branch has already been calculated in step 1401, and the loss generated by the identification branch is expressed as follows:

L_recognition = −(1/R) Σ_r log p(y*_r | ρ_r)

where R is the number of regions to be identified, y*_r is the identification label of a region, and ρ_r is the input currently being identified. p(y*_r | ρ_r) is calculated from c*, the character-level annotated sequence, c* = {c_0, ..., c_{L−1}}, where L is the length of the annotated sequence, L ≤ 7439; 7439 is the number of characters in the dictionary, and only characters present in the dictionary can be identified.
It should be noted that the loss function of frame regression in the detection task uses the IOU (intersection over union) loss function, which has the following advantages over the L2 loss: optimizing the four coordinates of a frame as a whole reduces the training difficulty of the model, can improve the detection accuracy and the learning speed of the model, and also strengthens adaptability to the diversity of samples.
The scheme provided by the present invention can support web API (web application interface) service calls and mobile-terminal deployment. As shown in Fig. 15, by using the technical solution provided by the present invention, the specific text content can be recognized directly from the original image and output.
The following are apparatus embodiments of the present invention, which can be used to execute the embodiments of the method for identifying text in an image executed by the above user equipment 110 of the present invention. For details not disclosed in the apparatus embodiments of the present invention, please refer to the embodiments of the method for identifying text in an image of the present invention.
Fig. 16 is a block diagram of an apparatus for identifying text in an image according to an exemplary embodiment. The apparatus for identifying text in an image can be used in the user equipment 110 of the implementation environment shown in Fig. 1 to execute all or part of the steps of the method for identifying text in an image shown in any of Fig. 3, Fig. 6, Fig. 8, Fig. 9, and Figs. 12-14. The apparatus performs end-to-end identification of text in an image through a network model of multiple stacked layers. As shown in Fig. 16, the apparatus includes but is not limited to: a spatial convolution operation module 1610, a global feature extraction module 1630, a pooled feature obtaining module 1650, and a character sequence output module 1670.
The spatial convolution operation module 1610 is used to successively perform the spatially separable convolution operations on the image in a multi-layer mode, fusing the convolution features extracted by the spatially separable convolution operations into the low layers that are mapped layer by layer, the low layers being mapped to the high layers that output the convolution features;
The global feature extraction module 1630 is used to obtain the global features from the bottom layer that performs the spatially separable convolution operations;
The pooled feature obtaining module 1650 is used to perform candidate-region detection of the text in the image and region-screening-parameter prediction through the global features, obtaining the pooled features corresponding to the detected text regions;
The character sequence output module 1670 is used to propagate the pooled features backward to the identification branch network layers that perform the character recognition operation, and output the character sequence labeled in the text region through the identification branch network layers.
The functions of each module in the above apparatus and the implementation process of their effects are detailed in the implementation process of the corresponding steps in the above method for identifying text in an image, and are not described here again.
The spatial convolution operation module 1610 can be, for example, the processor 218 of some physical structure in Fig. 2.
The global feature extraction module 1630, the pooled feature obtaining module 1650, and the character sequence output module 1670 can also be functional modules for executing the corresponding steps in the above method for identifying text in an image. It is understood that these modules can be realized by hardware, software, or a combination of the two. When realized in hardware, these modules may be embodied as one or more hardware modules, such as one or more application-specific integrated circuits. When realized in software, these modules may be embodied as one or more computer programs executed on one or more processors, such as a program stored in the memory 204 and executed by the processor 218 of Fig. 2.
Optionally, as shown in Fig. 17, the pooled feature obtaining module 1650 includes but is not limited to:
a candidate region output unit 1651, used to input the global features to the region regression network layer that performs candidate-region detection, and output the frame candidate regions of the text in the image through the region regression network layer;
a pooling input unit 1652, used to input the frame candidate regions to the pooling layer that performs region screening and region rotation;
a screening rotation unit 1653, used to filter out the text regions from the frame candidate regions and rotate the text regions to a horizontal position according to the pixel-level region screening parameters obtained by the pooling layer through region-screening-parameter prediction on the global features, obtaining the pooled features of the text regions.
Optionally, as shown in Fig. 18, the screening rotation unit 1653 includes but is not limited to:
a confidence obtaining subunit 1801, used to obtain the pixel-level classification confidence generated by the pooling layer through convolutional calculation on the global features, the pixel-level classification confidence referring to the probability that each pixel in the image belongs to a text region;
a candidate region screening subunit 1802, used to filter out the text regions from the frame candidate regions according to the pixel-level classification confidence and the intersection-over-union of the frame candidate regions;
a text region rotation subunit 1803, used to rotate the text regions to a horizontal position through an interpolation algorithm according to the pixel-level rotation angle and pixel-level frame distance generated by the pooling layer through convolutional calculation on the global features, obtaining the pooled features of the text regions.
Optionally, the identification branch network layers include a temporal convolutional network layer and a character classification layer, and the character sequence output module 1670 includes but is not limited to:
a character feature extraction unit, used to propagate the pooled features backward to the temporal convolutional network layer for character-feature extraction;
a character classification unit, used to input the extracted character features to the character classification layer and output the character sequence labeled in the text region through the character classification layer.
Optionally, the apparatus further includes but is not limited to:
a sample set obtaining module, used to obtain the sample image set in which text information is recorded on the images, the content of the text information being known;
a model training module, used to train the network model using the sample image set, adjusting the parameters of the network model so that the difference between the character sequence output by the network model for each sample image and the corresponding text information is minimized.
Optionally, the model training module includes but is not limited to:
a model error obtaining unit, used to obtain the text identification error of the network model according to the error generated by the network model in performing text-region detection and the error generated in performing the character recognition operation;
a model parameter adjustment unit, used to adjust, by back-propagation according to the text identification error, the network layer parameters with which the network model performs the text-region detection and the network layer parameters of the character recognition operation, minimizing the text identification error.
Optionally, the model error obtaining unit includes but is not limited to:
a detection error determination subunit, used to determine the error generated by the network model in performing text-region detection according to the error generated by pixel-level classification prediction, the error generated by pixel-level frame-distance prediction, and the error generated by pixel-level rotation-angle prediction;
an error fusion subunit, used to weight and add the error generated by the network model in performing text-region detection and the error generated by the character recognition operation, obtaining the text identification error of the network model.
Optionally, the present invention further provides an electronic device, which can be used as the user device 110 in the implementation environment shown in Fig. 1 to perform all or part of the steps of the method for recognizing text in an image shown in any of Fig. 3, Fig. 6, Fig. 8, Fig. 9, and Fig. 12 to Fig. 14. The electronic device includes:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for recognizing text in an image described in the above exemplary embodiments.
The specific manner in which the processor of the electronic device performs operations in this embodiment has been described in detail in the embodiments of the method for recognizing text in an image, and will not be elaborated here.
In an exemplary embodiment, a storage medium is further provided. The storage medium is a computer-readable storage medium, for example, a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program, and the computer program can be executed by the processor 218 of the device 200 to complete the above method for recognizing text in an image.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims (15)
1. A method for recognizing text in an image, characterized in that the method performs end-to-end recognition of the text in the image through a network model of stacked layers, the method comprising:
performing spatially separable convolution operations on the image layer by layer, and fusing the convolution features extracted by the spatially separable convolution operations into the lower layers mapped layer by layer, a lower layer being mapped to the higher layer that outputs the convolution features;
obtaining global features from the bottom layer performing the spatially separable convolution operations;
performing candidate region detection and region screening parameter prediction for the text in the image by means of the global features, and correspondingly obtaining pooled features of the detected text region;
propagating the pooled features to a recognition branch network layer that performs a character recognition operation, and outputting, through the recognition branch network layer, the character sequence labeled for the text region.
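The spatially separable convolution in claim 1 factorizes a k×k kernel into a k×1 pass followed by a 1×k pass (the structure of the EffNet reference cited among the non-patent citations). A minimal numpy sketch of that factorization, with illustrative kernel values that are not taken from the patent:

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2-D cross-correlation of a single-channel image x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# A spatially separable 3x3 convolution replaces one 3x3 kernel with a
# 3x1 kernel followed by a 1x3 kernel, cutting multiplications from 9 to 6
# per output pixel.
x = np.random.rand(8, 8)
col = np.array([[1.0], [2.0], [1.0]])     # 3x1 vertical pass (illustrative)
row = np.array([[1.0, 0.0, -1.0]])        # 1x3 horizontal pass (illustrative)
separable = conv2d(conv2d(x, col), row)

# Same result as convolving once with the rank-1 outer-product 3x3 kernel.
full = conv2d(x, col @ row)
```

The two passes reproduce the full 3×3 convolution exactly because the 3×3 kernel here is the rank-1 outer product `col @ row`; the patent's network stacks such layers and fuses their features across layers.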
2. The method according to claim 1, characterized in that performing candidate region detection and region screening parameter prediction for the text in the image by means of the global features, and correspondingly obtaining pooled features of the detected text region, comprises:
inputting the global features into a region regression network layer that performs candidate region detection, and outputting, through the region regression network layer, bounding-box candidate regions of the text in the image;
inputting the bounding-box candidate regions into a pooling layer that performs region screening and region rotation;
screening the text region out of the bounding-box candidate regions according to the pixel-level region screening parameters obtained by the pooling layer through region screening parameter prediction on the global features, rotating the text region to a horizontal position, and obtaining the pooled features of the text region.
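Claim 2's region regression network layer can be read as an anchor-free, per-pixel box regressor over the global features. The sketch below is one plausible realization under stated assumptions: the `score_map`, the four-channel `geo_map` of per-pixel edge distances, and the 0.5 threshold are all illustrative and not fixed by the claim.

```python
import numpy as np

def candidate_boxes(score_map, geo_map, thresh=0.5):
    """Anchor-free candidate generation: every pixel whose text score exceeds
    `thresh` proposes a box given by its four predicted distances
    (top, bottom, left, right) to the box edges."""
    ys, xs = np.where(score_map > thresh)
    boxes = []
    for y, x in zip(ys, xs):
        top, bottom, left, right = geo_map[:, y, x]
        boxes.append((x - left, y - top, x + right, y + bottom, score_map[y, x]))
    return boxes

score = np.zeros((4, 4))
score[1, 2] = 0.9            # one confident "text" pixel
geo = np.ones((4, 4, 4))     # (channels, H, W): unit distance to each edge
boxes = candidate_boxes(score, geo)
```

In the toy run, the single confident pixel at (row 1, column 2) proposes one box; in the patent these candidates are then handed to the pooling layer for screening and rotation.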
3. The method according to claim 2, characterized in that screening the text region out of the bounding-box candidate regions according to the pixel-level region screening parameters obtained by the pooling layer through region screening parameter prediction on the global features, rotating the text region to a horizontal position, and obtaining the pooled features of the text region, comprises:
obtaining the pixel-level classification confidences generated by the pooling layer through convolutional computation on the global features, wherein a pixel-level classification confidence refers to the probability that a pixel in the image belongs to a text region;
screening the text region out of the bounding-box candidate regions according to the pixel-level classification confidences and the intersection-over-union with the bounding-box candidate regions;
rotating the text region to a horizontal position through an interpolation algorithm, according to the pixel-level rotation angles and pixel-level bounding-box distances generated by the pooling layer through convolutional computation on the global features, obtaining the pooled features of the text region.
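Claim 3's two mechanisms, screening by intersection-over-union and rotating the region to horizontal through interpolation, can be sketched as follows. The nearest-neighbour sampling and the (x1, y1, x2, y2) box format are assumptions for illustration; the claim does not fix the interpolation algorithm.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def rotate_crop(feat, center, size, angle):
    """Sample a size[0] x size[1] window around `center`, rotated by `angle`
    (radians), with nearest-neighbour interpolation, so the returned patch
    is axis-aligned (the text region is rotated to horizontal)."""
    h, w = size
    cy, cx = center
    out = np.zeros((h, w))
    cos, sin = np.cos(angle), np.sin(angle)
    for i in range(h):
        for j in range(w):
            dy, dx = i - h / 2.0, j - w / 2.0
            sy = int(round(cy + dy * cos - dx * sin))
            sx = int(round(cx + dy * sin + dx * cos))
            if 0 <= sy < feat.shape[0] and 0 <= sx < feat.shape[1]:
                out[i, j] = feat[sy, sx]
    return out

feat = np.arange(16, dtype=float).reshape(4, 4)
patch = rotate_crop(feat, center=(2, 2), size=(2, 2), angle=0.0)  # angle 0: plain crop
```

A real implementation would typically use bilinear rather than nearest-neighbour sampling, but the structure is the same: predicted per-pixel rotation angles and box distances parameterize the sampling grid.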
4. The method according to claim 1, characterized in that the recognition branch network layer comprises a temporal convolutional network layer and a character classification layer, and propagating the pooled features to the recognition branch network layer that performs the character recognition operation and outputting, through the recognition branch network layer, the character sequence labeled for the text region comprises:
propagating the pooled features to the temporal convolutional network layer to extract character features;
inputting the extracted character features into the character classification layer, and outputting, through the character classification layer, the character sequence labeled for the text region.
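A hedged sketch of claim 4's recognition branch: a temporal (1-D) convolution over the pooled feature sequence, followed by a per-timestep softmax classifier whose argmax yields the character sequence. The shapes, the toy alphabet, and the identity classifier weights are illustrative assumptions.

```python
import numpy as np

def temporal_conv(seq, kernel):
    """Valid-mode 1-D (temporal) convolution along the sequence axis.
    seq: (T, C) pooled feature sequence; kernel: (K, C, C_out)."""
    K, C, C_out = kernel.shape
    T_out = seq.shape[0] - K + 1
    out = np.zeros((T_out, C_out))
    for t in range(T_out):
        # sum over the K taps and C input channels for each output channel
        out[t] = np.einsum('kc,kco->o', seq[t:t + K], kernel)
    return out

def classify(features, weights, alphabet):
    """Character classification layer: per-timestep softmax over the
    alphabet; the argmax at each timestep yields the character sequence."""
    logits = features @ weights                    # (T, len(alphabet))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return ''.join(alphabet[i] for i in probs.argmax(axis=1))

# Toy run: identity weights over a two-character alphabet.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
sequence = classify(feats, np.eye(2), "ab")  # → "aba"
```

A production system would decode with CTC or a similar alignment-free loss rather than a raw per-timestep argmax; the sketch only shows the two layers the claim names.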
5. The method according to claim 1, characterized by further comprising:
obtaining a sample image set of images on which text information is recorded, the content of the text information being known;
training the network model using the sample image set, and adjusting the parameters of the network model so that the difference between the character sequence output by the network model for each sample image and the corresponding text information is minimized.
6. The method according to claim 5, characterized in that training the network model using the sample image set, and adjusting the parameters of the network model so that the difference between the character sequence output by the network model for each sample image and the corresponding text information is minimized, comprises:
obtaining the text recognition error of the network model from the error generated by the network model in performing text region detection and the error generated in performing the character recognition operation;
adjusting, through back-propagation according to the text recognition error, the parameters of the network layers of the network model that perform the text region detection and the parameters of the network layers that perform the character recognition operation, so that the text recognition error is minimized.
7. The method according to claim 6, characterized in that obtaining the text recognition error of the network model from the error generated by the network model in performing text region detection and the error generated in performing the character recognition operation comprises:
determining the error generated by the network model in text region detection from the error of pixel-level classification prediction, the error of pixel-level bounding-box range prediction, and the error of pixel-level rotation angle prediction;
computing a weighted sum of the error generated by the network model in text region detection and the error generated in performing the character recognition operation, obtaining the text recognition error of the network model.
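In its simplest reading, the "weighted addition" of claim 7 reduces to a scalar training objective like the sketch below. The weight values are assumptions for illustration and are not specified by the claims.

```python
def detection_error(cls_err, box_err, angle_err, w_box=1.0, w_angle=10.0):
    """Text-region detection error combining claim 7's three pixel-level
    terms: classification, bounding-box range, and rotation angle.
    The weights w_box and w_angle are illustrative assumptions."""
    return cls_err + w_box * box_err + w_angle * angle_err

def text_recognition_error(det_err, rec_err, w_rec=1.0):
    """Weighted sum of the detection error and the character recognition
    error; this is the scalar that back-propagation then minimizes."""
    return det_err + w_rec * rec_err

total = text_recognition_error(detection_error(0.3, 0.2, 0.01), 0.5)  # ≈ 1.1
```

Training both branches against this single scalar is what makes the model end-to-end: one backward pass updates the detection layers and the recognition layers together.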
8. An apparatus for recognizing text in an image, characterized in that the apparatus performs end-to-end recognition of the text in the image through a network model of stacked layers, the apparatus comprising:
a spatial convolution operation module, configured to perform spatially separable convolution operations on the image layer by layer, and fuse the convolution features extracted by the spatially separable convolution operations into the lower layers mapped layer by layer, a lower layer being mapped to the higher layer that outputs the convolution features;
a global feature extraction module, configured to obtain global features from the bottom layer performing the spatially separable convolution operations;
a pooled feature acquisition module, configured to perform candidate region detection and region screening parameter prediction for the text in the image by means of the global features, correspondingly obtaining pooled features of the detected text region;
a character sequence output module, configured to propagate the pooled features to a recognition branch network layer that performs a character recognition operation, and output, through the recognition branch network layer, the character sequence labeled for the text region.
9. The apparatus according to claim 8, characterized in that the pooled feature acquisition module comprises:
a candidate region output unit, configured to input the global features into a region regression network layer that performs candidate region detection, and output, through the region regression network layer, bounding-box candidate regions of the text in the image;
a pooling input unit, configured to input the bounding-box candidate regions into a pooling layer that performs region screening and region rotation;
a screening and rotation unit, configured to screen the text region out of the bounding-box candidate regions according to the pixel-level region screening parameters obtained by the pooling layer through region screening parameter prediction on the global features, rotate the text region to a horizontal position, and obtain the pooled features of the text region.
10. The apparatus according to claim 9, characterized in that the screening and rotation unit comprises:
a confidence acquisition subunit, configured to obtain the pixel-level classification confidences generated by the pooling layer through convolutional computation on the global features, wherein a pixel-level classification confidence refers to the probability that a pixel in the image belongs to a text region;
a candidate region screening subunit, configured to screen the text region out of the bounding-box candidate regions according to the pixel-level classification confidences and the intersection-over-union with the bounding-box candidate regions;
a text region rotation subunit, configured to rotate the text region to a horizontal position through an interpolation algorithm, according to the pixel-level rotation angles and pixel-level bounding-box distances generated by the pooling layer through convolutional computation on the global features, and obtain the pooled features of the text region.
11. The apparatus according to claim 8, characterized in that the recognition branch network layer comprises a temporal convolutional network layer and a character classification layer, and the character sequence output module comprises:
a character feature extraction unit, configured to propagate the pooled features to the temporal convolutional network layer to extract character features;
a character classification unit, configured to input the extracted character features into the character classification layer and output, through the character classification layer, the character sequence labeled for the text region.
12. The apparatus according to claim 8, characterized in that the apparatus further comprises:
a sample set acquisition module, configured to obtain a sample image set of images on which text information is recorded, the content of the text information being known;
a model training module, configured to train the network model using the sample image set, and adjust the parameters of the network model so that the difference between the character sequence output by the network model for each sample image and the corresponding text information is minimized.
13. The apparatus according to claim 12, characterized in that the model training module comprises:
a model error acquisition unit, configured to obtain the text recognition error of the network model from the error generated by the network model in performing text region detection and the error generated in performing the character recognition operation;
a model parameter adjustment unit, configured to adjust, through back-propagation according to the text recognition error, the parameters of the network layers of the network model that perform the text region detection and the parameters of the network layers that perform the character recognition operation, so that the text recognition error is minimized.
14. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for recognizing text in an image according to any one of claims 1-7.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to complete the method for recognizing text in an image according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202558.2A CN109271967B (en) | 2018-10-16 | 2018-10-16 | Method and device for recognizing text in image, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202558.2A CN109271967B (en) | 2018-10-16 | 2018-10-16 | Method and device for recognizing text in image, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271967A true CN109271967A (en) | 2019-01-25 |
CN109271967B CN109271967B (en) | 2022-08-26 |
Family
ID=65196737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811202558.2A Active CN109271967B (en) | 2018-10-16 | 2018-10-16 | Method and device for recognizing text in image, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271967B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180107892A1 (en) * | 2015-04-20 | 2018-04-19 | 3M Innovative Properties Company | Dual embedded optical character recognition (ocr) engines |
CN107305630A (en) * | 2016-04-25 | 2017-10-31 | 腾讯科技(深圳)有限公司 | Text sequence recognition methods and device |
CN108345850A (en) * | 2018-01-23 | 2018-07-31 | 哈尔滨工业大学 | The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel |
Non-Patent Citations (2)
Title |
---|
HUI LI ET AL.: "Towards end-to-end text spotting with convolutional recurrent neural networks", 2017 IEEE International Conference on Computer Vision (ICCV) |
IDO FREEMAN ET AL.: "EffNet: an efficient structure for convolutional neural networks", Computer Vision and Pattern Recognition |
Cited By (66)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109919014B (en) * | 2019-01-28 | 2023-11-03 | 平安科技(深圳)有限公司 | OCR (optical character recognition) method and electronic equipment thereof |
CN109919014A (en) * | 2019-01-28 | 2019-06-21 | 平安科技(深圳)有限公司 | OCR recognition methods and its electronic equipment |
CN109948469A (en) * | 2019-03-01 | 2019-06-28 | 吉林大学 | The automatic detection recognition method of crusing robot instrument based on deep learning |
CN111723627A (en) * | 2019-03-22 | 2020-09-29 | 北京搜狗科技发展有限公司 | Image processing method and device and electronic equipment |
CN110119681B (en) * | 2019-04-04 | 2023-11-24 | 平安科技(深圳)有限公司 | Text line extraction method and device and electronic equipment |
CN110119681A (en) * | 2019-04-04 | 2019-08-13 | 平安科技(深圳)有限公司 | A kind of line of text extracting method and device, electronic equipment |
CN110059188A (en) * | 2019-04-11 | 2019-07-26 | 四川黑马数码科技有限公司 | A kind of Chinese sentiment analysis method based on two-way time convolutional network |
CN110059188B (en) * | 2019-04-11 | 2022-06-21 | 四川黑马数码科技有限公司 | Chinese emotion analysis method based on bidirectional time convolution network |
CN110210581A (en) * | 2019-04-28 | 2019-09-06 | 平安科技(深圳)有限公司 | A kind of handwritten text recognition methods and device, electronic equipment |
CN110210581B (en) * | 2019-04-28 | 2023-11-24 | 平安科技(深圳)有限公司 | Handwriting text recognition method and device and electronic equipment |
CN110135411B (en) * | 2019-04-30 | 2021-09-10 | 北京邮电大学 | Business card recognition method and device |
CN110135411A (en) * | 2019-04-30 | 2019-08-16 | 北京邮电大学 | Business card identification method and device |
CN110110652A (en) * | 2019-05-05 | 2019-08-09 | 达闼科技(北京)有限公司 | A kind of object detection method, electronic equipment and storage medium |
CN110175610A (en) * | 2019-05-23 | 2019-08-27 | 上海交通大学 | A kind of bill images text recognition method for supporting secret protection |
CN110135424B (en) * | 2019-05-23 | 2021-06-11 | 阳光保险集团股份有限公司 | Inclined text detection model training method and ticket image text detection method |
CN110175610B (en) * | 2019-05-23 | 2023-09-05 | 上海交通大学 | Bill image text recognition method supporting privacy protection |
CN110135424A (en) * | 2019-05-23 | 2019-08-16 | 阳光保险集团股份有限公司 | Tilt text detection model training method and ticket image Method for text detection |
CN110276345A (en) * | 2019-06-05 | 2019-09-24 | 北京字节跳动网络技术有限公司 | Convolutional neural networks model training method, device and computer readable storage medium |
CN110276345B (en) * | 2019-06-05 | 2021-09-17 | 北京字节跳动网络技术有限公司 | Convolutional neural network model training method and device and computer readable storage medium |
CN110232713B (en) * | 2019-06-13 | 2022-09-20 | 腾讯数码(天津)有限公司 | Image target positioning correction method and related equipment |
CN110232713A (en) * | 2019-06-13 | 2019-09-13 | 腾讯数码(天津)有限公司 | A kind of image object positioning correction method and relevant device |
CN110414520A (en) * | 2019-06-28 | 2019-11-05 | 平安科技(深圳)有限公司 | Universal character recognition methods, device, computer equipment and storage medium |
US11210546B2 (en) | 2019-07-05 | 2021-12-28 | Beijing Baidu Netcom Science And Technology Co., Ltd. | End-to-end text recognition method and apparatus, computer device and readable medium |
CN110458011A (en) * | 2019-07-05 | 2019-11-15 | 北京百度网讯科技有限公司 | Character recognition method and device, computer equipment and readable medium end to end |
CN110442860A (en) * | 2019-07-05 | 2019-11-12 | 大连大学 | Name entity recognition method based on time convolutional network |
CN110610175A (en) * | 2019-08-06 | 2019-12-24 | 深圳市华付信息技术有限公司 | OCR data mislabeling cleaning method |
CN112258259A (en) * | 2019-08-14 | 2021-01-22 | 北京京东尚科信息技术有限公司 | Data processing method, device and computer readable storage medium |
CN110533041B (en) * | 2019-09-05 | 2022-07-01 | 重庆邮电大学 | Regression-based multi-scale scene text detection method |
CN110533041A (en) * | 2019-09-05 | 2019-12-03 | 重庆邮电大学 | Multiple dimensioned scene text detection method based on recurrence |
CN110705547A (en) * | 2019-09-06 | 2020-01-17 | 中国平安财产保险股份有限公司 | Method and device for recognizing characters in image and computer readable storage medium |
CN110738203B (en) * | 2019-09-06 | 2024-04-05 | 中国平安财产保险股份有限公司 | Field structured output method, device and computer readable storage medium |
CN110738203A (en) * | 2019-09-06 | 2020-01-31 | 中国平安财产保险股份有限公司 | Method and device for outputting field structuralization and computer readable storage medium |
CN110705547B (en) * | 2019-09-06 | 2023-08-18 | 中国平安财产保险股份有限公司 | Method and device for recognizing text in image and computer readable storage medium |
CN110610166B (en) * | 2019-09-18 | 2022-06-07 | 北京猎户星空科技有限公司 | Text region detection model training method and device, electronic equipment and storage medium |
CN110610166A (en) * | 2019-09-18 | 2019-12-24 | 北京猎户星空科技有限公司 | Text region detection model training method and device, electronic equipment and storage medium |
CN110751146A (en) * | 2019-10-23 | 2020-02-04 | 北京印刷学院 | Text region detection method, text region detection device, electronic terminal and computer-readable storage medium |
CN110807459A (en) * | 2019-10-31 | 2020-02-18 | 深圳市捷顺科技实业股份有限公司 | License plate correction method and device and readable storage medium |
CN110807459B (en) * | 2019-10-31 | 2022-06-17 | 深圳市捷顺科技实业股份有限公司 | License plate correction method and device and readable storage medium |
CN111104941A (en) * | 2019-11-14 | 2020-05-05 | 腾讯科技(深圳)有限公司 | Image direction correcting method and device and electronic equipment |
CN111091123A (en) * | 2019-12-02 | 2020-05-01 | 上海眼控科技股份有限公司 | Text region detection method and equipment |
CN111104934A (en) * | 2019-12-22 | 2020-05-05 | 上海眼控科技股份有限公司 | Engine label detection method, electronic device and computer readable storage medium |
CN113128306A (en) * | 2020-01-10 | 2021-07-16 | 北京字节跳动网络技术有限公司 | Vertical text line recognition method, device, equipment and computer readable storage medium |
CN111259773A (en) * | 2020-01-13 | 2020-06-09 | 中国科学院重庆绿色智能技术研究院 | Irregular text line identification method and system based on bidirectional decoding |
CN111462095A (en) * | 2020-04-03 | 2020-07-28 | 上海帆声图像科技有限公司 | Parameter automatic adjusting method for industrial flaw image detection |
CN111462095B (en) * | 2020-04-03 | 2024-04-09 | 上海帆声图像科技有限公司 | Automatic parameter adjusting method for industrial flaw image detection |
CN111488883A (en) * | 2020-04-14 | 2020-08-04 | 上海眼控科技股份有限公司 | Vehicle frame number identification method and device, computer equipment and storage medium |
CN111598087A (en) * | 2020-05-15 | 2020-08-28 | 润联软件系统(深圳)有限公司 | Irregular character recognition method and device, computer equipment and storage medium |
CN111598087B (en) * | 2020-05-15 | 2023-05-23 | 华润数字科技有限公司 | Irregular character recognition method, device, computer equipment and storage medium |
CN113762259A (en) * | 2020-09-02 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Text positioning method, text positioning device, computer system and readable storage medium |
CN112798949A (en) * | 2020-10-22 | 2021-05-14 | 国家电网有限公司 | Pumped storage unit generator temperature early warning method and system |
CN112101360A (en) * | 2020-11-17 | 2020-12-18 | 浙江大华技术股份有限公司 | Target detection method and device and computer readable storage medium |
CN112508015A (en) * | 2020-12-15 | 2021-03-16 | 山东大学 | Nameplate identification method, computer equipment and storage medium |
CN112580637B (en) * | 2020-12-31 | 2023-05-12 | 苏宁金融科技(南京)有限公司 | Text information identification method, text information extraction method, text information identification device, text information extraction device and text information extraction system |
CN112580637A (en) * | 2020-12-31 | 2021-03-30 | 苏宁金融科技(南京)有限公司 | Text information identification method, text information extraction method, text information identification device, text information extraction device and text information identification system |
CN113076815B (en) * | 2021-03-16 | 2022-09-27 | 西南交通大学 | Automatic driving direction prediction method based on lightweight neural network |
CN113076815A (en) * | 2021-03-16 | 2021-07-06 | 西南交通大学 | Automatic driving direction prediction method based on lightweight neural network |
CN113052159A (en) * | 2021-04-14 | 2021-06-29 | 中国移动通信集团陕西有限公司 | Image identification method, device, equipment and computer storage medium |
CN113052159B (en) * | 2021-04-14 | 2024-06-07 | 中国移动通信集团陕西有限公司 | Image recognition method, device, equipment and computer storage medium |
CN113537189A (en) * | 2021-06-03 | 2021-10-22 | 深圳市雄帝科技股份有限公司 | Handwritten character recognition method, device, equipment and storage medium |
WO2023005253A1 (en) * | 2021-07-28 | 2023-02-02 | 北京百度网讯科技有限公司 | Method, apparatus and system for training text recognition model framework |
CN113591864A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method, device and system for text recognition model framework |
CN114049648A (en) * | 2021-11-25 | 2022-02-15 | 清华大学 | Engineering drawing text detection and identification method, device and system |
CN114049648B (en) * | 2021-11-25 | 2024-06-11 | 清华大学 | Engineering drawing text detection and recognition method, device and system |
CN114842464A (en) * | 2022-05-13 | 2022-08-02 | 北京百度网讯科技有限公司 | Image direction recognition method, device, equipment, storage medium and program product |
CN115205861B (en) * | 2022-08-17 | 2023-03-31 | 北京睿企信息科技有限公司 | Method for acquiring abnormal character recognition area, electronic equipment and storage medium |
CN115205861A (en) * | 2022-08-17 | 2022-10-18 | 北京睿企信息科技有限公司 | Method for acquiring abnormal character recognition area, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109271967B (en) | 2022-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271967A (en) | The recognition methods of text and device, electronic equipment, storage medium in image | |
US10636169B2 (en) | Synthesizing training data for broad area geospatial object detection | |
CN106127204B (en) | A kind of multi-direction meter reading Region detection algorithms of full convolutional neural networks | |
CN103959330B (en) | System and method for matching visual object component | |
CN107273502B (en) | Image geographic labeling method based on spatial cognitive learning | |
CN103578119B (en) | Target detection method in Codebook dynamic scene based on superpixels | |
EP1418509B1 (en) | Method using image recomposition to improve scene classification | |
CN110503154A (en) | Method, system, electronic equipment and the storage medium of image classification | |
CN109359696A (en) | A kind of vehicle money recognition methods, system and storage medium | |
CN109271991A (en) | A kind of detection method of license plate based on deep learning | |
CN110516671A (en) | Training method, image detecting method and the device of neural network model | |
CN106815604A (en) | Method for viewing points detecting based on fusion of multi-layer information | |
CN109711399A (en) | Shop recognition methods based on image, device, electronic equipment | |
CN110210581A (en) | A kind of handwritten text recognition methods and device, electronic equipment | |
CN109858547A (en) | A kind of object detection method and device based on BSSD | |
CN109002752A (en) | A kind of complicated common scene rapid pedestrian detection method based on deep learning | |
CN106296734B (en) | Method for tracking target based on extreme learning machine and boosting Multiple Kernel Learnings | |
Zhao et al. | Multiscale object detection in high-resolution remote sensing images via rotation invariant deep features driven by channel attention | |
CN109214245A (en) | A kind of method for tracking target, device, equipment and computer readable storage medium | |
CN108509567B (en) | Method and device for building digital culture content library | |
CN106991397A (en) | View-based access control model conspicuousness constrains the remote sensing images detection method of depth confidence network | |
CN114332586A (en) | Small target detection method and device, equipment, medium and product thereof | |
CN110517270A (en) | A kind of indoor scene semantic segmentation method based on super-pixel depth network | |
US11200650B1 (en) | Dynamic image re-timing | |
CN115984226A (en) | Insulator defect detection method, device, medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||