CN109271967B - Method and device for recognizing text in image, electronic equipment and storage medium - Google Patents


Info

Publication number
CN109271967B
CN109271967B
Authority
CN
China
Prior art keywords
text
layer
region
character
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811202558.2A
Other languages
Chinese (zh)
Other versions
CN109271967A (en)
Inventor
刘铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811202558.2A
Publication of CN109271967A
Application granted
Publication of CN109271967B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method and a device for recognizing text in an image, an electronic device and a computer-readable storage medium. The scheme performs end-to-end recognition of text in an image through a multi-layer superposed network model, and comprises the following steps: performing a spatially separable convolution operation on the image layer by layer in a multi-layer manner, and fusing the convolution features extracted by the spatially separable convolution operation to the lower layer mapped through layer-by-layer superposition, the lower layer being mapped to the higher layer that outputs the convolution features; obtaining global features from the bottom layer that performs the spatially separable convolution operation; performing candidate region detection and region screening parameter prediction for text in the image through the global features, and obtaining pooled features corresponding to the detected text regions; and propagating the pooled features backward to a recognition branch network layer that performs a character recognition operation, and outputting the character sequence marked by the text region through the recognition branch network layer. The scheme saves model training time and improves recognition accuracy.

Description

Method and device for recognizing text in image, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for recognizing text in an image, an electronic device, and a computer-readable storage medium.
Background
In the field of computer image processing, text recognition refers to having a computer automatically determine which character in a pre-established character library each character in an image corresponds to; the character library is built in advance and usually contains the characters most commonly used in real life.
Recognition of text in an image is usually realized by building two models. One model is used to find the position of the text in a natural-scene image containing text, after which the text region is cut out of the image. The other model is used to recognize the specific character content of the text region. Specifically, a large number of sample images containing different characters are obtained as a training set, and the sample images are used to train a character classifier and a text locator separately. After training is finished, the text locator first locates a text region in the image under test, the text region is then cut out, and the character classifier then recognizes the character content of the text region.
In this scheme, the sample images must be used to train the character classifier and the text locator separately, so the workload of model training is large; moreover, the final character recognition accuracy is affected by the accuracy of both models, which limits the improvement of text recognition accuracy in images.
Disclosure of Invention
The invention provides a method for recognizing text in an image, which aims to solve the problems in the related art that the character classifier and the text locator must be trained separately, the workload of model training is large, and the recognition accuracy is low.
The invention provides a method for recognizing texts in images, which executes end-to-end recognition of texts in images through a multilayer superposed network model, and comprises the following steps:
performing spatial separable convolution operation of the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
obtaining global features from a bottom layer performing a spatially separable convolution operation;
performing candidate region detection and region screening parameter prediction of a text in the image through the global features to obtain pooling features corresponding to the detected text regions;
and backward propagating the pooled features to a recognition branch network layer which executes character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
In another aspect, the present invention provides an apparatus for recognizing a text in an image, the apparatus performing end-to-end recognition of the text in the image through a network model in which a plurality of layers are stacked, the apparatus comprising:
the spatial convolution operation module is used for carrying out spatial separable convolution operation on the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
a global feature extraction module for obtaining global features from a bottom layer performing a spatially separable convolution operation;
the pooling feature obtaining module is used for detecting candidate regions of texts in the images and predicting region screening parameters through the global features to obtain pooling features corresponding to the detected text regions;
and the character sequence output module is used for backward propagating the pooled features to a recognition branch network layer for executing character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
In another aspect, the present invention further provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform a method for recognizing text in the image.
In addition, the invention also provides a computer readable storage medium, which stores a computer program, and the computer program can be executed by a processor to complete the method for recognizing the text in the image.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
according to the technical scheme provided by the invention, the end-to-end recognition of the text in the image is executed through the network models which are stacked in a multilayer manner, so that the recognition of the text in the image can be realized only by training one network model without separately training a text positioner and a character classifier, the workload of model training is reduced, the accuracy of final recognition is only influenced by the accuracy of one network model, the improvement of the recognition accuracy can be facilitated, and the situation that the improvement of the recognition accuracy is mutually limited by the two models is avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment according to the present invention;
FIG. 2 is a block diagram illustrating an apparatus in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method for recognition of text in an image according to an exemplary embodiment;
FIG. 4 is a schematic diagram of a network architecture of spatially separable convolutional network layers;
FIG. 5 is a schematic diagram of a network architecture for recognizing characters in an image according to the present invention;
FIG. 6 is a detailed flowchart of step 350 in a corresponding embodiment of FIG. 3;
FIG. 7 is a schematic diagram of the principle of the pooling layer extracting pixel-level region screening parameters from global features;
FIG. 8 is a flowchart showing details of step 353 in the corresponding embodiment of FIG. 6;
FIG. 9 is a detailed flowchart of step 370 in a corresponding embodiment of FIG. 3;
FIG. 10 is a block diagram illustrating the architecture of the identified branch network layer;
FIG. 11 is a schematic diagram of a network architecture of a method for recognizing text in an image according to the present invention;
FIG. 12 is a flowchart of a method for recognizing text in an image according to another embodiment based on the corresponding embodiment in FIG. 3;
FIG. 13 is a flowchart detailing step 1230 in a corresponding embodiment of FIG. 12;
FIG. 14 is a detailed flowchart of step 1231 in a corresponding embodiment of FIG. 13;
FIG. 15 is a schematic diagram of an effect of the present invention in practical application;
FIG. 16 is a block diagram illustrating an apparatus for recognition of text in an image according to an exemplary embodiment;
FIG. 17 is a block diagram of details of a pooled feature acquisition module in a corresponding embodiment of FIG. 16;
FIG. 18 is a detailed block diagram of the screening and rotating unit in the corresponding embodiment of FIG. 17.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention. The implementation environment includes user equipment 110, which may perform recognition of text in an image by running an application. The user equipment may be a server, a desktop computer, a mobile terminal, an intelligent appliance, etc.
The user device 110 may include an image capturing device 111 such as a camera, and further perform text recognition on an image captured by the image capturing device 111 by using the method provided by the present invention.
Depending on requirements, the implementation environment may further include a server 130 in addition to the user equipment 110. The server 130 is connected to the user equipment 110 through a wired or wireless network and sends the image to be recognized to the user equipment 110, and the user equipment 110 recognizes the text in the image.
In practical application, the text content recognized from the image can be further subjected to text translation, text content editing, storage and the like. The method for recognizing the text in the image can be applied to a text recognition task in any scene, and realizes understanding of the content of the text in the image, such as recognition of characters in natural scene character pictures, advertisement pictures, videos, identity cards, driving licenses, business cards and license plates.
Fig. 2 is a block diagram illustrating an apparatus 200 according to an example embodiment. The apparatus 200 may be, for example, the user equipment 110 in the implementation environment shown in fig. 1.
Referring to fig. 2, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations, among others. The processing components 202 may include one or more processors 218 to execute instructions to perform all or a portion of the steps of the methods described below. Further, the processing component 202 can include one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the apparatus 200. The Memory 204 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk. Also stored in memory 204 are one or more modules configured to be executed by the one or more processors 218 to perform all or a portion of the steps of any of the methods of fig. 3, 6, 8, 9, 12-14, described below.
The power supply component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 200.
The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. The screen may further include an Organic Light Emitting Display (OLED for short).
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 may include a Microphone (MIC) configured to receive external audio signals when the device 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The sensor component 214 includes one or more sensors for providing various aspects of status assessment for the device 200. For example, the sensor assembly 214 may detect the open/closed state of the device 200 and the relative positioning of components; it may also detect a change in position of the device 200 or a component of the device 200, and a change in temperature of the device 200. In some embodiments, the sensor assembly 214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors or other electronic components for performing the methods described below.
FIG. 3 is a flow diagram illustrating a method for recognition of text in an image according to an exemplary embodiment. The method may be executed by a user device, which may be the user equipment 110 of the implementation environment shown in fig. 1. The method performs end-to-end recognition of text in an image through a multi-layer superposed network model. End-to-end recognition means that the input of the network model is the original image data and the output is the final character sequence. As shown in fig. 3, the method specifically includes the following steps.
In step 310, spatially separable convolution operations are performed on the image layer by layer in a multi-layer manner, and the convolution features extracted by the spatially separable convolution operations are fused to the lower layer mapped through layer-by-layer superposition, the lower layer being mapped to the higher layer that outputs the convolution features.
It should be noted that the network model of the multi-layer stack may include a spatially separable convolutional network layer, a regional regression network layer, a pooling layer, a temporal convolutional network layer, and a character classification layer. The spatial separable convolution network layer, the regional regression network layer and the pooling layer are used as detection branches for extracting pooling characteristics of text regions in the images according to original image data, and the temporal convolution network layer and the character classification layer are used as identification branches for outputting character sequences of the text regions according to the pooling characteristics of the text regions.
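As a structural illustration of how these layers are wired together, the following sketch, assuming PyTorch, composes the two branches described above; all module names and constructor arguments here are hypothetical placeholders rather than the patent's actual layers.

```python
import torch
import torch.nn as nn

class EndToEndTextSpotter(nn.Module):
    def __init__(self, backbone, region_regression, roi_pooling, temporal_conv, char_classifier):
        super().__init__()
        # Detection branch: spatially separable convolutions -> region regression -> pooling.
        self.backbone = backbone
        self.region_regression = region_regression
        self.roi_pooling = roi_pooling          # screens and rotates text regions
        # Recognition branch: temporal convolution -> character classification.
        self.temporal_conv = temporal_conv
        self.char_classifier = char_classifier

    def forward(self, images):
        global_features = self.backbone(images)                        # steps 310 and 330
        candidate_boxes = self.region_regression(global_features)      # step 351
        pooled_features = self.roi_pooling(global_features, candidate_boxes)  # step 353
        char_features = self.temporal_conv(pooled_features)            # step 371
        return self.char_classifier(char_features)                     # step 372
```

Each placeholder sub-module would be replaced by the concrete layers described in the following steps.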
Specifically, the spatially separable convolution operation refers to performing convolution calculation on the image to be recognized layer by layer in a multi-layer manner using spatially separable convolution (Effnet) layers. The spatially separable convolutional layers comprise mapped higher and lower layers; "higher" and "lower" are relative concepts: of two mapped layers, the one computed first is the higher layer and the one computed later is the lower layer. Fusing the convolution features extracted by the higher-layer convolution calculation to the lower layer mapped through layer-by-layer superposition means that the convolution result of the lower layer is combined with the convolution result of the higher layer. Because more detail is lost as the number of convolution layers increases, fusing the convolution features extracted at the higher layer into the lower layer retains more detail and avoids information loss.
In step 330, global features are obtained from the lowest layer where the spatially separable convolution operation is performed.
The bottom layer is the last output layer of the spatially separable convolutional layers. The spatially separable convolutional layers perform the spatially separable convolution operation on the original image to be recognized layer by layer in a multi-layer manner, and the finally output feature matrix is called the global feature. The global feature may be used to characterize the feature information of the original input image.
Fig. 4 is a schematic diagram of a network architecture of a spatially separable convolutional layer, and as shown in fig. 4, an original image to be recognized is used as an input of the spatially separable convolutional layer, and then convolutional calculation is performed layer by layer, features extracted from a high layer are fused to a lower layer of a mapping, and global features are output at the lowest layer of the spatially separable convolutional layer. Wherein each parallelogram represents the convolution feature extracted for each layer.
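The following minimal sketch, assuming PyTorch, illustrates the two operations described above: factoring a 3 × 3 convolution into a 1 × 3 convolution followed by a 3 × 1 convolution (the spatially separable convolution), and fusing the features of one layer into the layer it maps to by resizing and adding. The channel counts, activations and pooling step are illustrative assumptions, not the patent's exact Effnet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConvBlock(nn.Module):
    """One spatially separable convolution step: 1x3 followed by 3x1 instead of a full 3x3."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_1x3 = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))
        self.pool = nn.MaxPool2d(2)   # stand-in for the layer-by-layer downsampling

    def forward(self, x):
        x = F.relu(self.conv_1x3(x))
        x = F.relu(self.conv_3x1(x))
        return self.pool(x)

def fuse(features_from_one_layer, features_of_mapped_layer):
    """Fuse convolution features into the layer they map to by resizing and adding."""
    resized = F.interpolate(features_from_one_layer,
                            size=features_of_mapped_layer.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Assumes matching channel counts; a 1x1 convolution would align them otherwise.
    return features_of_mapped_layer + resized
```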
In step 350, candidate region detection and region screening parameter prediction of the text in the image are performed through the global features, and pooling features corresponding to the detected text regions are obtained.
The candidate area detection means detecting a candidate area where a text in an image is located according to the global features, and the number of the candidate areas may be multiple. The prediction of the region screening parameters refers to obtaining predicted values of the region screening parameters according to the global features, and the candidate regions can be screened according to the predicted values, so that the detection precision of the text regions in the image is improved. The pooling characteristic of the text region refers to characteristic data of the text region output by the pooling layer, and in one embodiment, the pooling characteristic of the text region may be image data after the text region is leveled, and the leveling refers to rotating the tilted text region to a horizontal position.
Specifically, the global features output by the spatial separable convolutional network layer may be input into the regional regression network layer and the pooling layer, respectively, and candidate regions of the text in the image are detected by the regional regression network layer, so as to output candidate regions of text borders, which are simply referred to as border candidate regions. The global features are subjected to convolution transformation through the pooling layer to realize prediction of region screening parameters, then the pooling layer screens frame candidate regions according to the region screening parameters, text regions in the image can be detected, and then the inclined text regions are rotated to obtain image data of the horizontal text regions to serve as the pooling features of the text regions.
In step 370, the pooled features are propagated back to a recognition branch network layer performing character recognition operations, through which the character sequence of the text region label is output.
The recognition branch network layer comprises the last few layers of the multi-layer superposed network model and is used to recognize the characters contained in the text region according to the pooled features of the text region. Specifically, the recognition branch network layer comprises the temporal convolutional network layer and the character classification layer of the network model: the pooling layer transmits the pooled features of the text region to the temporal convolutional network layer, convolution calculation is performed on the pooled features through the temporal convolutional network layer to extract character sequence features, the character sequence features are transmitted to the character classification layer, and the character classification layer outputs the probability that each character belongs to each character in the dictionary.
For example, assuming that the dictionary contains 7439 words, the character classification layer may output, for each character in the text region, the probability that it belongs to each word in the dictionary; the word with the highest probability is the recognition result of that character. By outputting the recognition result of each character in the text region in this way, the character sequence marked by the text region is obtained.
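A toy sketch of this per-character argmax decoding, using plain NumPy and a five-word stand-in for the 7439-word dictionary (the dictionary contents and probabilities here are invented for illustration):

```python
import numpy as np

dictionary = ["的", "一", "是", "在", "不"]          # stand-in for the 7439-word dictionary
# probs[i, j] = probability that character position i in the text region is dictionary word j
probs = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                  [0.05, 0.10, 0.60, 0.15, 0.10]])
recognized = "".join(dictionary[j] for j in probs.argmax(axis=1))
print(recognized)  # the character sequence marked by the text region
```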
According to the technical scheme provided by this exemplary embodiment of the invention, end-to-end recognition of text in an image is performed through a multi-layer superposed network model, so that text in an image can be recognized by training only one network model, without separately training a text locator and a character classifier. This reduces the workload of model training; moreover, the final recognition accuracy is affected by the accuracy of only one network model, which facilitates improving the recognition accuracy and avoids the situation in which the two models mutually limit its improvement.
With respect to the technical solutions provided by the above exemplary embodiments of the present invention, fig. 5 is a flowchart of a text recognition scheme. As shown in fig. 5, this text recognition scheme divides text detection and recognition into two tasks, and the recognition task can be performed only after the detection task is completed. Specifically, during detection, the original image is first input into a feature extraction convolutional network, and the extracted features are then transmitted to a regional regression network, which outputs the detected frame candidate regions. These regions are rough, however, and further frame regression is needed to improve the accuracy of the frame so that it lies closer to the edge of the characters; secondary frame regression and classification give the coordinates of the character frames in the image and the corresponding confidence, i.e. the possibility of containing characters. The two prediction results are compared with the position of the character label in the image, the prediction loss is then calculated through a loss function, and the parameter update of the model is adjusted according to the loss.
When oblique characters are detected, a large blank area exists above the frame candidate area detected by the regional regression network, which reduces the precision of the detection frame. The frame candidate area output by the regional regression network and the global features extracted by the feature extraction convolutional network are therefore input together into the rotational region-of-interest pooling layer to obtain the detected oblique character area. As shown in fig. 5, the inclined text area is marked in the original image in the form of a text box, and the corresponding area is then cut out of the original image according to the coordinates of the text box, completing the positioning of the area where the text is located. It should be noted that at this stage there is already a positioning error in the region where the text is located.
The region image containing the cut-out characters is then input into a recognition network. The recognition network first extracts the convolution features of the input region image and provides the extracted convolution features to a character classification layer, which recognizes the character sequence represented by the input; when all character regions in the original image have been recognized, the character recognition task for the original image is complete. It should be noted that, at this stage, the difference between the character sequence output by the character classification layer and the actual character sequence needs to be calculated through another loss function, and the parameter update of the recognition network and the character classification layer is adjusted according to this difference. That is, since there is an error in character recognition at this stage, the final overall recognition error includes both the error in character region positioning and the error in character recognition.
It should be noted that if the character region positioning error is large, the improvement of the overall recognition accuracy is limited even if the accuracy of character recognition is improved. Training region detection and character recognition separately is not conducive to performance improvement: the error generated in the recognition stage cannot be propagated to the detection part to correct the parameters of the detection model, so a bottleneck in detection or recognition performance may arise on some training sets. Moreover, training the detection model and the recognition model separately increases the workload of model training. The feature extraction convolutional network also extracts features slowly, which limits the number of tasks the whole system can process per unit time and is not conducive to deploying the model on a mobile terminal.
The invention realizes end-to-end recognition of text in an image through a multi-layer superposed network model: the input is the original image and the output is the character sequence, so the accuracy of the final character recognition is determined by the error of only one model. Combining the two tasks in one model avoids the performance bottleneck caused by separate training and benefits the recognition accuracy. Because only one network model is trained to realize text recognition, the time for training the model is greatly reduced: at least half of the time is saved compared with training two models separately, and in practice the time for tuning parameters is reduced by a factor of 4 to 5, because the parameter settings of the two models differ. In addition, the invention uses the Effnet network architecture to perform the spatially separable convolution operation on the image and fuses the convolution features extracted by the spatially separable convolution operation to the lower layer mapped through layer-by-layer superposition, which not only accelerates the global feature extraction stage, but also overcomes the drawback of existing accelerated network structures that model precision must be sacrificed for speed, reduces the storage space required to run the model, and facilitates deployment and application on mobile terminals.
In an exemplary embodiment, as shown in fig. 6, the step 350 specifically includes:
in step 351, inputting the global features into a regional regression network layer for performing candidate region detection, and outputting a frame candidate region of the text in the image through the regional regression network layer;
it should be explained that the present invention performs end-to-end recognition of text in an image through a network model with multiple layers superimposed, and the local regression network layers are several layers of the network model and are used for detecting the regions where the text may be located, that is, performing candidate region detection.
Specifically, global features are extracted from the original image through the spatially separable convolutional network layer of the network model, the global features are input into the regional regression network layer, and the frame candidate regions of the text in the image are output through the regional regression network layer. A frame candidate region is a region that the edge of the text may enclose. In the training stage, the frame candidate regions of the text can be output through the regional regression network layer, secondary frame regression and classification are performed on the frame candidate regions to obtain the detected candidate frames and their confidence (the possibility of containing characters), the multi-task loss is calculated according to the position coordinates of the actual text frame, and the loss is minimized by adjusting the parameters of the regional regression network layer. The regional regression network layer may be Faster R-CNN (a fast object-detection convolutional neural network), whose main contribution is a network architecture for extracting candidate regions that replaces time-consuming selective search and greatly improves detection speed.
In step 352, the bounding box candidate region is input into a pooling layer that performs region screening and region rotation;
the pooling layer is connected with the space separable convolutional network layer and is used for performing region screening and region rotation on the frame candidate region according to the global features output by the space separable convolutional network layer. The region screening refers to screening out a region where an accurate text is located from a plurality of frame candidate regions, and the region rotation refers to rotating an inclined text region to a horizontal position. Therefore, the frame candidate region output by the regional regression network layer and the global feature output by the spatial separable convolution network layer are jointly input into the pooling layer.
In step 353, according to the pixel level region screening parameter obtained by predicting the region screening parameter of the global feature by the pooling layer, screening the text region from the frame candidate region and rotating the text region to a horizontal position, so as to obtain the pooling feature of the text region.
The pixel-level region screening parameters are parameters, predicted from the global features, for screening and rotating the frame candidate regions. The pixel-level region screening parameters may include the pixel-level classification confidence, the pixel-level rotation angle, and the pixel-level border distance. The text region is the region where text is located. The pooling layer can perform convolution transformations on the global features through multiple convolution kernels to obtain the pixel-level region screening parameters, then screen the text regions out of the multiple frame candidate regions according to the pixel-level region screening parameters, and further rotate inclined text regions to the horizontal position to obtain the pooled features of the text regions.
As shown in fig. 7, the global feature is transformed by a first convolution kernel, and a pixel-level classification confidence, i.e., a probability that each pixel in the original image belongs to text, is output. And the global features are transformed by a second convolution kernel, and the pixel-level frame distance, namely the predicted distance from each pixel point to the upper, lower, left and right sides of the text frame where the pixel point is located, is output. And the global features are transformed by a third convolution kernel, and a pixel-level rotation angle, namely the angle required to rotate when each pixel point rotates to the horizontal position, is output.
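A sketch, assuming PyTorch, of these three 1 × 1 convolution heads over the global features; the channel counts follow the text (1 confidence channel, 4 border distances, 1 rotation angle), while the activations and the module name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScreeningParamHeads(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.confidence = nn.Conv2d(feat_ch, 1, kernel_size=1)  # probability each pixel belongs to text
        self.distances = nn.Conv2d(feat_ch, 4, kernel_size=1)   # distances to top/bottom/left/right of the border
        self.angle = nn.Conv2d(feat_ch, 1, kernel_size=1)        # rotation needed to reach the horizontal position

    def forward(self, global_features):
        score = torch.sigmoid(self.confidence(global_features))
        dists = torch.relu(self.distances(global_features))      # distances are non-negative
        theta = self.angle(global_features)
        return score, dists, theta
```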
In an exemplary embodiment, as shown in fig. 8, the step 353 specifically includes:
in step 3531, a pixel-level classification confidence degree generated by the pooling layer performing convolution calculation on the global features is obtained, where the pixel-level classification confidence degree refers to a probability that each pixel in the image belongs to a text region;
specifically, the pooling layer may perform convolution calculation on the global feature (feature image) by a convolution kernel with a size of 1 × 1 and a step size of 1, and output a confidence prediction result that each pixel belongs to the text, thereby obtaining a pixel-level classification confidence. The high confidence pixel point indicates that the pixel point has a high probability of belonging to the text region, and similarly, the low confidence pixel point indicates that the pixel point has a low probability of belonging to the text region.
In step 3532, the text region is filtered out of the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
the intersection proportion of the frame candidate regions refers to the overlapping proportion of different frame candidate regions. Because the noise frame exists in the frame candidate region, the method performs non-maximum suppression on the detection result of the frame candidate region according to the pixel level classification confidence coefficient and the intersection proportion of the frame candidate region, thereby screening the text region from the frame candidate region and improving the accuracy of text region detection.
Specifically, a non-maximum suppression algorithm is used: according to the pixel-level classification confidences, frame candidate regions with high confidence are retained, frame candidate regions that do not overlap are retained, and overlapping frame candidate regions with a low intersection ratio are retained, so that the text regions are obtained by screening from all the frame candidate regions.
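The following simplified sketch of non-maximum suppression keeps candidates from the highest confidence downward and drops any candidate whose intersection ratio with an already-kept candidate exceeds a threshold. For clarity it uses plain axis-aligned boxes in NumPy, whereas the patent's candidates are oriented text borders.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # keep only candidates with a low intersection ratio
    return keep
```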
In step 3533, the text region is rotated to a horizontal position by an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance generated by convolution calculation of the global features by the pooling layer, so as to obtain the pooled features of the text region.
It should be noted that, when the pooling layer obtains the pixel-level classification confidence, it may simultaneously perform convolution calculation on the global features to obtain the pixel-level rotation angle and the pixel-level border distance. As explained above, the pixel-level rotation angle is the angle through which each pixel point needs to be rotated to reach the horizontal position, and the pixel-level border distance is the predicted distance from each pixel point to the upper, lower, left and right sides of the text border in which it is located. Specifically, the pooling layer may perform convolution calculation on the global features through a convolution kernel with a size of 1 × 1 and a stride of 4, and output the distances from each pixel point to the upper, lower, left and right sides of the text border in which it is located. The pooling layer may likewise perform convolution calculation on the global features through a convolution kernel with a size of 1 × 1 and a stride of 4, and output the angle through which each pixel point must be rotated to reach the horizontal position.
Therefore, the pooling layer can rotate the inclined text region to the horizontal direction according to the pixel point rotation angle and the pixel level frame distance, and the pooling feature of the text region can be image data of the text region after the text region is rotated to the horizontal direction.
Specifically, rotating the detected text region to the horizontal position requires interpolation through the pooling layer, which converts the text region at its original angle to the horizontal position so that it can be used by the recognition branch. The interpolation determines the correspondence between original points and target points through a transformation matrix T, whose parameters are calculated as follows:
v_ratio = roi_h / (t + b)
where v_ratio represents the ratio of the height roi_h of the transformed text region map to the sum of the distances from the current point to the upper and lower boundaries of the predicted text region; roi_h is a preset known quantity.
roi_w = v_ratio × (l + r)
where roi_w represents the width of the transformed text region map.
d_x = l × cos(π_i) − t × sin(π_i) − x
d_y = l × cos(π_i) + t × sin(π_i) − y
where r, l, t, b are the distances, predicted by the detection branch, from the current pixel point to the right, left, upper and lower boundaries of the text border (i.e. the pixel-level border distances), π_i represents the tilt angle of the current pixel predicted by the detection branch (i.e. the pixel-level rotation angle), and (x, y) is the coordinate position of the current pixel point in the original image. Assuming that the point before transformation is Psrc(x_s, y_s) and the point after transformation is Pdst(x_d, y_d), the feature mapping position before transformation is left-multiplied by the transformation matrix T to obtain the transformed feature mapping position, which completes the coordinate interpolation and realizes the horizontal rotation of the text region.
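As an illustration of the same leveling idea outside the network, the sketch below uses OpenCV to rotate a tilted region to the horizontal position and crop it out. In the patent this is done inside the pooling layer by interpolation with the matrix T, so the function below is only an analogy; the centre, angle and output size would come from the pixel-level predictions.

```python
import cv2

def rotate_region_to_horizontal(image, center, angle_deg, out_w, out_h):
    """Rotate `image` about `center` so the tilted region becomes horizontal, then crop it.

    `angle_deg` is the region's predicted tilt (sign depends on the chosen angle convention)."""
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)               # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    x = max(int(center[0] - out_w / 2), 0)
    y = max(int(center[1] - out_h / 2), 0)
    return rotated[y:y + out_h, x:x + out_w]
```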
It should be emphasized that, unlike existing character recognition methods, in which character recognition is completed by passing the detection result output by a detection model to a recognition model, in the present invention detection serves as one learning branch of the model and is responsible for optimizing the feature map (i.e. the pooled features) that is finally fed to the recognition branch. By means of numerical sampling, the detection result (i.e. the detected text region) is converted, within the same model, into a feature map that the recognition branch can use directly, so that the detection and recognition tasks are learned and trained simultaneously.
In an exemplary embodiment, the identification branch network layer in the step 370 includes a time convolution network layer and a character classification layer, as shown in fig. 9, the step 370 specifically includes:
in step 371, backward propagating the pooled features to the temporal convolution network layer for extracting character features;
the backward propagation refers to transmitting the pooled features output by the pooled layer to a time convolution network layer, and performing convolution transformation on the pooled features through the time convolution network layer to extract character sequence features. Unlike the existing CTC (connection temporal classification based on neural networks) or Attention network structure, the present invention uses TCN (time convolutional network) as a part of identifying branch network layer, which has the following advantages: the training and testing time of the network is greatly shortened because large-scale parallel can be carried out in the TCN; the TCN can flexibly adjust the magnitude of the receptive field by determining how many convolution layers are stacked, so that the long-term and short-term memory length of the model can be better controlled in an explicit mode, and the CTC or Attention recognition model cannot control the long-term and short-term memory length due to the fact that the internal cycle times of the model cannot be estimated in the CTC or Attention recognition model; the propagation direction of the TCN is different from the time direction of the input sequence, so that the problem of gradient explosion or disappearance frequently caused by RNN model training is solved; the TCN consumes lower memory, is more obviously represented on a long input sequence, and reduces the deployment and application expenses of the model.
In step 372, the extracted character features are input into the character classification layer, and the character sequence of the text region mark is output through the character classification layer.
The character features are character sequence features, the extracted character sequence features are input into the character classification layer, the probability that each character in the text region belongs to each character in the dictionary can be output, the character with the maximum probability in the dictionary is found out, the character is the recognition result of the character in the text region, and therefore the character sequence marked in the text region is obtained.
FIG. 10 is a schematic diagram of the structure of the recognition branch network layer. As shown in fig. 10, the pooled features output by the pooling layer undergo 4 temporal convolution operations, and the input of each convolutional layer is subjected to dilated causal convolution, weight normalization, activation-function transformation and random dropout to obtain the output of the current convolutional layer. The filter size k of the first convolution operation is 3 and the dilation factor d of the convolution kernel is 1; the filter size k of the second convolution operation is 3 and the dilation factor d is 1; the filter size k of the third convolution operation is 3 and the dilation factor d is 2; the filter size k of the fourth convolution operation is 1 and the dilation factor d is 4. Thereafter, character sequence features, i.e. the features of each character, are extracted through a bidirectional LSTM (long short-term memory network). A bidirectional LSTM is preferred over a unidirectional LSTM because it can use information from both past and future time steps, making the final prediction more accurate. The output of the bidirectional LSTM may be a 512-dimensional feature vector, and the output features are then classified by a CTC decoder of the character classification layer into 7439 classes. The 7439 classes indicate that there are 7439 characters in the dictionary, so that the output features can be classified into one of those 7439 characters.
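A sketch of this recognition-branch structure, assuming PyTorch: stacked dilated causal 1-D convolutions with weight normalization, activation and dropout using the (k, d) values from the text, followed by a bidirectional LSTM with a 512-dimensional output and a linear layer over the 7439-class dictionary. The channel width, dropout rate and use of ReLU are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, ch, k, d, p_drop=0.1):
        super().__init__()
        self.pad = (k - 1) * d                      # left padding keeps the convolution causal
        self.conv = nn.utils.weight_norm(nn.Conv1d(ch, ch, k, dilation=d))
        self.act = nn.ReLU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                           # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))
        return self.drop(self.act(self.conv(x)))

class RecognitionBranch(nn.Module):
    def __init__(self, ch=256, num_classes=7439):
        super().__init__()
        # (k, d) pairs taken from the description: (3, 1), (3, 1), (3, 2), (1, 4)
        self.tcn = nn.Sequential(*[CausalConvBlock(ch, k, d) for k, d in [(3, 1), (3, 1), (3, 2), (1, 4)]])
        self.lstm = nn.LSTM(ch, 256, bidirectional=True, batch_first=True)  # 2 x 256 = 512-dim output
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, pooled):                      # pooled: (batch, channels, time)
        seq = self.tcn(pooled).transpose(1, 2)      # -> (batch, time, channels)
        seq, _ = self.lstm(seq)
        return self.classifier(seq)                 # per-step scores over the dictionary
```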
Fig. 11 is a schematic diagram of a network model architecture for text recognition in an image according to the present invention, and as shown in fig. 11, an original image is first input into a spatial separable convolutional network layer, global features are extracted from the original image through the spatial separable convolutional network layer, and then the global features are respectively input into a regional regression network layer and a pooling layer, and the regional regression network layer detects a frame candidate region according to the global features. In the training stage, the detected candidate frame and the confidence coefficient of the candidate frame can be obtained through secondary frame regression and frame classification, the multitask loss is calculated according to the position of the text frame, and the parameter of the regional regression network layer is adjusted to minimize the multitask loss. The frame candidate region output by the regional regression network layer is input into the pooling layer, and the pooling layer can perform screening and leveling on the frame candidate region according to the global feature input by the spatial separable convolutional network layer and the frame candidate region input by the regional regression network layer to obtain a leveled text region feature, namely a pooling feature. And inputting the horizontal text region characteristics into the time convolution network layer, extracting character sequence characteristics, inputting the character sequence characteristics into a character classifier, and outputting a character recognition result of the text in the image.
In an exemplary embodiment, as shown in fig. 12, the method provided by the present invention further includes:
in step 1210, a sample image set in which text information is recorded on an image is obtained, wherein the content of the text information is known;
the sample image set comprises a large number of image samples, the image samples are marked with text information, and the specific content of the text information is known. The sample image set may be stored in a local storage medium of the user device 110 or may be stored in the server 130.
In step 1230, the set of sample images is used to train the network model, and the parameters of the network model are adjusted to minimize the difference between the character sequence of each sample image output by the network model and the corresponding text information.
Specifically, the sample image set can be used as a training set to train a network model required for text recognition in the image. Specifically, the sample image set may be used as an input of the network model, and parameters of the network model may be adjusted according to an output of the network model, so as to minimize a difference between a character sequence recognition result of the sample image set output by the network model and known text information. For example, the similarity may be maximized by calculating the similarity between the character sequence recognition result and the known text information.
In an exemplary embodiment, as shown in fig. 13, the step 1230 specifically includes:
in step 1231, obtaining a text recognition error of the network model according to an error generated by text region detection performed by the network model and an error generated by executing a character recognition operation;
the network model is divided into two tasks of text region detection and character recognition operation. The text recognition error of the network model refers to the recognition error of the whole framework of the network model. The error may be the sum of an error generated by text region detection and an error generated by character recognition. The error generated by the text region detection may be an error existing in the detected text region before the output of the pooled feature, and the error generated by the character recognition operation may be an error generated by performing classification recognition on the characters in the text region after the output of the pooled feature.
In step 1232, according to the text recognition error, the network layer parameters for the network model to perform the text region detection and the network layer parameters for performing the character recognition operation are adjusted by back propagation, so that the text recognition error is minimized.
Back propagation here means adjusting the parameters of the earlier network layers according to the later recognition result. Specifically, according to the recognition error of the whole network model framework, i.e. the error of the finally output character sequence, the network layer parameters of the preceding text region detection task and the network layer parameters for performing the character recognition operation are adjusted, so that the error between the finally output character sequence and the real character sequence is minimized. In this way, the error generated in the recognition stage can be transmitted to the detection part to correct the parameters of the detection stage.
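A minimal training-step sketch, assuming PyTorch, of this joint back-propagation: one combined text-recognition error, formed from the detection error and the recognition error, is back-propagated once, so gradients from the recognition stage also update the detection layers. The model, the two loss functions and the weighting value are hypothetical placeholders.

```python
import torch

def train_step(model, optimizer, images, region_labels, text_labels,
               detection_loss_fn, recognition_loss_fn, eps_recognition=1.0):
    optimizer.zero_grad()
    # model is assumed here to return both branches' outputs.
    detections, char_scores = model(images)
    loss_det = detection_loss_fn(detections, region_labels)      # error from text region detection
    loss_rec = recognition_loss_fn(char_scores, text_labels)     # error from character recognition
    total = loss_det + eps_recognition * loss_rec                # L_total = L_Detection + eps * L_recognition
    total.backward()                                             # gradients reach both branches
    optimizer.step()
    return total.item()
```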
In an exemplary embodiment, as shown in fig. 14, the step 1231 specifically includes:
in step 1401, determining an error generated by the network model for text region detection according to an error generated by the network model for pixel-level classification prediction, an error generated by pixel-level border distance prediction, and an error generated by pixel-level rotation angle prediction;
the error generated by the pixel-level classification prediction refers to an error between the pixel-level classification confidence and a classification result of an actual pixel point belonging to a text region. The error generated by the pixel-level frame early warning prediction refers to the error between the actual distance and the prediction distance between the upper part, the lower part, the left part and the right part of the text frame where each pixel point is located, and the pixel-level rotation angle prediction refers to the error between the actual rotation angle and the prediction rotation angle when the pixel points rotate to the horizontal position.
Specifically, the error generated by the text region detection performed by the network model is represented as L_Detection:

L_Detection = L_cls + α·L_geo_reg

where L_Detection is the total loss function of the detection branch (text region detection); L_cls is the loss function of the pixel-level classification confidence in the detection branch, i.e. the error generated by the pixel-level classification prediction; L_geo_reg is the loss function of the pixel-level geometry, covering the distances from each pixel point to the upper, lower, left and right borders of the text box and the rotation angle, i.e. the error between the predicted and actual border distances and rotation angles; and α is the proportion of L_geo_reg in the total loss of the detection branch.

In L_cls, N is the number of positive-valued elements in the confidence-map prediction matrix, u_i* marks whether the current pixel is a character (taking the value 0 or 1), and u_i is the predicted value of whether the current pixel is a character (taking the value 0 or 1).

In L_geo_reg, N is likewise the number of positive-valued elements in the confidence-map prediction matrix, θ_i represents the predicted pixel-level rotation angle, θ_i* represents the labelled pixel-level rotation angle, and β represents the proportion of the angle loss in L_geo_reg. The border term is the IOU loss between the four predicted geometric quantities B_i (the distances to the upper, lower, left and right boundaries of the text box) and the four labelled geometric quantities B_i* (the distances to the upper, lower, left and right boundaries of the text box); the IOU loss is defined in terms of B_i ∩ B_i*, the intersection of the two text boxes, and B_i ∪ B_i*, their union. (The detailed expressions of L_cls, the angle loss and the IOU loss appear as formula drawings in the original specification.)
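Because the exact loss expressions are only given as drawings in the original filing, the following Python sketch is one plausible EAST-style instantiation rather than the patented formulas: a classification term over the confidence map, an IoU term computed from the four border distances, and a cosine angle term, combined with the weights α and β. All function and tensor names are assumptions.

    import torch

    def detection_loss(cls_pred, cls_gt, geo_pred, geo_gt, theta_pred, theta_gt,
                       pos_mask, alpha=1.0, beta=10.0):
        """One possible form of L_Detection = L_cls + alpha * L_geo_reg (illustrative only)."""
        n = pos_mask.sum().clamp(min=1.0)

        # L_cls: binary cross-entropy over the confidence map, normalized by the
        # number of positive-valued elements (assumed form; cls_pred in [0, 1]).
        l_cls = torch.nn.functional.binary_cross_entropy(cls_pred, cls_gt, reduction="sum") / n

        # IoU loss from the four border distances (top, bottom, left, right) per pixel.
        t_p, b_p, l_p, r_p = geo_pred.unbind(dim=1)
        t_g, b_g, l_g, r_g = geo_gt.unbind(dim=1)
        area_pred = (t_p + b_p) * (l_p + r_p)
        area_gt = (t_g + b_g) * (l_g + r_g)
        h_inter = torch.min(t_p, t_g) + torch.min(b_p, b_g)
        w_inter = torch.min(l_p, l_g) + torch.min(r_p, r_g)
        inter = h_inter * w_inter                       # |B_i ∩ B_i*|
        union = area_pred + area_gt - inter             # |B_i ∪ B_i*|
        l_iou = -torch.log((inter + 1.0) / (union + 1.0))

        # Angle loss between predicted and labelled rotation angles (assumed cosine form).
        l_angle = 1.0 - torch.cos(theta_pred - theta_gt)

        l_geo_reg = ((l_iou + beta * l_angle) * pos_mask).sum() / n
        return l_cls + alpha * l_geo_reg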
In step 1402, the error generated by text region detection performed by the network model and the error generated by performing character recognition operation are added in a weighted manner to obtain the text recognition error of the network model.
Specifically, the loss function of the entire network model, i.e., the text recognition error of the network model, is expressed as follows:

L_total = L_Detection + ε_recognition·L_recognition

where L_Detection is the loss generated by the detection branch, L_recognition is the loss generated by the recognition branch, i.e. the error generated by performing the character recognition operation, and ε_recognition is the proportion of the recognition-branch loss in the total loss of the model, which controls the contribution of the recognition branch to the optimization of the whole model. The loss generated by the detection branch has already been calculated in step 1401. The loss generated by the recognition branch is accumulated over the R regions to be recognized, based on the recognition label of each region and the input of the currently recognized region ρ. Here c* is the character-level label sequence, c* = {c_0, ..., c_{L-1}}, where L is the length of the label sequence and is at most 7439, 7439 being the number of characters in the dictionary; only characters in the dictionary can be recognized. (The detailed expressions of L_recognition and of the per-region sequence probability appear as formula drawings in the original specification.)
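The recognition-branch loss is likewise only shown as formula drawings; the sketch below uses a CTC sequence loss as one common way to score a predicted character sequence against the character-level label sequence c* over a fixed dictionary. The dictionary size handling, the blank index and the choice of CTC itself are assumptions for illustration.

    import torch

    DICT_SIZE = 7439                      # number of characters in the dictionary (per the description)
    ctc = torch.nn.CTCLoss(blank=DICT_SIZE, zero_infinity=True)  # extra class used as CTC blank (assumption)

    def recognition_loss(log_probs_per_region, labels_per_region):
        """Average sequence loss over the R regions to be recognized (illustrative only).

        log_probs_per_region: list of tensors of shape (T, DICT_SIZE + 1) with log-softmax scores
        labels_per_region:    list of 1-D tensors of character indices c* = {c_0, ..., c_{L-1}}
        """
        losses = []
        for log_probs, labels in zip(log_probs_per_region, labels_per_region):
            t = log_probs.size(0)
            losses.append(ctc(log_probs.unsqueeze(1),              # (T, 1, C) as expected by CTCLoss
                              labels.unsqueeze(0),                 # (1, L)
                              input_lengths=torch.tensor([t]),
                              target_lengths=torch.tensor([labels.numel()])))
        return torch.stack(losses).mean()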
It should be noted that, in the detection task, the loss function of the bounding box regression adopts an IOU (Intersection over Union) loss function, which has the following advantages over an L2 loss: the four coordinates of the bounding box are learned and optimized as a whole, which reduces the training difficulty of the model, improves the detection accuracy and learning speed of the model, and at the same time enhances adaptability to diverse samples.
The solution provided by the invention can support web API (application programming interface) service calls and mobile terminal deployment, and as shown in fig. 15, by adopting the technical solution provided by the invention, the specific character content can be recognized directly from the original image and output.
The following is an embodiment of an apparatus of the present invention, which may be used to execute an embodiment of a method for recognizing a text in an image executed by the user equipment 110 according to the present invention. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method for recognizing text in an image of the present invention.
Fig. 16 is a block diagram illustrating an apparatus for recognizing text in an image according to an exemplary embodiment, which may be used in the user equipment 110 in the implementation environment shown in fig. 1 to perform all or part of the steps of the method for recognizing text in an image shown in any one of fig. 3, 6, 8, 9, 12-14. The device performs end-to-end recognition of text in an image through a network model with multiple layers of superposition, as shown in fig. 16, and the device includes but is not limited to: a spatial convolution operation module 1610, a global feature extraction module 1630, a pooled feature obtaining module 1650 and a character sequence output module 1670.
A spatial convolution operation module 1610, configured to perform spatial separable convolution operation on an image layer by layer in a multi-layer manner, and fuse convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, where the lower layer is mapped to a higher layer outputting the convolution features;
a global feature extraction module 1630 configured to obtain global features from the lowest layer that performs the spatially separable convolution operation;
a pooling feature obtaining module 1650, configured to perform candidate region detection and region screening parameter prediction on a text in the image according to the global feature, so as to obtain a pooling feature corresponding to the detected text region;
a character sequence output module 1670, configured to propagate the pooled features backward to a recognition branch network layer that performs a character recognition operation, and output the character sequence of the text region tag through the recognition branch network layer.
The implementation processes of the functions and actions of the modules in the device are specifically described in the implementation processes of the corresponding steps in the method for recognizing the text in the image, and are not described herein again.
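As a rough orientation only, the following PyTorch-style sketch mirrors how the four modules could be chained in code; the class names, shapes and layer choices are illustrative assumptions rather than the patented structure.

    import torch

    class TextSpotter(torch.nn.Module):
        """Illustrative end-to-end pipeline: backbone -> global features -> pooled regions -> characters."""

        def __init__(self, backbone, region_head, pooling, recognition_branch):
            super().__init__()
            self.backbone = backbone                      # spatial convolution operation module
            self.region_head = region_head                # candidate region detection / screening parameters
            self.pooling = pooling                        # region screening + rotation to horizontal
            self.recognition_branch = recognition_branch  # temporal convolution + character classification

        def forward(self, image):
            global_features = self.backbone(image)        # global feature extraction module
            candidates, screen_params = self.region_head(global_features)
            pooled = self.pooling(global_features, candidates, screen_params)  # pooled feature obtaining module
            return self.recognition_branch(pooled)        # character sequence output module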
The spatial convolution operation module 1610 may be, for example, one of the physical structure processors 218 in fig. 2.
The global feature extraction module 1630, the pooled feature obtaining module 1650 and the character sequence output module 1670 may also be functional modules, and are configured to execute corresponding steps in the method for recognizing text in an image. It is understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as the programs stored in memory 204 and executed by processor 218 of FIG. 2.
Optionally, as shown in fig. 17, the pooled feature obtaining module 1650 includes, but is not limited to:
a candidate region output unit 1651 configured to input the global feature to a regional regression network layer that performs candidate region detection, and output a bounding box candidate region of a text in the image through the regional regression network layer;
a pooling input unit 1652 for inputting the bounding box candidate region into a pooling layer where region filtering and region rotation are performed;
a screening rotation unit 1653, configured to screen out the text region from the frame candidate region according to the pixel-level region screening parameter obtained by the pooling layer performing region screening parameter prediction on the global feature, and rotate the text region to a horizontal position, so as to obtain the pooled feature of the text region.
Optionally, as shown in fig. 18, the screening rotation unit 1653 includes, but is not limited to:
a confidence obtaining subunit 1801, configured to obtain a pixel-level classification confidence that is generated by performing convolution calculation on the global feature by the pooling layer, where the pixel-level classification confidence is a probability that each pixel in the image belongs to a text region;
a candidate region screening subunit 1802, configured to screen out the text region from the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
a text region rotation subunit 1803, configured to rotate the text region to a horizontal position through an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance generated by performing convolution calculation on the global feature by the pooling layer, so as to obtain a pooling feature of the text region.
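To make the screening and rotation step concrete, the following OpenCV/NumPy sketch keeps only candidate boxes whose average pixel-level classification confidence is high enough, and then rotates each kept region to the horizontal position before cropping it; the confidence threshold, the box format and the use of cv2.warpAffine for the interpolation are assumptions for illustration, not the patented implementation.

    import cv2
    import numpy as np

    def screen_and_rotate(feature_map, candidates, confidence_map, angle_map, threshold=0.7):
        """candidates: list of (x, y, w, h) boxes on the feature map (illustrative format)."""
        pooled_regions = []
        for (x, y, w, h) in candidates:
            region_conf = confidence_map[y:y + h, x:x + w].mean()
            if region_conf < threshold:          # screen out low-confidence candidates
                continue
            angle = float(np.degrees(angle_map[y:y + h, x:x + w].mean()))
            center = (x + w / 2.0, y + h / 2.0)
            rot = cv2.getRotationMatrix2D(center, angle, 1.0)
            # Rotate the map about the region centre so the region becomes horizontal (bilinear interpolation),
            # then crop the now-horizontal region as its pooled feature.
            rotated = cv2.warpAffine(feature_map, rot,
                                     (feature_map.shape[1], feature_map.shape[0]),
                                     flags=cv2.INTER_LINEAR)
            pooled_regions.append(rotated[y:y + h, x:x + w])
        return pooled_regions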
Optionally, the recognition branch network layer includes a time convolution network layer and a character classification layer, and the character sequence output module 1670 includes but is not limited to:
the character feature extraction unit is used for backward propagating the pooled features to the time convolution network layer to extract character features;
and the character classification unit is used for inputting the extracted character features into the character classification layer and outputting the character sequence of the text region mark through the character classification layer.
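The sketch below gives a minimal recognition branch in PyTorch, assuming the pooled region feature is treated as a sequence along its width, passed through 1-D (temporal) convolutions for character feature extraction and then through a linear character classification layer; the kernel sizes, channel counts and dictionary size are assumptions for illustration.

    import torch

    class RecognitionBranch(torch.nn.Module):
        def __init__(self, in_channels=256, hidden=256, num_classes=7440):  # 7439 characters + 1 blank (assumption)
            super().__init__()
            # Time convolution network layer: 1-D convolutions over the width (time) axis.
            self.temporal_conv = torch.nn.Sequential(
                torch.nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
                torch.nn.ReLU(),
                torch.nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                torch.nn.ReLU(),
            )
            # Character classification layer: per-step scores over the character dictionary.
            self.classifier = torch.nn.Linear(hidden, num_classes)

        def forward(self, pooled):                 # pooled: (N, C, H, W) feature of a horizontal text region
            seq = pooled.mean(dim=2)               # collapse the height axis -> (N, C, W)
            seq = self.temporal_conv(seq)          # character features along the width axis
            seq = seq.permute(0, 2, 1)             # (N, W, hidden)
            return self.classifier(seq).log_softmax(dim=-1)  # (N, W, num_classes) character sequence scores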
Optionally, the apparatus further includes but is not limited to:
the system comprises a sample set acquisition module, a sample image acquisition module and a text information acquisition module, wherein the sample image set is used for acquiring a sample image set recorded with text information on an image, and the content of the text information is known;
and the model training module is used for training the network model by utilizing the sample image set and enabling the difference between the character sequence of each sample image output by the network model and the corresponding text information to be minimum by adjusting the parameters of the network model.
Optionally, the model training module includes but is not limited to:
a model error obtaining unit, configured to obtain a text recognition error of the network model according to an error generated by performing text region detection on the network model and an error generated by performing a character recognition operation;
and the model parameter adjusting unit is used for adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error so as to minimize the text recognition error.
Optionally, the model error obtaining unit includes but is not limited to:
the detection error determining subunit is used for determining an error generated by the network model for text region detection according to an error generated by pixel-level classification prediction of the network model, an error generated by pixel-level frame distance prediction and an error generated by pixel-level rotation angle prediction;
and the error fusion subunit is used for carrying out weighted addition on the error generated by text region detection of the network model and the error generated by executing character recognition operation to obtain the text recognition error of the network model.
Optionally, the present invention further provides an electronic device, which may be used in the user equipment 110 in the implementation environment shown in fig. 1 to execute all or part of the steps of the method for recognizing text in an image shown in any one of fig. 3, fig. 6, fig. 8, fig. 9, fig. 12 to fig. 14. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method of recognizing text in an image according to the above exemplary embodiment.
The specific manner in which the processor of the electronic device performs the operations in this embodiment has been described in detail in the embodiment related to the method for recognizing text in the image, and will not be elaborated upon here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program executable by the processor 218 of the apparatus 200 to perform the above-described method of recognizing text in an image.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (15)

1. A method for recognizing text in an image, the method performing end-to-end recognition of text in an image through a network model with multiple layers superimposed, the method comprising:
performing spatial separable convolution operation of the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
obtaining global features from a bottom layer performing a spatially separable convolution operation;
performing candidate region detection and region screening parameter prediction of a text in the image through the global features to obtain pooling features corresponding to the detected text regions;
and backward propagating the pooled features to a recognition branch network layer which executes character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
2. The method according to claim 1, wherein the candidate region detection and region screening parameter prediction of the text in the image are performed through the global feature, and obtaining the pooled feature corresponding to the detected text region comprises:
inputting the global features into a regional regression network layer for executing candidate region detection, and outputting frame candidate regions of texts in the images through the regional regression network layer;
inputting the frame candidate area into a pooling layer for performing area screening and area rotation;
and according to pixel level region screening parameters obtained by predicting region screening parameters of the global features by the pooling layer, screening the text region from the frame candidate region and rotating the text region to a horizontal position to obtain the pooling features of the text region.
3. The method according to claim 2, wherein the obtaining the pooled feature of the text region by filtering out the text region from the frame candidate region and rotating the text region to a horizontal position according to a pixel-level region filtering parameter obtained by predicting a region filtering parameter of the global feature by the pooling layer comprises:
obtaining a pixel-level classification confidence coefficient generated by the pooling layer through convolution calculation of the global features, wherein the pixel-level classification confidence coefficient is the probability of each pixel in the image belonging to a text region;
screening out the text region from the frame candidate region according to the pixel-level classification confidence coefficient and the intersection proportion of the frame candidate region;
and according to the pixel level rotation angle and the pixel level frame distance generated by performing convolution calculation on the global feature by the pooling layer, rotating the text region to a horizontal position by an interpolation algorithm to obtain the pooling feature of the text region.
4. The method of claim 1, wherein the recognition branching network layer comprises a temporal convolution network layer and a character classification layer, wherein the back-propagating the pooled features to the recognition branching network layer performing a character recognition operation, wherein outputting the sequence of characters of the text region label through the recognition branching network layer comprises:
backward propagating the pooled features to the time convolution network layer to extract character features;
inputting the extracted character features into the character classification layer, and outputting the character sequence of the text region marks through the character classification layer.
5. The method of claim 1, further comprising:
acquiring a sample image set recorded with text information on an image, wherein the content of the text information is known;
and training the network model by using the sample image set, and minimizing the difference between the character sequence of each sample image output by the network model and the corresponding text information by adjusting the parameters of the network model.
6. The method of claim 5, wherein the training of the network model using the sample image set to minimize the difference between the character sequence of each sample image output by the network model and the corresponding text information by adjusting parameters of the network model comprises:
acquiring a text recognition error of the network model according to an error generated by text region detection of the network model and an error generated by executing character recognition operation;
and adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error, so that the text recognition error is minimized.
7. The method of claim 6, wherein the obtaining the text recognition error of the network model from the error generated by text region detection according to the network model and the error generated by performing a character recognition operation comprises:
determining an error generated by the network model for text region detection according to an error generated by pixel-level classification prediction of the network model, an error generated by pixel-level frame distance prediction and an error generated by pixel-level rotation angle prediction;
and carrying out weighted addition on an error generated by text region detection of the network model and an error generated by executing character recognition operation to obtain a text recognition error of the network model.
8. An apparatus for recognizing a text in an image, the apparatus performing an end-to-end recognition of the text in the image through a network model in which a plurality of layers are stacked, the apparatus comprising:
the spatial convolution operation module is used for carrying out spatial separable convolution operation on the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
a global feature extraction module for obtaining global features from a lowest layer performing a spatially separable convolution operation;
the pooling feature obtaining module is used for detecting candidate regions of texts in the images and predicting region screening parameters through the global features to obtain pooling features corresponding to the detected text regions;
and the character sequence output module is used for backward propagating the pooled features to a recognition branch network layer for executing character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
9. The apparatus of claim 8, wherein the pooled feature acquisition module comprises:
the candidate region output unit is used for inputting the global features into a regional regression network layer for executing candidate region detection and outputting frame candidate regions of texts in the images through the regional regression network layer;
a pooling input unit for inputting the bounding box candidate region into a pooling layer for performing region screening and region rotation;
and the screening rotation unit is used for screening the text region from the frame candidate region and rotating the text region to a horizontal position according to a pixel level region screening parameter obtained by predicting a region screening parameter of the global feature by the pooling layer, so as to obtain the pooling feature of the text region.
10. The apparatus of claim 9, wherein the screening rotation unit comprises:
a confidence obtaining subunit, configured to obtain a pixel-level classification confidence generated by performing convolution calculation on the global features by the pooling layer, where the pixel-level classification confidence is a probability that each pixel in the image belongs to a text region;
a candidate region screening subunit, configured to screen out the text region from the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
and the text region rotation subunit is configured to rotate the text region to a horizontal position through an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance, which are generated by performing convolution calculation on the global feature by the pooling layer, so as to obtain the pooled feature of the text region.
11. The apparatus of claim 8, wherein the recognition branching network layer comprises a time convolution network layer and a character classification layer, and wherein the character sequence output module comprises:
the character feature extraction unit is used for backward propagating the pooled features to the time convolution network layer to extract character features;
and the character classification unit is used for inputting the extracted character features into the character classification layer and outputting the character sequence of the text region mark through the character classification layer.
12. The apparatus of claim 8, further comprising:
a sample set acquisition module, used for acquiring a sample image set recorded with text information on an image, wherein the content of the text information is known;
and the model training module is used for training the network model by utilizing the sample image set and enabling the difference between the character sequence of each sample image output by the network model and the corresponding text information to be minimum by adjusting the parameters of the network model.
13. The apparatus of claim 12, wherein the model training module comprises:
a model error obtaining unit, configured to obtain a text recognition error of the network model according to an error generated by text region detection performed by the network model and an error generated by performing a character recognition operation;
and the model parameter adjusting unit is used for adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error so as to minimize the text recognition error.
14. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of recognizing text in an image according to any one of claims 1 to 7.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is executable by a processor to perform the method for recognizing text in an image according to any one of claims 1 to 7.
CN201811202558.2A 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium Active CN109271967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811202558.2A CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811202558.2A CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109271967A CN109271967A (en) 2019-01-25
CN109271967B true CN109271967B (en) 2022-08-26

Family

ID=65196737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811202558.2A Active CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109271967B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919014B (en) * 2019-01-28 2023-11-03 平安科技(深圳)有限公司 OCR (optical character recognition) method and electronic equipment thereof
CN109948469B (en) * 2019-03-01 2022-11-29 吉林大学 Automatic inspection robot instrument detection and identification method based on deep learning
CN111723627A (en) * 2019-03-22 2020-09-29 北京搜狗科技发展有限公司 Image processing method and device and electronic equipment
CN110119681B (en) * 2019-04-04 2023-11-24 平安科技(深圳)有限公司 Text line extraction method and device and electronic equipment
CN110059188B (en) * 2019-04-11 2022-06-21 四川黑马数码科技有限公司 Chinese emotion analysis method based on bidirectional time convolution network
CN110210581B (en) * 2019-04-28 2023-11-24 平安科技(深圳)有限公司 Handwriting text recognition method and device and electronic equipment
CN110135411B (en) * 2019-04-30 2021-09-10 北京邮电大学 Business card recognition method and device
CN110110652B (en) * 2019-05-05 2021-10-22 达闼科技(北京)有限公司 Target detection method, electronic device and storage medium
CN110175610B (en) * 2019-05-23 2023-09-05 上海交通大学 Bill image text recognition method supporting privacy protection
CN110135424B (en) * 2019-05-23 2021-06-11 阳光保险集团股份有限公司 Inclined text detection model training method and ticket image text detection method
CN110276345B (en) * 2019-06-05 2021-09-17 北京字节跳动网络技术有限公司 Convolutional neural network model training method and device and computer readable storage medium
CN110232713B (en) * 2019-06-13 2022-09-20 腾讯数码(天津)有限公司 Image target positioning correction method and related equipment
CN110414520A (en) * 2019-06-28 2019-11-05 平安科技(深圳)有限公司 Universal character recognition methods, device, computer equipment and storage medium
CN110458011A (en) * 2019-07-05 2019-11-15 北京百度网讯科技有限公司 Character recognition method and device, computer equipment and readable medium end to end
CN110442860A (en) * 2019-07-05 2019-11-12 大连大学 Name entity recognition method based on time convolutional network
CN110610175A (en) * 2019-08-06 2019-12-24 深圳市华付信息技术有限公司 OCR data mislabeling cleaning method
CN112258259A (en) * 2019-08-14 2021-01-22 北京京东尚科信息技术有限公司 Data processing method, device and computer readable storage medium
CN110533041B (en) * 2019-09-05 2022-07-01 重庆邮电大学 Regression-based multi-scale scene text detection method
CN110705547B (en) * 2019-09-06 2023-08-18 中国平安财产保险股份有限公司 Method and device for recognizing text in image and computer readable storage medium
CN110738203B (en) * 2019-09-06 2024-04-05 中国平安财产保险股份有限公司 Field structured output method, device and computer readable storage medium
CN110610166B (en) * 2019-09-18 2022-06-07 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110751146B (en) * 2019-10-23 2023-06-20 北京印刷学院 Text region detection method, device, electronic terminal and computer readable storage medium
CN110807459B (en) * 2019-10-31 2022-06-17 深圳市捷顺科技实业股份有限公司 License plate correction method and device and readable storage medium
CN111104941B (en) * 2019-11-14 2023-06-13 腾讯科技(深圳)有限公司 Image direction correction method and device and electronic equipment
CN111091123A (en) * 2019-12-02 2020-05-01 上海眼控科技股份有限公司 Text region detection method and equipment
CN111104934A (en) * 2019-12-22 2020-05-05 上海眼控科技股份有限公司 Engine label detection method, electronic device and computer readable storage medium
CN113128306A (en) * 2020-01-10 2021-07-16 北京字节跳动网络技术有限公司 Vertical text line recognition method, device, equipment and computer readable storage medium
CN111259773A (en) * 2020-01-13 2020-06-09 中国科学院重庆绿色智能技术研究院 Irregular text line identification method and system based on bidirectional decoding
CN111462095B (en) * 2020-04-03 2024-04-09 上海帆声图像科技有限公司 Automatic parameter adjusting method for industrial flaw image detection
CN111488883A (en) * 2020-04-14 2020-08-04 上海眼控科技股份有限公司 Vehicle frame number identification method and device, computer equipment and storage medium
CN111598087B (en) * 2020-05-15 2023-05-23 华润数字科技有限公司 Irregular character recognition method, device, computer equipment and storage medium
CN113762259A (en) * 2020-09-02 2021-12-07 北京沃东天骏信息技术有限公司 Text positioning method, text positioning device, computer system and readable storage medium
CN112798949A (en) * 2020-10-22 2021-05-14 国家电网有限公司 Pumped storage unit generator temperature early warning method and system
CN112101360B (en) * 2020-11-17 2021-04-27 浙江大华技术股份有限公司 Target detection method and device and computer readable storage medium
CN112508015A (en) * 2020-12-15 2021-03-16 山东大学 Nameplate identification method, computer equipment and storage medium
CN112580637B (en) * 2020-12-31 2023-05-12 苏宁金融科技(南京)有限公司 Text information identification method, text information extraction method, text information identification device, text information extraction device and text information extraction system
CN113076815B (en) * 2021-03-16 2022-09-27 西南交通大学 Automatic driving direction prediction method based on lightweight neural network
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN113591864B (en) * 2021-07-28 2023-04-07 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
CN114842464A (en) * 2022-05-13 2022-08-02 北京百度网讯科技有限公司 Image direction recognition method, device, equipment, storage medium and program product
CN115205861B (en) * 2022-08-17 2023-03-31 北京睿企信息科技有限公司 Method for acquiring abnormal character recognition area, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305630A (en) * 2016-04-25 2017-10-31 腾讯科技(深圳)有限公司 Text sequence recognition methods and device
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016171901A1 (en) * 2015-04-20 2016-10-27 3M Innovative Properties Company Dual embedded optical character recognition (ocr) engines

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305630A (en) * 2016-04-25 2017-10-31 腾讯科技(深圳)有限公司 Text sequence recognition methods and device
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《EffNet: AN EFFICIENT STRUCTURE FOR CONVOLUTIONAL NEURAL NETWORKS》;Ido Freeman,et al;《Computer Vision and Pattern Recognition》;20180605;1-7 *
《Towards end-to-end text spotting with convolutional recurrent neural networks》;Hui Li,et al;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171022;5238-5246 *

Also Published As

Publication number Publication date
CN109271967A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109271967B (en) Method and device for recognizing text in image, electronic equipment and storage medium
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
US10007867B2 (en) Systems and methods for identifying entities directly from imagery
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
US10902056B2 (en) Method and apparatus for processing image
US20190171903A1 (en) Optimizations for Dynamic Object Instance Detection, Segmentation, and Structure Mapping
WO2020151167A1 (en) Target tracking method and device, computer device and readable storage medium
WO2018224873A1 (en) Method and system for close loop perception in autonomous driving vehicles
CN109584276A (en) Critical point detection method, apparatus, equipment and readable medium
US10943151B2 (en) Systems and methods for training and validating a computer vision model for geospatial imagery
CN109189879B (en) Electronic book display method and device
US20190340746A1 (en) Stationary object detecting method, apparatus and electronic device
WO2018224877A1 (en) Method and system for integrated global and distributed learning in autonomous driving vehicles
US20200357131A1 (en) Methods and Systems for Detecting and Assigning Attributes to Objects of Interest in Geospatial Imagery
US20200143238A1 (en) Detecting Augmented-Reality Targets
US20230186517A1 (en) Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
US20170039450A1 (en) Identifying Entities to be Investigated Using Storefront Recognition
CN112232311B (en) Face tracking method and device and electronic equipment
US20230035366A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN111832561B (en) Character sequence recognition method, device, equipment and medium based on computer vision
CN114385662A (en) Road network updating method and device, storage medium and electronic equipment
CN112036517B (en) Image defect classification method and device and electronic equipment
CN110263779A (en) Text filed detection method and device, Method for text detection, computer-readable medium
CN110781809A (en) Identification method and device based on registration feature update and electronic equipment
CN113269730B (en) Image processing method, image processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant