CN111444876A - Image-text processing method and system and computer readable storage medium - Google Patents
- Publication number: CN111444876A (application CN202010268468.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- network
- neural network
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention relates to an image-text processing method, which comprises the following steps: acquiring an image of an image-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.
Description
Technical Field
The invention relates to techniques for processing image-text mixed files, and in particular to an image-text processing method, system, and computer readable storage medium.
Background
When collecting data, content recognition often needs to be performed on files in which images and text are mixed; for example, to collect identity information, an identity card image must be captured and recognized. However, conventional recognition methods fall short in many respects, such as feature extraction, text region detection, and text recognition.
Disclosure of Invention
Therefore, in order to perform content recognition, and in particular character recognition, on image-text mixed files efficiently and accurately, the invention provides a mechanism for processing such files, specifically as follows:
According to an aspect of the present invention, there is provided an image-text processing method, including the steps of: acquiring an image of an image-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.
In some embodiments of the present invention, optionally, a portrait area in the divided image is determined and cropped.
In some embodiments of the present invention, optionally, the method further comprises establishing a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network; the preprocessed image is input to the YOLO network, and the YOLO network divides it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region therein is determined by using the CTPN network.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the text region is input to the CRNN network, and text recognition is performed by using the CRNN network.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
According to another aspect of the present invention, there is provided an image-text processing system, comprising: an image preprocessing unit configured to acquire an image of the image-text mixed file and preprocess the image; a dividing unit configured to divide the preprocessed image into regions through a first neural network therein; a determining unit configured to determine a text region in the divided image through a second neural network therein; and a recognition unit configured to perform text recognition on the text region through a third neural network therein.
In some embodiments of the present invention, optionally, the determining unit is further configured to determine a portrait area in the divided image and perform cropping.
In some embodiments of the present invention, optionally, the system further comprises a mapping unit configured to establish a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and dividing it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the CRNN network receives the text region and performs text recognition.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
Fig. 1 shows an image-text processing method according to an embodiment of the invention.
Fig. 2 shows an image-text processing system according to an embodiment of the invention.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein primarily with reference to exemplary embodiments. Those skilled in the art will readily recognize, however, that the same principles apply equally to all types of image-text processing methods, systems, and computer-readable storage media, and that these same or similar principles may be implemented therein; any such variation does not depart from the true spirit and scope of the present patent application.
In the context of the present invention, optical character recognition (OCR) technology refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters on paper or other surfaces, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using character recognition methods. In other words, it is the process of scanning text material, then analyzing and processing the image file to obtain the characters and layout information.
In the context of the present invention, image detection is sometimes referred to as object detection. Its task is to find all objects of interest in an image and determine their positions and sizes; it is one of the core problems in the field of machine vision. Because objects vary in appearance, shape, and pose, and are subject to interference from imaging illumination, occlusion, and other factors, object detection has always been among the most challenging problems in machine vision.
Image recognition, sometimes referred to in the context of the present invention as image classification, can be viewed as taking a test picture as input and outputting the class to which the picture belongs. Text recognition in the context of the present invention refers to recognizing the content of a text line once the line has been located, converting the text information in the image into character information; the main problem it must solve is determining what each character is. A convolutional neural network in the context of the present invention is a class of feedforward neural networks that performs convolution operations and has a deep structure; it is one of the representative algorithms of deep learning. A convolutional feature map in the context of the present invention refers to the output of a convolutional neural network, used to extract abstract features of an image.
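The notion of a convolutional feature map described above can be illustrated with a minimal sketch (the 5×5 input and averaging kernel are arbitrary toy values, not taken from the patent):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image ('valid' mode) to produce a feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                     # simple averaging kernel
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # a 5x5 input and a 3x3 kernel yield a 3x3 feature map
```

A real CNN stacks many such kernels with learned weights; the principle of producing a spatial map of responses is the same.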
In the conventional technical scheme, the image file of the text material is analyzed and recognized through optical character recognition to obtain the characters and layout information; that is, the text in the image is recognized and returned in text form. More specifically, OCR of identification cards proceeds in two stages: the first stage is text detection, and the second is text recognition.
Methods based on digital image processing and traditional machine learning use image detection techniques and feed-forward neural networks to process the image and extract features. The commonly used binarization step helps enhance text information in simple scenes, but it is of very little benefit against complex backgrounds.
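As a toy illustration of the binarization step mentioned above (a fixed threshold on a small grayscale array; real systems would typically use Otsu or adaptive thresholding):

```python
import numpy as np

def binarize(gray, threshold=128):
    """Map pixels above the threshold to 255 (background) and the rest to 0 (text)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

gray = np.array([[200, 90], [30, 250]], dtype=np.uint8)  # toy grayscale patch
binary = binarize(gray)
print(binary)  # [[255, 0], [0, 255]]
```

On a clean document this cleanly separates ink from paper; on a complex background (as the text notes) a single global threshold fails, which is one motivation for the learned approach that follows.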
However, unlike object detection in everyday scenes, the distribution of text images is closer to a uniform distribution than to a normal distribution; that is, an average image of the text population does not represent the features of the abstract concept "text". In addition, the aspect ratios of text differ from those of everyday objects, which can make the usual anchor box candidates inapplicable. The orientation of the text also cannot be assumed in advance, and existing models do not perform well on non-horizontal text. Finally, some structures that frequently appear in natural scenes look very similar to characters, which raises the false positive rate. Existing models therefore require adjustment.
Traditional segmentation-based methods can produce multiple recognition results depending on where the split falls; for example, an improper split may break a single Chinese character into two smaller component characters. A common remedy is to over-segment the candidate characters so that they are broken apart thoroughly, and then merge the resulting fragments through dynamic programming to obtain the optimal combination; the loss function for this merging, however, must be designed by hand.
According to an aspect of the present invention, there is provided an image-text processing method. Fig. 1 shows such a method according to an embodiment of the invention; as shown, the method comprises the following steps. In step 102, an image of the image-text mixed file is obtained and preprocessed. The image-text mixed file referred to herein is a file in which pictures and text are arranged together; it also covers files consisting only of text, which exist as a special case of the image-text mixed file (after recognition, such a file simply contains no picture content). Specific examples of image-text mixed files include a passport page, a resident identity card, a job resume, and the like. The preprocessing includes color correction, distortion correction, and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
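One very simple form of the color correction mentioned above is contrast stretching; the following sketch (the function and values are illustrative, not the patent's algorithm) normalizes pixel intensities to the full 0-255 range:

```python
import numpy as np

def normalize_contrast(image):
    """Stretch pixel intensities to the full 0-255 range (a simple color correction)."""
    lo, hi = image.min(), image.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros_like(image, dtype=np.uint8)
    scaled = (image.astype(float) - lo) / (hi - lo) * 255.0
    return scaled.astype(np.uint8)

img = np.array([[50, 100], [150, 200]], dtype=np.uint8)  # low-contrast toy patch
out = normalize_contrast(img)
print(out.min(), out.max())  # 0 255
```

Distortion correction (e.g. deskewing a photographed card) would typically be handled with a perspective transform and is omitted here.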
In step 104, the preprocessed image is divided into regions. This step divides the preprocessed image into a plurality of regions (which also includes the special case of a single region). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single frame selection may serve as one region; in other examples, the content of multiple frame selections may be combined into one region. For example, the image may be divided into regions by a neural network, and the division result passed downstream.
In step 106, text regions in the divided image are determined. From the above, one or more regions have been boxed in step 104. In this step, text regions can be found from these regions based on the unique properties of the text. For example, which of them is a text region may be determined by a neural network, and the result of the determination is passed downstream.
In step 108, text recognition is performed on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, word recognition may be performed by a neural network.
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding steps can be completed by using the corresponding neural networks respectively. The method can greatly improve the accuracy and efficiency of character recognition.
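The three-stage structure of steps 104 through 108 can be sketched schematically as follows; the stage functions are placeholders standing in for the first, second and third neural networks, and the toy data is illustrative only:

```python
# Schematic of the three-stage pipeline. Each stage is independent and could
# be replaced by its corresponding neural network (e.g. YOLO, CTPN, CRNN).
def divide_into_regions(image):          # stands in for the first network
    return [{"kind": "photo", "data": image[:2]},
            {"kind": "unknown", "data": image[2:]}]

def find_text_regions(regions):          # stands in for the second network
    return [r for r in regions if r["kind"] != "photo"]

def recognize_text(region):              # stands in for the third network
    return "".join(region["data"])

image = ["p", "p", "h", "i"]             # toy stand-in for a preprocessed image
regions = divide_into_regions(image)
texts = [recognize_text(r) for r in find_text_regions(regions)]
print(texts)  # ['hi']
```

Because each stage only consumes the previous stage's output, any one network can be retrained or swapped without touching the others, which is the independence the paragraph above emphasizes.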
In some embodiments of the present invention, a portrait area in the divided image is determined and cropped. For an image-text mixed file, it is sometimes necessary to extract the portrait and other image information associated with the text recognition result; in some examples, the portrait or other image may be cropped out.
In some embodiments of the invention, the method further comprises establishing a mapping between the portrait area and the recognized text. In some cases, the various pieces of content in the collected image-text file can be linked. For example, if the mixed-layout document is a passport page, the passport holder's photo may be associated with his or her identity information; specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval: on the one hand, matching a portrait to a specific photo can determine the identity information; on the other hand, the photo data can be retrieved in reverse from the identity information.
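The two-way mapping between a portrait area and its recognized text can be sketched with a plain dictionary (identifiers and field contents below are illustrative, not from the patent):

```python
# Minimal sketch of the portrait-to-text mapping.
records = {}

def register(portrait_id, recognized_text):
    """Associate a cropped portrait with the text recognized from the same file."""
    records[portrait_id] = recognized_text

register("portrait_001", "Name: ZHANG SAN")

# Forward lookup: a matched portrait yields the identity text.
print(records["portrait_001"])                         # Name: ZHANG SAN

# Reverse lookup: identity text leads back to the photo record.
owners = {text: pid for pid, text in records.items()}
print(owners["Name: ZHANG SAN"])                       # portrait_001
```

A production system would use a database with indexed columns for both directions, but the retrieval pattern is the same.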
In some embodiments of the present invention, the preprocessed image is input to the YOLO network, and the YOLO network divides it into regions. The YOLO algorithm is split into two parts: Darknet-53, which serves as the backbone network, and the detection-regression part.
Darknet-53 follows a classic convolutional structure: layers 0 through 74 comprise 53 convolutional layers in total, and the remaining layers are residual (res) layers.
The detection-regression part is built from the feature-interaction layers of the YOLO network, layers 75 through 105, and is divided into three scales; within each scale, local feature interaction is realized by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify their categories, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into an S×S grid. If the center coordinate of an object's annotated bounding box falls into a grid cell, that cell's confidence is set to 1. Each grid cell predicts B bounding boxes with their confidences, as well as C class probabilities.
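A small sketch of the grid bookkeeping described above (the S=7, B=2, C=20 values are the classic YOLOv1 settings, used here only as an example):

```python
def yolo_output_shape(S, B, C):
    """Each of the SxS cells predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Grid cell (row, col) whose confidence target is 1 when a box center (cx, cy) falls inside it."""
    return (int(cy / img_h * S), int(cx / img_w * S))

# Classic YOLOv1 settings give a 7x7x30 output tensor.
print(yolo_output_shape(7, 2, 20))                          # (7, 7, 30)
# A box centered at (300, 100) in a 448x448 image lands in cell (1, 4).
print(responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=7))  # (1, 4)
```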
The training process employs an Adam optimizer. Adam (adaptive moment estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradients. The advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter updates relatively stable.
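A minimal NumPy sketch of one Adam update, including the bias correction mentioned above (hyperparameters are the common defaults aside from the illustrative learning rate; the quadratic objective is a toy example):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) and RMS (v) accumulators with bias correction."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(round(theta, 3))  # close to the minimum at 0
```

Note how the per-step displacement is roughly bounded by `lr` regardless of the gradient's raw magnitude, which is the stability property the paragraph above describes.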
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed with VGG16 as the backbone network, producing a convolutional feature map of size W×H×C. Second, a 3×3 sliding window extracts features from this feature map, and these features are used to predict multiple anchor boxes. The features from the previous step are then input into a bidirectional LSTM, which outputs a result of size W×256; this result is input into a 512-dimensional fully connected layer. Finally, the output of the fully connected layer is used for classification and regression.
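The anchor boxes referred to above are, in the commonly cited CTPN design, vertical proposals of fixed 16-pixel width whose heights follow a geometric series; the sketch below reproduces that series (the values follow the CTPN paper, not text from this patent):

```python
ANCHOR_WIDTH = 16  # every CTPN anchor is 16 pixels wide at the input scale

def ctpn_anchor_heights(n=10, start=11.0, ratio=0.7):
    """Heights of the k vertical anchors: start at 11 px, divide by 0.7 each step."""
    heights, h = [], start
    for _ in range(n):
        heights.append(round(h))
        h /= ratio
    return heights

heights = ctpn_anchor_heights()
print(len(heights), heights[0], heights[-1])  # 10 anchors, from 11 px up to 273 px
```

Fixing the width and only varying the height is what lets the bidirectional LSTM later stitch the narrow proposals into full text lines.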
In some embodiments of the invention, a text region is input to a CRNN network, and text recognition is performed with the CRNN. Assuming the input image has shape (32, 100, 3), the network works from bottom to top: the CNN (convolutional layers) extracts features from the input image with a deep CNN to obtain a feature map; the RNN (recurrent layers) predicts over the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a predicted label (ground-truth) distribution; and the CTC loss converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional part consists of the convolutional and max-pooling layers of a standard CNN model and automatically extracts a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales input images to the same height before training, here 32 (the image width is left unchanged). The vectors of the extracted feature sequence are generated from the feature map in order from left to right; each feature vector represents the features of the image over a certain width, here a width of 1, i.e., a single pixel column.
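The column-wise reading of the feature map described above can be sketched as follows (the 512-channel, 1×26 shape matches the configuration commonly reported for a 100-pixel-wide CRNN input, assumed here for illustration):

```python
import numpy as np

# Stand-in for the output of the CRNN convolutional stack on a (32, 100, 3) input:
# shape (channels, height, width) = (512, 1, 26).
feature_map = np.random.rand(512, 1, 26)

# Each 1-pixel-wide column becomes one feature vector of the sequence
# fed to the recurrent layers, read left to right.
sequence = [feature_map[:, 0, t] for t in range(feature_map.shape[2])]
print(len(sequence), sequence[0].shape)  # 26 time steps, each a 512-dim vector
```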
The recurrent part consists of a bidirectional LSTM recurrent neural network, which predicts the label distribution (a probability list over the true results) for each feature vector in the feature sequence. The error of the recurrent layers is back-propagated and finally converted into a feature sequence that is fed back to the convolutional layers.
For a sequence of length T, the RNN outputs a softmax vector at each time step t, representing the prediction probabilities at that step (T is far greater than the length of the final label). After the probabilities for all time steps are passed to the CTC model, the most likely labels are output, and the final sequence label is obtained by removing blanks and duplicates.
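The de-duplication and blank-removal decoding described above can be sketched as a greedy CTC decoder (label indices below are illustrative, with 0 as the blank):

```python
def ctc_greedy_decode(argmax_path, blank=0):
    """Collapse repeated labels, then drop blanks: turns per-time-step
    predictions into the final label sequence."""
    decoded, prev = [], None
    for label in argmax_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-time-step argmax spelling "h h - e - l l - l o" with '-' as blank (0)
# and letters indexed a=1..z=26 (h=8, e=5, l=12, o=15).
path = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(path))  # [8, 5, 12, 12, 15], i.e. "hello"
```

The blank between the two `l` groups is what lets the decoder keep the doubled letter while still collapsing genuine repeats.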
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a whole, highly heterogeneous text line. Therefore, single characters are first detected in the manner of R-CNN, and a bidirectional LSTM is then added to the detection network so that the detection results form a sequence carrying the contextual features of the text; multiple characters can then be combined to obtain the text line.
In some embodiments of the invention, the image-text mixed file is an identity document. For example, resident identity cards may be processed using the principles described above.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
According to another aspect of the present invention, there is provided an image-text processing system. As shown in fig. 2, the system 20 comprises a preprocessing unit 202, a dividing unit 204, a determining unit 206 and a recognition unit 208.
The preprocessing unit 202 is configured to obtain an image of the image-text mixed file and preprocess the image. The image-text mixed file referred to herein is a file in which pictures and text are arranged together; it also covers files consisting only of text, which exist as a special case of the image-text mixed file (after recognition, such a file simply contains no picture content). Specific examples of image-text mixed files include a passport page, a resident identity card, a job resume, and the like. The preprocessing includes color correction, distortion correction, and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
The dividing unit 204 is configured to divide the preprocessed image into regions. The dividing unit 204 divides the preprocessed image into a plurality of regions (including a special case of a single region as well). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single box may serve as one area; in other examples, the content of multiple frame selections may be defined as one region. For example, the image may be divided into regions by a neural network included in the dividing unit 204, and the division result may be transferred to other unit modules.
The determination unit 206 is configured to determine a text region in the divided image. From the above, it can be seen that the partitioning unit 204 has outlined one or more regions. At this time, the determination unit 206 may find the text region from these regions according to the unique attribute of the text. For example, which of them is the text region may be determined by a neural network included in the determination unit 206, and the result of the determination is transferred to the other unit modules.
The recognition unit 208 is configured to perform text recognition on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, the character recognition may be performed by a neural network included in the recognition unit 208.
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding unit modules can be specifically completed by using the corresponding neural networks. The method can greatly improve the accuracy and efficiency of character recognition.
In some embodiments of the invention, the determining unit is further configured to determine a portrait area in the divided image and crop it. For an image-text mixed file, it is sometimes necessary to extract the portrait and other image information associated with the text recognition result; in some examples, the portrait or other image may be cropped out.
In some embodiments of the invention, the system further comprises a mapping unit configured to establish a mapping between the portrait area and the recognized text. In some cases, the various pieces of content in the collected image-text file can be linked. For example, if the mixed-layout document is a passport page, the passport holder's photo may be associated with his or her identity information; specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval: on the one hand, matching a portrait to a specific photo can determine the identity information; on the other hand, the photo data can be retrieved in reverse from the identity information.
In some embodiments of the present invention, the dividing unit comprises a YOLO network, which receives the preprocessed image and divides it into regions. The YOLO algorithm is split into two parts: Darknet-53, which serves as the backbone network, and the detection-regression part.
Darknet-53 follows a classic convolutional structure: layers 0 through 74 comprise 53 convolutional layers in total, and the remaining layers are residual (res) layers.
The detection-regression part is built from the feature-interaction layers of the YOLO network, layers 75 through 105, and is divided into three scales; within each scale, local feature interaction is realized by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify their categories, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into an S×S grid. If the center coordinate of an object's annotated bounding box falls into a grid cell, that cell's confidence is set to 1. Each grid cell predicts B bounding boxes with their confidences, as well as C class probabilities.
The training process employs an Adam optimizer. Adam (adaptive moment estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradients. The advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter updates relatively stable.
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed using VGG16 as the backbone network, yielding a convolutional feature map of size W×H×C. Second, features are extracted from that feature map with a 3×3 sliding window and used to predict multiple anchor boxes. The features are then input to a bidirectional LSTM, which outputs a result of size W×256, and this result is fed into a 512-dimensional fully connected layer.
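The tensor shapes flowing through those steps can be traced in a small sketch. The concrete sizes and the anchor count per position are assumptions for illustration; the text only fixes the 3×3 window, the 256-wide BLSTM output and the 512-d fully connected layer:

```python
def ctpn_shapes(W, H, C=512, anchors_per_pos=10):
    """Shape-level walk-through of the CTPN steps described above.
    C and anchors_per_pos are illustrative assumptions."""
    fmap = (H, W, C)                  # VGG16 convolutional feature map
    window = (H, W, 3 * 3 * C)        # 3x3 sliding-window features
    blstm = (H, W, 256)               # bidirectional LSTM output, W x 256 per row
    fc = (H, W, 512)                  # 512-dimensional fully connected layer
    anchors = H * W * anchors_per_pos # anchor boxes predicted over the map
    return fmap, window, blstm, fc, anchors
```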
In some embodiments of the invention, the recognition unit comprises a CRNN network, which receives text regions and performs text recognition. Assuming the input image has shape (32, 100, 3), the network consists, from bottom to top, of: a CNN (convolutional layers), which extracts features from the input image with a deep CNN to obtain a feature map; an RNN (recurrent layers), which predicts over the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a distribution over the predicted labels (true values); and a CTC loss, which converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional layers consist of the convolutional and max-pooling layers of a standard CNN model and automatically extract a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales all input images to the same height before training (the image width remains unchanged), using a height of 32. The vectors in the extracted feature sequence are generated from the feature map in order from left to right; each feature vector represents a feature of the image over a certain width, here a width of 1, i.e. a single pixel column.
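The left-to-right column slicing described above can be sketched directly: each width position of the feature map becomes one vector of the sequence. The feature-map dimensions are illustrative assumptions:

```python
import numpy as np

def feature_map_to_sequence(fmap):
    """Turn a (channels, height, width) feature map into a left-to-right
    sequence of per-column feature vectors, one per width-1 slice."""
    c, h, w = fmap.shape
    return [fmap[:, :, t].reshape(-1) for t in range(w)]

# Assumed conv output for a height-32 input after pooling: 512 channels,
# height collapsed to 1, width 25 (values are illustrative only).
fmap = np.zeros((512, 1, 25))
seq = feature_map_to_sequence(fmap)
```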
The recurrent layers consist of a bidirectional LSTM recurrent neural network, which predicts the label distribution (a list of probabilities over the true results) of each feature vector in the feature sequence. The error of the recurrent layers is back-propagated, finally converted into a feature sequence, and fed back to the convolutional layers.
For a sequence of length T, each time step t outputs a softmax vector at the last layer of the RNN, representing the prediction probabilities of that time step (T is generally far greater than the length of the final label sequence). After the probabilities of all time steps are passed to the CTC model, the most probable labels are output, and the final sequence label is obtained by removing blanks and collapsing duplicates.
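The blank-removal and de-duplication step above corresponds to greedy CTC decoding: take the most probable label at each time step, collapse consecutive repeats, then drop blanks. A minimal sketch (label indices and the blank index 0 are assumptions):

```python
def ctc_greedy_decode(probs, blank=0):
    """probs: T lists of per-label probabilities (one list per time step).
    Returns the decoded label sequence after collapsing repeats and
    removing the blank label."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax path
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```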
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a text line, which has a higher degree of heterogeneity. Therefore, single characters are first detected, as in R-CNN, and a bidirectional LSTM is then added to the detection network so that the detection results form a sequence providing the contextual features of the text; multiple characters can then be combined to obtain text lines.
In some embodiments of the invention, the teletext file is an identity document. For example, resident identification cards may be processed using the principles described above.
In summary, the present invention provides an image-text processing method, system and computer-readable storage medium that implement region division, text detection and text recognition independently and, in some examples, perform each step with a neural network that has specific processing advantages, thereby improving the accuracy and efficiency of recognition. It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The above examples mainly illustrate the teletext processing method, system and computer-readable storage medium of the invention. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims (15)
1. An image-text processing method is characterized by comprising the following steps:
acquiring an image of a picture-text mixed file, and preprocessing the image;
dividing the preprocessed image into regions through a first neural network;
determining a text region in the divided image through a second neural network; and
and performing text recognition on the text region through a third neural network.
2. The method of claim 1, wherein a portrait area in the divided image is determined and cropped.
3. The method of claim 2, further comprising establishing a mapping of the portrait area to recognized text.
4. The method according to any of claims 1-3, characterized in that the first neural network is a YOLO network, the preprocessed image is input to the YOLO network, and the image is regionally partitioned with the YOLO network.
5. The method of claim 4, wherein the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region is determined using the CTPN network.
6. The method of claim 5, wherein the third neural network is a CRNN network, and wherein the text region is input to the CRNN network for text recognition using the CRNN network.
7. The method of claim 1, wherein the teletext file is an identity document.
8. A computer-readable storage medium having instructions stored therein, which when executed by a processor, cause the processor to perform the method of any one of claims 1-7.
9. An image-text processing system, the system comprising:
the image preprocessing unit is configured to acquire an image of the image-text mixed file and preprocess the image;
a dividing unit configured to divide the preprocessed image into regions by a first neural network therein;
a determination unit configured to determine a text region in the divided image through a second neural network therein; and
a recognition unit configured to perform text recognition on the text region through a third neural network therein.
10. The system of claim 9, wherein the determining unit is further configured to determine a portrait area in the divided image and crop it.
11. The system according to claim 10, further comprising a mapping unit configured to establish a mapping relationship of the portrait area to the recognized text.
12. The system according to any of claims 9-11, wherein the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and partitioning it into regions.
13. The system of claim 12, wherein the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
14. The system of claim 13, wherein the third neural network is a CRNN network that receives the text region and performs text recognition.
15. The system of claim 9, wherein the teletext file is an identity document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010268468.4A CN111444876A (en) | 2020-04-08 | 2020-04-08 | Image-text processing method and system and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111444876A true CN111444876A (en) | 2020-07-24 |
Family
ID=71650124
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010268468.4A Pending CN111444876A (en) | 2020-04-08 | 2020-04-08 | Image-text processing method and system and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444876A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818949A (en) * | 2021-03-09 | 2021-05-18 | 浙江天派科技有限公司 | Method and system for identifying delivery certificate characters |
CN116994270A (en) * | 2023-08-28 | 2023-11-03 | 乐麦信息技术(杭州)有限公司 | Resume analysis method, device, equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015196084A1 (en) * | 2014-06-20 | 2015-12-23 | Theodore Kuklinski | A self-learning system and methods for automatic document recognition, authentication, and information extraction |
CN108595544A (en) * | 2018-04-09 | 2018-09-28 | 深源恒际科技有限公司 | A kind of document picture classification method |
CN109344815A (en) * | 2018-12-13 | 2019-02-15 | 深源恒际科技有限公司 | A kind of file and picture classification method |
CN109389121A (en) * | 2018-10-30 | 2019-02-26 | 金现代信息产业股份有限公司 | A kind of nameplate recognition methods and system based on deep learning |
CN110135446A (en) * | 2018-02-09 | 2019-08-16 | 北京世纪好未来教育科技有限公司 | Method for text detection and computer storage medium |
CN110443184A (en) * | 2019-07-31 | 2019-11-12 | 上海海事大学 | ID card information extracting method, device and computer storage medium |
CN110956171A (en) * | 2019-11-06 | 2020-04-03 | 广州供电局有限公司 | Automatic nameplate identification method and device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
余若男; 黄定江; 董启文: "Research Progress in Deep-Learning-Based Scene Text Detection" (基于深度学习的场景文字检测研究进展), no. 05 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6843086B2 (en) | Image processing systems, methods for performing multi-label semantic edge detection in images, and non-temporary computer-readable storage media | |
CA3027038C (en) | Document field detection and parsing | |
WO2019192397A1 (en) | End-to-end recognition method for scene text in any shape | |
US7236632B2 (en) | Automated techniques for comparing contents of images | |
US8594431B2 (en) | Adaptive partial character recognition | |
EP1598770B1 (en) | Low resolution optical character recognition for camera acquired documents | |
US9965695B1 (en) | Document image binarization method based on content type separation | |
CN111401372A (en) | Method for extracting and identifying image-text information of scanned document | |
JP5176763B2 (en) | Low quality character identification method and apparatus | |
US20100111375A1 (en) | Method for Determining Atributes of Faces in Images | |
US20030012438A1 (en) | Multiple size reductions for image segmentation | |
CN111444876A (en) | Image-text processing method and system and computer readable storage medium | |
Verma et al. | Removal of obstacles in Devanagari script for efficient optical character recognition | |
Natei et al. | Extracting text from image document and displaying its related information | |
Hoxha et al. | Remote sensing image captioning with SVM-based decoding | |
Karanje et al. | Survey on text detection, segmentation and recognition from a natural scene images | |
Ghoshal et al. | An improved scene text and document image binarization scheme | |
Kaur et al. | Page segmentation in OCR system-a review | |
JP2017084006A (en) | Image processor and method thereof | |
Valiente et al. | A process for text recognition of generic identification documents over cloud computing | |
Qin et al. | Laba: Logical layout analysis of book page images in arabic using multiple support vector machines | |
Sahota et al. | An empirical enhancement using scale invariant feature transform in text extraction from images | |
CN111353353A (en) | Cross-posture face recognition method and device | |
Shivani | Techniques of Text Detection and Recognition: A Survey | |
CN117649672B (en) | Font type visual detection method and system based on active learning and transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||