CN111444876A - Image-text processing method and system and computer readable storage medium - Google Patents

Image-text processing method and system and computer readable storage medium

Info

Publication number
CN111444876A
CN111444876A (application CN202010268468.4A)
Authority
CN
China
Prior art keywords
image
text
network
neural network
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010268468.4A
Other languages
Chinese (zh)
Inventor
陶民泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
E Capital Transfer Co ltd
Original Assignee
E Capital Transfer Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by E Capital Transfer Co ltd filed Critical E Capital Transfer Co ltd
Priority to CN202010268468.4A
Publication of CN111444876A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to an image-text processing method, which comprises the following steps: acquiring an image of a picture-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.

Description

Image-text processing method and system and computer readable storage medium
Technical Field
The invention relates to a mechanism for processing an image-text mixed file, and in particular to an image-text processing method, system and computer readable storage medium.
Background
When data collection is performed, content identification needs to be performed on a file with mixed images and texts, for example, in order to collect identity information, an identity card image needs to be collected and identified. However, the conventional recognition method has defects in many aspects such as feature extraction, character region detection, and text recognition.
Disclosure of Invention
Therefore, in order to efficiently and accurately perform content recognition, particularly character recognition, on a file with mixed graphics and text, the invention provides a mechanism for processing such a file, which specifically includes the following aspects:
according to an aspect of the present invention, there is provided an image-text processing method, including the steps of: acquiring an image of a picture-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.
In some embodiments of the present invention, optionally, a portrait area in the divided image is determined and cropped.
In some embodiments of the present invention, optionally, the method further comprises establishing a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network; the preprocessed image is input to the YOLO network, and the YOLO network is used to divide it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region therein is determined by using the CTPN network.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the text region is input to the CRNN network, and text recognition is performed by using the CRNN network.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
According to another aspect of the present invention, there is provided an image-text processing system comprising: an image preprocessing unit configured to acquire an image of the image-text mixed file and preprocess the image; a dividing unit configured to divide the preprocessed image into regions through a first neural network therein; a determination unit configured to determine a text region in the divided image through a second neural network therein; and a recognition unit configured to perform text recognition on the text region through a third neural network therein.
In some embodiments of the present invention, optionally, the determining unit is further configured to determine a portrait area in the divided image and perform cropping.
In some embodiments of the present invention, optionally, the system further comprises a mapping unit configured to establish a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and dividing it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the CRNN network receives the text region and performs text recognition.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
Fig. 1 shows an image-text processing method according to an embodiment of the invention.
Fig. 2 shows an image-text processing system according to an embodiment of the invention.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of image-text processing methods, systems and computer-readable storage media, and that these same or similar principles may be implemented therein, with any such variations not departing from the true spirit and scope of the present patent application.
In the context of the present invention, optical character recognition (OCR) technology refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters on paper or another surface, determines their shape by detecting dark and light patterns, and then translates the shapes into computer text using character recognition methods. In other words, it is the process of scanning printed text, then analyzing and processing the resulting image file to obtain the character and layout information.
Image detection is sometimes referred to as object detection in the context of the present invention. Its task is to find all objects of interest in an image and determine their position and size, and it is one of the core problems in the field of machine vision. Because objects vary in appearance, shape and pose, and are affected by imaging illumination, occlusion and other factors, object detection has always been one of the most challenging problems in machine vision.
Image recognition, which is sometimes referred to as image classification in the context of the present invention, can be viewed as taking a test image as input and outputting the class to which the image belongs. Text recognition in the context of the present invention refers to recognizing the content of a text line once the text line has been located, converting the text information in the image into character data; the main problem text recognition must solve is determining what each character is. A convolutional neural network in the context of the present invention refers to a class of feedforward neural networks that contain convolution operations and have a deep structure, and is one of the representative algorithms of deep learning. A convolutional feature map in the context of the present invention refers to the output of a convolutional neural network, used to extract abstract features of an image.
In the conventional technical scheme, the image file of the text data is analyzed and recognized through optical character recognition to obtain the characters and the layout information; that is, the text in the image is recognized and returned in text form. More specifically, OCR recognition of identification cards is divided into two stages: the first stage is text detection and the second stage is text recognition.
Methods based on digital image processing and traditional machine learning use image detection techniques and feedforward neural networks to process the image and extract features. The commonly used binarization step helps enhance text information in simple scenes, but its effect on complex backgrounds is very limited.
However, unlike object detection in other everyday scenes, the distribution of text images is closer to a uniform distribution than to a normal distribution; that is, the average image over a population of text does not represent the abstract concept of text. In addition, the aspect ratio of text differs from that of ordinary objects, which can make the candidate anchor boxes unsuitable. Moreover, the orientation of the text cannot be assumed in advance, and performance degrades when the text is not in the usual orientation. Some structures that often appear in natural scenes closely resemble characters, which increases the false positive rate. Adjustments to existing models are therefore required.
Improper split positions can also produce multiple recognition results; when the segmentation is wrong, a single character may be broken into separate fragments. One remedy is to over-segment the candidate characters so that they are broken up sufficiently, and then merge the split fragments through dynamic programming to obtain an optimal combination, but the loss function for this needs to be designed manually.
According to an aspect of the present invention, there is provided a method of image-text processing. Fig. 1 shows an image-text processing method according to an embodiment of the invention. As shown, the method comprises the following steps. In step 102, an image of the image-text mixed file is obtained, and the image is preprocessed. The image-text mixed file referred to here is a file in which pictures and text are arranged together; it also covers a file consisting only of text, which exists as a special case of an image-text mixed file, i.e. a file that turns out to contain no picture content after recognition. Specific examples of the image-text mixed file may include a passport page, a resident identification card, a job resume, and the like. The preprocessing includes color correction, distortion correction and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
In step 104, the preprocessed image is divided into regions. This step divides the preprocessed image into a plurality of regions (which also includes the special case of a single region). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single framed selection may serve as one region; in other examples, the content of multiple framed selections may be combined into one region. For example, the image may be divided into regions by a neural network, and the division result passed downstream.
In step 106, text regions in the divided image are determined. From the above, one or more regions have been boxed in step 104. In this step, text regions can be found from these regions based on the unique properties of the text. For example, which of them is a text region may be determined by a neural network, and the result of the determination is passed downstream.
In step 108, text recognition is performed on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, word recognition may be performed by a neural network.
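By way of illustration only, the following is a minimal Python sketch of how the four steps could be chained together; the model objects (yolo, ctpn, crnn) and their methods are hypothetical placeholders chosen for this example, not part of the disclosed implementation.

```python
import cv2  # OpenCV, used here only for loading and colour conversion

def process_document(path, yolo, ctpn, crnn):
    # Step 102: acquire the image of the image-text mixed file and preprocess it
    # (colour correction, distortion correction, etc. would be applied here).
    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Step 104: divide the preprocessed image into regions with the first network.
    regions = yolo.detect(image)                      # list of (x, y, w, h) boxes

    # Step 106: determine which divided regions contain text with the second network.
    text_regions = [r for r in regions if ctpn.contains_text(image, r)]

    # Step 108: perform text recognition on each text region with the third network.
    texts = []
    for x, y, w, h in text_regions:
        texts.append(crnn.recognize(image[y:y + h, x:x + w]))
    return texts
```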
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding steps can be completed by using the corresponding neural networks respectively. The method can greatly improve the accuracy and efficiency of character recognition.
In some embodiments of the present invention, a portrait area in the divided image is determined and cropped. For the text-mixed file, it is sometimes necessary to extract the portrait associated with the text recognition result and other image information, and in some examples, the portrait or other image may be clipped.
In some embodiments of the invention, the method further comprises establishing a mapping between the portrait area and the recognized text. In some cases, the various contents collected from the image-text mixed file need to be linked. For example, if the mixed-typeset document is a passport page, the photo of the passport holder may be associated with his or her identity information; specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval: on the one hand, the identity information can be determined by matching a portrait to a specific photo; on the other hand, the photo can be retrieved via the identity information.
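As a toy illustration only, such a mapping could be kept in a simple keyed structure; the record layout and function names below are assumptions made for this example, not part of the disclosure.

```python
# Hypothetical record structure linking a cropped portrait to the recognized text fields.
records = {}

def add_record(document_id, portrait_crop, identity_fields):
    # Establish the mapping between the portrait area and the recognized text.
    records[document_id] = {"portrait": portrait_crop, "identity": identity_fields}

def find_identity(document_id):
    # Retrieval in one direction: look up the identity fields for a given document.
    return records[document_id]["identity"]
```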
In some embodiments of the present invention, the preprocessed image is input to the YOLO network, and the YOLO network is used to divide it into regions. The YOLO algorithm is split into two parts: the backbone network, Darknet-53, and the detection/regression part.
The classic convolutional layer structure of Darknet-53 is such that there are a total of 53 convolutional layers from layer 0 through layer 74, and the remaining layers are res (residual) layers.
The detection/regression part is constructed from the feature interaction layers of the YOLO network, from layer 75 to layer 105, and is divided into three scales; local feature interaction is realized within each scale by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify them, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into S × S grid cells; if the center coordinate of an object's labeled bounding box falls into a certain cell, the confidence of that cell is set to 1. Each cell predicts B bounding boxes with their confidences, together with C class probabilities.
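The grid assignment described above can be illustrated with the following NumPy sketch; the values of S, B, C and the input size are arbitrary assumptions for the example, not values fixed by the invention.

```python
import numpy as np

S, B, C = 13, 3, 2            # grid size, boxes per cell, number of classes (assumed values)
H, W = 416, 416               # input image size (assumed)

def build_target(boxes):
    """boxes: list of (cx, cy, w, h, cls) labelled boxes in pixel coordinates."""
    target = np.zeros((S, S, B * 5 + C), dtype=np.float32)
    for cx, cy, w, h, cls in boxes:
        i, j = int(cy / H * S), int(cx / W * S)        # grid cell containing the box centre
        target[i, j, 4] = 1.0                          # confidence of that cell set to 1
        target[i, j, 0:4] = [cx / W, cy / H, w / W, h / H]
        target[i, j, B * 5 + cls] = 1.0                # one of the C class probabilities
    return target
```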
The training problem is solved with an Adam optimizer. Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Its advantage is that, after bias correction, the learning rate of each iteration stays within a certain range, so the parameters remain relatively stable.
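For reference, the Adam update mentioned above can be written as the following plain-NumPy sketch; this is the standard textbook formulation, not code taken from the disclosure.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment (uncentred variance) estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction; t is the iteration count (>= 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```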
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed using VGG16 as the backbone network, yielding a convolutional feature map of size W × H × C. Second, features are extracted from that feature map with a 3 × 3 sliding window, and these features are used to predict multiple anchor boxes. The features from the previous step are then fed into a bidirectional LSTM, which outputs a result of size W × 256, and this result is fed into a 512-dimensional fully connected layer. Finally, the outputs are produced by the classification and regression branches.
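The shape transitions of this CTPN pipeline can be sketched in PyTorch as follows; the layer sizes follow the text above (VGG16 features, 3 × 3 window, bidirectional LSTM with 256 outputs, 512-dimensional fully connected layer), while the number of anchors per position and other details are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class CTPNSketch(nn.Module):
    def __init__(self, k=10):                                   # k anchors per position (assumed)
        super().__init__()
        self.window = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # 3 x 3 sliding window on the VGG16 feature map
        self.blstm = nn.LSTM(512, 128, bidirectional=True, batch_first=True)  # 2 x 128 = 256 outputs per column
        self.fc = nn.Linear(256, 512)                           # 512-dimensional fully connected layer
        self.cls = nn.Linear(512, 2 * k)                        # text / non-text classification
        self.reg = nn.Linear(512, 2 * k)                        # coordinate regression

    def forward(self, feat):                                    # feat: (N, 512, H, W) VGG16 feature map
        x = self.window(feat)                                   # (N, 512, H, W)
        n, c, h, w = x.shape
        x = x.permute(0, 2, 3, 1).reshape(n * h, w, c)          # one width-wise sequence per feature-map row
        x, _ = self.blstm(x)                                    # (N*H, W, 256)
        x = self.fc(x)                                          # (N*H, W, 512)
        return self.cls(x), self.reg(x)                         # classification and regression outputs
```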
In some embodiments of the invention, the text region is input into a CRNN network, and text recognition is performed with the CRNN network. Assuming that the input image has a size of the form (32, 100, 3), then, from bottom to top: the CNN (convolutional layers) extracts features from the input image with a deep CNN to obtain a feature map; the RNN (recurrent layers) predicts the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a distribution over the predicted labels (ground-truth values); and the CTC loss converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional layers consist of the convolutional and max-pooling layers of a standard CNN model, and automatically extract a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales the input images to the same height before training (the image width remains unchanged), using a height of 32. The vectors in the extracted feature sequence are generated from the feature map from left to right in order; each feature vector represents the features of the image within a certain width, and that width is 1, i.e. a single pixel.
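As an illustration of this column-wise feature sequence, the following sketch slices a convolutional feature map whose height has been pooled down to 1 into per-column feature vectors; the shapes are assumed for the example.

```python
import numpy as np

def feature_map_to_sequence(feature_map):
    """feature_map: (C, 1, W) array, the CRNN convolutional output with height pooled to 1."""
    c, h, w = feature_map.shape
    assert h == 1, "the convolutional stack is expected to reduce the height to 1"
    # Each one-pixel-wide column becomes one feature vector of the sequence (left to right).
    return [feature_map[:, 0, i] for i in range(w)]
```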
The recurrent layers consist of a bidirectional LSTM recurrent neural network and predict the label distribution (a list of probabilities over the true results) for each feature vector in the feature sequence. The error of the recurrent layers is back-propagated and finally converted into a feature sequence, which is fed back to the convolutional layers.
For a sequence of length T, each time step t outputs a softmax vector at the last layer of the RNN, representing the prediction probabilities at that step (T is much greater than the length of the final label sequence). After the probabilities of all time steps are passed to the CTC model, the most probable labels are output, and the final sequence label is obtained by removing blanks and duplicates.
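The blank-and-duplicate removal step can be illustrated with a simple greedy CTC decoder; this is a generic sketch of the technique, not the patent's implementation.

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """probs: (T, num_labels) softmax outputs, one row per time step."""
    best = np.argmax(probs, axis=1)          # most probable label at each time step
    decoded, prev = [], None
    for label in best:
        if label != prev and label != blank: # collapse repeats, then drop the blank symbol
            decoded.append(int(label))
        prev = label
    return decoded
```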
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a whole text line, which has a higher degree of heterogeneity. Therefore, single characters are first detected, in the manner of R-CNN, and a bidirectional LSTM is then added to the detection network, so that the detection results form a sequence that provides the contextual features of the text; multiple characters can then be combined to obtain the text line.
In some embodiments of the invention, the image-text mixed file is an identity document. For example, resident identification cards may be processed using the principles described above.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
As shown in fig. 2, the system 20 comprises a preprocessing unit 202, a dividing unit 204, a determining unit 206 and a recognition unit 208.
The preprocessing unit 202 is configured to obtain an image of the image-text mixed file and preprocess the image. The image-text mixed file referred to here is a file in which pictures and text are arranged together; it also covers a file consisting only of text, which exists as a special case of an image-text mixed file, i.e. a file that turns out to contain no picture content after recognition. Specific examples of the image-text mixed file may include a passport page, a resident identification card, a job resume, and the like. The preprocessing includes color correction, distortion correction and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
The dividing unit 204 is configured to divide the preprocessed image into regions. The dividing unit 204 divides the preprocessed image into a plurality of regions (which also includes the special case of a single region). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single framed selection may serve as one region; in other examples, the content of multiple framed selections may be combined into one region. For example, the image may be divided into regions by a neural network included in the dividing unit 204, and the division result passed to the other unit modules.
The determination unit 206 is configured to determine the text region in the divided image. As described above, the dividing unit 204 has framed one or more regions. The determination unit 206 can then find the text regions among these regions according to the distinctive attributes of text. For example, which of them are text regions may be determined by a neural network included in the determination unit 206, and the result of the determination passed to the other unit modules.
The recognition unit 208 is configured to perform text recognition on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, the character recognition may be performed by a neural network included in the recognition unit 208.
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding unit modules can be specifically completed by using the corresponding neural networks. The method can greatly improve the accuracy and efficiency of character recognition.
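One possible way to compose the four units into a system object is sketched below; the class and method names are placeholders chosen for illustration, and the individual units are assumed to wrap the corresponding neural networks described elsewhere in this description.

```python
class TeletextProcessingSystem:
    def __init__(self, preprocessing_unit, dividing_unit, determination_unit, recognition_unit):
        self.preprocessing_unit = preprocessing_unit    # acquires and preprocesses the image
        self.dividing_unit = dividing_unit              # wraps the first neural network
        self.determination_unit = determination_unit    # wraps the second neural network
        self.recognition_unit = recognition_unit        # wraps the third neural network

    def run(self, path):
        image = self.preprocessing_unit.load_and_correct(path)
        regions = self.dividing_unit.divide(image)
        text_regions = self.determination_unit.find_text(image, regions)
        return self.recognition_unit.recognize(image, text_regions)
```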
In some embodiments of the invention, the determining unit is further configured to determine a portrait area in the divided image and to crop. For the text-mixed file, it is sometimes necessary to extract the portrait associated with the text recognition result and other image information, and in some examples, the portrait or other image may be clipped.
In some embodiments of the invention, the system further comprises a mapping unit configured to establish a mapping relationship of the portrait area to the recognized text. In some cases, the various content in the collected teletext file may be linked. For example, if the mixed-typeset document is a passport page, the photo of the passport holder may be associated with the identity information thereof, and specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval purposes: for example, the identity information can be determined by matching the portrait identification to a specific photo; on the other hand, the photo data can be checked back through the identity information.
In some embodiments of the present invention, the dividing unit comprises a YOLO network, which receives the preprocessed image and divides it into regions. The YOLO algorithm is split into two parts: the backbone network, Darknet-53, and the detection/regression part.
The classic convolutional layer structure of Darknet-53 is such that there are a total of 53 convolutional layers from layer 0 through layer 74, and the remaining layers are res (residual) layers.
The detection/regression part is constructed from the feature interaction layers of the YOLO network, from layer 75 to layer 105, and is divided into three scales; local feature interaction is realized within each scale by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify them, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into S × S grid cells; if the center coordinate of an object's labeled bounding box falls into a certain cell, the confidence of that cell is set to 1. Each cell predicts B bounding boxes with their confidences, together with C class probabilities.
The training problem is solved with an Adam optimizer. Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Its advantage is that, after bias correction, the learning rate of each iteration stays within a certain range, so the parameters remain relatively stable.
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed using VGG16 as the backbone network, yielding a convolutional feature map of size W × H × C. Second, features are extracted from that feature map with a 3 × 3 sliding window, and these features are used to predict multiple anchor boxes. The features from the previous step are then fed into a bidirectional LSTM, which outputs a result of size W × 256, and this result is fed into a 512-dimensional fully connected layer.
In some embodiments of the invention, the recognition unit comprises a CRNN network, which receives the text region and performs text recognition. Assuming that the input image has a size of the form (32, 100, 3), then, from bottom to top: the CNN (convolutional layers) extracts features from the input image with a deep CNN to obtain a feature map; the RNN (recurrent layers) predicts the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a distribution over the predicted labels (ground-truth values); and the CTC loss converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional layers consist of the convolutional and max-pooling layers of a standard CNN model, and automatically extract a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales the input images to the same height before training (the image width remains unchanged), using a height of 32. The vectors in the extracted feature sequence are generated from the feature map from left to right in order; each feature vector represents the features of the image within a certain width, and that width is 1, i.e. a single pixel.
The recurrent layers consist of a bidirectional LSTM recurrent neural network and predict the label distribution (a list of probabilities over the true results) for each feature vector in the feature sequence. The error of the recurrent layers is back-propagated and finally converted into a feature sequence, which is fed back to the convolutional layers.
For a sequence of length T, each time step t outputs a softmax vector at the last layer of the RNN, representing the prediction probabilities at that step (T is much greater than the length of the final label sequence). After the probabilities of all time steps are passed to the CTC model, the most probable labels are output, and the final sequence label is obtained by removing blanks and duplicates.
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a whole text line, which has a higher degree of heterogeneity. Therefore, single characters are first detected, in the manner of R-CNN, and a bidirectional LSTM is then added to the detection network, so that the detection results form a sequence that provides the contextual features of the text; multiple characters can then be combined to obtain the text line.
In some embodiments of the invention, the image-text mixed file is an identity document. For example, resident identification cards may be processed using the principles described above.
In summary, the present invention provides an image-text processing method, system and computer readable storage medium, in which region division, text detection and text recognition are implemented independently of one another and, in some examples, each task is carried out by a neural network with particular strengths for that task, thereby improving the accuracy and efficiency of recognition. It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The above examples mainly illustrate the image-text processing method, system and computer readable storage medium of the invention. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from its spirit or scope. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (15)

1. An image-text processing method is characterized by comprising the following steps:
acquiring an image of a picture-text mixed file, and preprocessing the image;
dividing the preprocessed image into regions through a first neural network;
determining a text region in the divided image through a second neural network; and
and performing text recognition on the text region through a third neural network.
2. The method of claim 1, wherein a portrait area in the divided image is determined and cropped.
3. The method of claim 2, further comprising establishing a mapping of the portrait area to recognized text.
4. The method according to any of claims 1-3, characterized in that the first neural network is a YOLO network, the preprocessed image is input to the YOLO network, and the YOLO network is used to divide it into regions.
5. The method of claim 4, wherein the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region is determined using the CTPN network.
6. The method of claim 5, wherein the third neural network is a CRNN network, and wherein the text region is input to the CRNN network for text recognition using the CRNN network.
7. The method of claim 1, wherein the picture-text mixed file is an identity document.
8. A computer-readable storage medium having instructions stored therein, which when executed by a processor, cause the processor to perform the method of any one of claims 1-7.
9. An image-text processing system, the system comprising:
the image preprocessing unit is configured to acquire an image of the image-text mixed file and preprocess the image;
a dividing unit configured to divide the preprocessed image into regions by a first neural network therein;
a determination unit configured to determine a text region in the divided image through a second neural network therein; and
a recognition unit configured to perform text recognition on the text region through a third neural network therein.
10. The system of claim 9, wherein the determining unit is further configured to determine a portrait area in the divided image and crop.
11. The system according to claim 10, further comprising a mapping unit configured to establish a mapping relationship of the portrait area to the recognized text.
12. The system according to any of claims 9-11, wherein the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and dividing it into regions.
13. The system of claim 12, wherein the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
14. The system of claim 13, wherein the third neural network is a CRNN network that receives the text region and performs text recognition.
15. The system of claim 9, wherein the image-text mixed file is an identity document.
CN202010268468.4A 2020-04-08 2020-04-08 Image-text processing method and system and computer readable storage medium Pending CN111444876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010268468.4A CN111444876A (en) 2020-04-08 2020-04-08 Image-text processing method and system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010268468.4A CN111444876A (en) 2020-04-08 2020-04-08 Image-text processing method and system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111444876A true CN111444876A (en) 2020-07-24

Family

ID=71650124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010268468.4A Pending CN111444876A (en) 2020-04-08 2020-04-08 Image-text processing method and system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111444876A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818949A (en) * 2021-03-09 2021-05-18 浙江天派科技有限公司 Method and system for identifying delivery certificate characters
CN116994270A (en) * 2023-08-28 2023-11-03 乐麦信息技术(杭州)有限公司 Resume analysis method, device, equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196084A1 (en) * 2014-06-20 2015-12-23 Theodore Kuklinski A self-learning system and methods for automatic document recognition, authentication, and information extraction
CN108595544A (en) * 2018-04-09 2018-09-28 深源恒际科技有限公司 A kind of document picture classification method
CN109344815A (en) * 2018-12-13 2019-02-15 深源恒际科技有限公司 A kind of file and picture classification method
CN109389121A (en) * 2018-10-30 2019-02-26 金现代信息产业股份有限公司 A kind of nameplate recognition methods and system based on deep learning
CN110135446A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Method for text detection and computer storage medium
CN110443184A (en) * 2019-07-31 2019-11-12 上海海事大学 ID card information extracting method, device and computer storage medium
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196084A1 (en) * 2014-06-20 2015-12-23 Theodore Kuklinski A self-learning system and methods for automatic document recognition, authentication, and information extraction
CN110135446A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Method for text detection and computer storage medium
CN108595544A (en) * 2018-04-09 2018-09-28 深源恒际科技有限公司 A kind of document picture classification method
CN109389121A (en) * 2018-10-30 2019-02-26 金现代信息产业股份有限公司 A kind of nameplate recognition methods and system based on deep learning
CN109344815A (en) * 2018-12-13 2019-02-15 深源恒际科技有限公司 A kind of file and picture classification method
CN110443184A (en) * 2019-07-31 2019-11-12 上海海事大学 ID card information extracting method, device and computer storage medium
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余若男; 黄定江; 董启文: "Research Progress on Scene Text Detection Based on Deep Learning" (基于深度学习的场景文字检测研究进展), no. 05 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818949A (en) * 2021-03-09 2021-05-18 浙江天派科技有限公司 Method and system for identifying delivery certificate characters
CN116994270A (en) * 2023-08-28 2023-11-03 乐麦信息技术(杭州)有限公司 Resume analysis method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
JP6843086B2 (en) Image processing systems, methods for performing multi-label semantic edge detection in images, and non-temporary computer-readable storage media
CA3027038C (en) Document field detection and parsing
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
US7236632B2 (en) Automated techniques for comparing contents of images
US8594431B2 (en) Adaptive partial character recognition
EP1598770B1 (en) Low resolution optical character recognition for camera acquired documents
US9965695B1 (en) Document image binarization method based on content type separation
CN111401372A (en) Method for extracting and identifying image-text information of scanned document
JP5176763B2 (en) Low quality character identification method and apparatus
US20100111375A1 (en) Method for Determining Atributes of Faces in Images
US20030012438A1 (en) Multiple size reductions for image segmentation
CN111444876A (en) Image-text processing method and system and computer readable storage medium
Verma et al. Removal of obstacles in Devanagari script for efficient optical character recognition
Natei et al. Extracting text from image document and displaying its related information
Hoxha et al. Remote sensing image captioning with SVM-based decoding
Karanje et al. Survey on text detection, segmentation and recognition from a natural scene images
Ghoshal et al. An improved scene text and document image binarization scheme
Kaur et al. Page segmentation in OCR system-a review
JP2017084006A (en) Image processor and method thereof
Valiente et al. A process for text recognition of generic identification documents over cloud computing
Qin et al. Laba: Logical layout analysis of book page images in arabic using multiple support vector machines
Sahota et al. An empirical enhancement using scale invariant feature transform in text extraction from images
CN111353353A (en) Cross-posture face recognition method and device
Shivani Techniques of Text Detection and Recognition: A Survey
CN117649672B (en) Font type visual detection method and system based on active learning and transfer learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination