CN111444876A - Image-text processing method and system and computer readable storage medium - Google Patents
- Publication number: CN111444876A (application CN202010268468.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- network
- neural network
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention relates to an image-text processing method, which comprises the following steps: acquiring an image of an image-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.
Description
Technical Field
The invention relates to techniques for processing image-text mixed files, and in particular to an image-text processing method, system, and computer readable storage medium.
Background
When collecting data, content recognition often needs to be performed on files in which images and text are mixed; for example, to collect identity information, an identity card image must be captured and recognized. However, conventional recognition methods fall short in many respects, such as feature extraction, text region detection, and text recognition.
Disclosure of Invention
Therefore, in order to perform content recognition, and in particular character recognition, on image-text mixed files efficiently and accurately, the invention provides a mechanism for processing such files, specifically as follows:
According to an aspect of the present invention, there is provided an image-text processing method, including the steps of: acquiring an image of an image-text mixed file, and preprocessing the image; dividing the preprocessed image into regions through a first neural network; determining a text region in the divided image through a second neural network; and performing text recognition on the text region through a third neural network.
In some embodiments of the present invention, optionally, a portrait area in the divided image is determined and cropped.
In some embodiments of the present invention, optionally, the method further comprises establishing a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network; the preprocessed image is input to the YOLO network, and the YOLO network divides it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region therein is determined by using the CTPN network.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the text region is input to the CRNN network, and text recognition is performed by using the CRNN network.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
According to another aspect of the present invention, there is provided an image-text processing system, comprising: an image preprocessing unit configured to acquire an image of the image-text mixed file and preprocess the image; a dividing unit configured to divide the preprocessed image into regions through a first neural network therein; a determining unit configured to determine a text region in the divided image through a second neural network therein; and a recognition unit configured to perform text recognition on the text region through a third neural network therein.
In some embodiments of the present invention, optionally, the determining unit is further configured to determine a portrait area in the divided image and perform cropping.
In some embodiments of the present invention, optionally, the system further comprises a mapping unit configured to establish a mapping relationship between the portrait area and the recognized text.
In some embodiments of the invention, optionally, the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and dividing it into regions.
In some embodiments of the present invention, optionally, the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
In some embodiments of the present invention, optionally, the third neural network is a CRNN network, and the CRNN network receives the text region and performs text recognition.
In some embodiments of the invention, optionally, the image-text mixed file is an identity document.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
Fig. 1 shows an image-text processing method according to an embodiment of the invention.
Fig. 2 shows an image-text processing system according to an embodiment of the invention.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein primarily with reference to exemplary embodiments. Those skilled in the art will readily recognize, however, that the same principles apply equally to all types of image-text processing methods, systems, and computer-readable storage media, and that these same or similar principles may be implemented therein; any such variation does not depart from the true spirit and scope of the present patent application.
In the context of the present invention, optical character recognition (OCR) technology refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters on paper or other surfaces, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using character recognition methods. In other words, it is the process of scanning text material, then analyzing and processing the image file to obtain the characters and layout information.
In the context of the present invention, image detection is sometimes referred to as object detection. Its task is to find all objects of interest in an image and determine their positions and sizes; it is one of the core problems in the field of machine vision. Because objects vary in appearance, shape, and pose, and are subject to interference from imaging illumination, occlusion, and other factors, object detection has always been among the most challenging problems in machine vision.
Image recognition, sometimes referred to in the context of the present invention as image classification, can be viewed as taking a test picture as input and outputting the class to which the picture belongs. Text recognition in the context of the present invention refers to recognizing the content of a text line once the line has been located, converting the text information in the image into character information; the main problem it must solve is determining what each character is. A convolutional neural network in the context of the present invention is a class of feedforward neural networks that performs convolution operations and has a deep structure; it is one of the representative algorithms of deep learning. A convolutional feature map in the context of the present invention refers to the output of a convolutional neural network, used to extract abstract features of an image.
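The notion of a convolutional feature map described above can be illustrated with a minimal sketch (the 5×5 input and averaging kernel are arbitrary toy values, not taken from the patent):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image ('valid' mode) to produce a feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                     # simple averaging kernel
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # a 5x5 input and a 3x3 kernel yield a 3x3 feature map
```

A real CNN stacks many such kernels with learned weights; the principle of producing a spatial map of responses is the same.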
In the conventional technical scheme, the image file of the text material is analyzed and recognized through optical character recognition to obtain the characters and layout information; that is, the text in the image is recognized and returned in text form. More specifically, OCR of identification cards proceeds in two stages: the first stage is text detection, and the second is text recognition.
Methods based on digital image processing and traditional machine learning use image detection techniques and feed-forward neural networks to process the image and extract features. The commonly used binarization step helps enhance text information in simple scenes, but it is of very little benefit against complex backgrounds.
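As a toy illustration of the binarization step mentioned above (a fixed threshold on a small grayscale array; real systems would typically use Otsu or adaptive thresholding):

```python
import numpy as np

def binarize(gray, threshold=128):
    """Map pixels above the threshold to 255 (background) and the rest to 0 (text)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

gray = np.array([[200, 90], [30, 250]], dtype=np.uint8)  # toy grayscale patch
binary = binarize(gray)
print(binary)  # [[255, 0], [0, 255]]
```

On a clean document this cleanly separates ink from paper; on a complex background (as the text notes) a single global threshold fails, which is one motivation for the learned approach that follows.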
However, unlike object detection in everyday scenes, the distribution of text images is closer to a uniform distribution than to a normal distribution; that is, an average image of the text population does not represent the features of the abstract concept "text". In addition, the aspect ratios of text differ from those of everyday objects, which can make the usual anchor box candidates inapplicable. The orientation of the text also cannot be assumed in advance, and existing models do not perform well on non-horizontal text. Finally, some structures that frequently appear in natural scenes look very similar to characters, which raises the false positive rate. Existing models therefore require adjustment.
Traditional segmentation-based methods can produce multiple recognition results depending on where the split falls; for example, an improper split may break a single Chinese character into two smaller component characters. A common remedy is to over-segment the candidate characters so that they are broken apart thoroughly, and then merge the resulting fragments through dynamic programming to obtain the optimal combination; the loss function for this merging, however, must be designed by hand.
According to an aspect of the present invention, there is provided an image-text processing method. Fig. 1 shows such a method according to an embodiment of the invention; as shown, the method comprises the following steps. In step 102, an image of the image-text mixed file is obtained and preprocessed. The image-text mixed file referred to herein is a file in which pictures and text are arranged together; it also covers files consisting only of text, which exist as a special case of the image-text mixed file (after recognition, such a file simply contains no picture content). Specific examples of image-text mixed files include a passport page, a resident identity card, a job resume, and the like. The preprocessing includes color correction, distortion correction, and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
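One very simple form of the color correction mentioned above is contrast stretching; the following sketch (the function and values are illustrative, not the patent's algorithm) normalizes pixel intensities to the full 0-255 range:

```python
import numpy as np

def normalize_contrast(image):
    """Stretch pixel intensities to the full 0-255 range (a simple color correction)."""
    lo, hi = image.min(), image.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros_like(image, dtype=np.uint8)
    scaled = (image.astype(float) - lo) / (hi - lo) * 255.0
    return scaled.astype(np.uint8)

img = np.array([[50, 100], [150, 200]], dtype=np.uint8)  # low-contrast toy patch
out = normalize_contrast(img)
print(out.min(), out.max())  # 0 255
```

Distortion correction (e.g. deskewing a photographed card) would typically be handled with a perspective transform and is omitted here.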
In step 104, the preprocessed image is divided into regions. This step divides the preprocessed image into a plurality of regions (which also includes the special case of a single region). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single frame selection may serve as one region; in other examples, the content of multiple frame selections may be combined into one region. For example, the image may be divided into regions by a neural network, and the division result passed downstream.
In step 106, text regions in the divided image are determined. From the above, one or more regions have been boxed in step 104. In this step, text regions can be found from these regions based on the unique properties of the text. For example, which of them is a text region may be determined by a neural network, and the result of the determination is passed downstream.
In step 108, text recognition is performed on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, word recognition may be performed by a neural network.
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding steps can be completed by using the corresponding neural networks respectively. The method can greatly improve the accuracy and efficiency of character recognition.
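The three-stage structure of steps 104 through 108 can be sketched schematically as follows; the stage functions are placeholders standing in for the first, second and third neural networks, and the toy data is illustrative only:

```python
# Schematic of the three-stage pipeline. Each stage is independent and could
# be replaced by its corresponding neural network (e.g. YOLO, CTPN, CRNN).
def divide_into_regions(image):          # stands in for the first network
    return [{"kind": "photo", "data": image[:2]},
            {"kind": "unknown", "data": image[2:]}]

def find_text_regions(regions):          # stands in for the second network
    return [r for r in regions if r["kind"] != "photo"]

def recognize_text(region):              # stands in for the third network
    return "".join(region["data"])

image = ["p", "p", "h", "i"]             # toy stand-in for a preprocessed image
regions = divide_into_regions(image)
texts = [recognize_text(r) for r in find_text_regions(regions)]
print(texts)  # ['hi']
```

Because each stage only consumes the previous stage's output, any one network can be retrained or swapped without touching the others, which is the independence the paragraph above emphasizes.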
In some embodiments of the present invention, a portrait area in the divided image is determined and cropped. For an image-text mixed file, it is sometimes necessary to extract the portrait and other image information associated with the text recognition result; in some examples, the portrait or other image may be cropped out.
In some embodiments of the invention, the method further comprises establishing a mapping between the portrait area and the recognized text. In some cases, the various pieces of content in the collected image-text file can be linked. For example, if the mixed-layout document is a passport page, the passport holder's photo may be associated with his or her identity information; specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval: on the one hand, matching a portrait to a specific photo can determine the identity information; on the other hand, the photo data can be retrieved in reverse from the identity information.
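The two-way mapping between a portrait area and its recognized text can be sketched with a plain dictionary (identifiers and field contents below are illustrative, not from the patent):

```python
# Minimal sketch of the portrait-to-text mapping.
records = {}

def register(portrait_id, recognized_text):
    """Associate a cropped portrait with the text recognized from the same file."""
    records[portrait_id] = recognized_text

register("portrait_001", "Name: ZHANG SAN")

# Forward lookup: a matched portrait yields the identity text.
print(records["portrait_001"])                         # Name: ZHANG SAN

# Reverse lookup: identity text leads back to the photo record.
owners = {text: pid for pid, text in records.items()}
print(owners["Name: ZHANG SAN"])                       # portrait_001
```

A production system would use a database with indexed columns for both directions, but the retrieval pattern is the same.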
In some embodiments of the present invention, the preprocessed image is input to the YOLO network, and the YOLO network divides it into regions. The YOLO algorithm is split into two parts: Darknet-53, which serves as the backbone network, and the detection-regression part.
Darknet-53 follows a classic convolutional structure: layers 0 through 74 comprise 53 convolutional layers in total, and the remaining layers are residual (res) layers.
The detection-regression part is built from the feature-interaction layers of the YOLO network, layers 75 through 105, and is divided into three scales; within each scale, local feature interaction is realized by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify their categories, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into an S×S grid. If the center coordinate of an object's annotated bounding box falls into a grid cell, that cell's confidence is set to 1. Each grid cell predicts B bounding boxes with their confidences, as well as C class probabilities.
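A small sketch of the grid bookkeeping described above (the S=7, B=2, C=20 values are the classic YOLOv1 settings, used here only as an example):

```python
def yolo_output_shape(S, B, C):
    """Each of the SxS cells predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Grid cell (row, col) whose confidence target is 1 when a box center (cx, cy) falls inside it."""
    return (int(cy / img_h * S), int(cx / img_w * S))

# Classic YOLOv1 settings give a 7x7x30 output tensor.
print(yolo_output_shape(7, 2, 20))                          # (7, 7, 30)
# A box centered at (300, 100) in a 448x448 image lands in cell (1, 4).
print(responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=7))  # (1, 4)
```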
The training process employs an Adam optimizer. Adam (adaptive moment estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradients. The advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter updates relatively stable.
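A minimal NumPy sketch of one Adam update, including the bias correction mentioned above (hyperparameters are the common defaults aside from the illustrative learning rate; the quadratic objective is a toy example):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) and RMS (v) accumulators with bias correction."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(round(theta, 3))  # close to the minimum at 0
```

Note how the per-step displacement is roughly bounded by `lr` regardless of the gradient's raw magnitude, which is the stability property the paragraph above describes.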
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed with VGG16 as the backbone network, producing a convolutional feature map of size W×H×C. Second, a 3×3 sliding window extracts features from this feature map, and these features are used to predict multiple anchor boxes. The features from the previous step are then input into a bidirectional LSTM, which outputs a result of size W×256; this result is input into a 512-dimensional fully connected layer. Finally, the output of the fully connected layer is used for classification and regression.
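The anchor boxes referred to above are, in the commonly cited CTPN design, vertical proposals of fixed 16-pixel width whose heights follow a geometric series; the sketch below reproduces that series (the values follow the CTPN paper, not text from this patent):

```python
ANCHOR_WIDTH = 16  # every CTPN anchor is 16 pixels wide at the input scale

def ctpn_anchor_heights(n=10, start=11.0, ratio=0.7):
    """Heights of the k vertical anchors: start at 11 px, divide by 0.7 each step."""
    heights, h = [], start
    for _ in range(n):
        heights.append(round(h))
        h /= ratio
    return heights

heights = ctpn_anchor_heights()
print(len(heights), heights[0], heights[-1])  # 10 anchors, from 11 px up to 273 px
```

Fixing the width and only varying the height is what lets the bidirectional LSTM later stitch the narrow proposals into full text lines.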
In some embodiments of the invention, a text region is input to a CRNN network, and text recognition is performed with the CRNN. Assuming the input image has shape (32, 100, 3), the network works from bottom to top: the CNN (convolutional layers) extracts features from the input image with a deep CNN to obtain a feature map; the RNN (recurrent layers) predicts over the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a predicted label (ground-truth) distribution; and the CTC loss converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional part consists of the convolutional and max-pooling layers of a standard CNN model and automatically extracts a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales input images to the same height before training, here 32 (the image width is left unchanged). The vectors of the extracted feature sequence are generated from the feature map in order from left to right; each feature vector represents the features of the image over a certain width, here a width of 1, i.e., a single pixel column.
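The column-wise reading of the feature map described above can be sketched as follows (the 512-channel, 1×26 shape matches the configuration commonly reported for a 100-pixel-wide CRNN input, assumed here for illustration):

```python
import numpy as np

# Stand-in for the output of the CRNN convolutional stack on a (32, 100, 3) input:
# shape (channels, height, width) = (512, 1, 26).
feature_map = np.random.rand(512, 1, 26)

# Each 1-pixel-wide column becomes one feature vector of the sequence
# fed to the recurrent layers, read left to right.
sequence = [feature_map[:, 0, t] for t in range(feature_map.shape[2])]
print(len(sequence), sequence[0].shape)  # 26 time steps, each a 512-dim vector
```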
The recurrent part consists of a bidirectional LSTM recurrent neural network, which predicts the label distribution (a probability list over the true results) for each feature vector in the feature sequence. The error of the recurrent layers is back-propagated and finally converted into a feature sequence that is fed back to the convolutional layers.
For a sequence of length T, the RNN outputs a softmax vector at each time step t, representing the prediction probabilities at that step (T is far greater than the length of the final label). After the probabilities for all time steps are passed to the CTC model, the most likely labels are output, and the final sequence label is obtained by removing blanks and duplicates.
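The de-duplication and blank-removal decoding described above can be sketched as a greedy CTC decoder (label indices below are illustrative, with 0 as the blank):

```python
def ctc_greedy_decode(argmax_path, blank=0):
    """Collapse repeated labels, then drop blanks: turns per-time-step
    predictions into the final label sequence."""
    decoded, prev = [], None
    for label in argmax_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-time-step argmax spelling "h h - e - l l - l o" with '-' as blank (0)
# and letters indexed a=1..z=26 (h=8, e=5, l=12, o=15).
path = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(path))  # [8, 5, 12, 12, 15], i.e. "hello"
```

The blank between the two `l` groups is what lets the decoder keep the doubled letter while still collapsing genuine repeats.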
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a whole, highly heterogeneous text line. Therefore, single characters are first detected in the manner of R-CNN, and a bidirectional LSTM is then added to the detection network so that the detection results form a sequence carrying the contextual features of the text; multiple characters can then be combined to obtain the text line.
In some embodiments of the invention, the image-text mixed file is an identity document. For example, resident identity cards may be processed using the principles described above.
According to another aspect of the present invention, there is provided a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform any one of the methods as described above.
According to another aspect of the present invention, there is provided an image-text processing system. As shown in fig. 2, the system 20 comprises a preprocessing unit 202, a dividing unit 204, a determining unit 206 and a recognition unit 208.
The preprocessing unit 202 is configured to obtain an image of the image-text mixed file and preprocess the image. The image-text mixed file referred to herein is a file in which pictures and text are arranged together; it also covers files consisting only of text, which exist as a special case of the image-text mixed file (after recognition, such a file simply contains no picture content). Specific examples of image-text mixed files include a passport page, a resident identity card, a job resume, and the like. The preprocessing includes color correction, distortion correction, and the like; after preprocessing, the acquired image is better suited to the subsequent operations.
The dividing unit 204 is configured to divide the preprocessed image into regions. The dividing unit 204 divides the preprocessed image into a plurality of regions (including a special case of a single region as well). In particular, regions in the image may be framed according to their characteristics. In some examples, the content of a single box may serve as one area; in other examples, the content of multiple frame selections may be defined as one region. For example, the image may be divided into regions by a neural network included in the dividing unit 204, and the division result may be transferred to other unit modules.
The determination unit 206 is configured to determine a text region in the divided image. From the above, it can be seen that the partitioning unit 204 has outlined one or more regions. At this time, the determination unit 206 may find the text region from these regions according to the unique attribute of the text. For example, which of them is the text region may be determined by a neural network included in the determination unit 206, and the result of the determination is transferred to the other unit modules.
The recognition unit 208 is configured to perform text recognition on the text region. After the text areas are found, the content in the text areas can be subjected to character recognition. For example, the character recognition may be performed by a neural network included in the recognition unit 208.
As can be seen from the above, in some embodiments of the present invention, the region division, the text region determination, and the text recognition are completely independent, and the tasks of the corresponding unit modules can be specifically completed by using the corresponding neural networks. The method can greatly improve the accuracy and efficiency of character recognition.
In some embodiments of the invention, the determining unit is further configured to determine a portrait area in the divided image and crop it. For an image-text mixed file, it is sometimes necessary to extract the portrait and other image information associated with the text recognition result; in some examples, the portrait or other image may be cropped out.
In some embodiments of the invention, the system further comprises a mapping unit configured to establish a mapping between the portrait area and the recognized text. In some cases, the various pieces of content in the collected image-text file can be linked. For example, if the mixed-layout document is a passport page, the passport holder's photo may be associated with his or her identity information; specifically, a mapping relationship between the photo and the identity information may be established. This information can be used for subsequent retrieval: on the one hand, matching a portrait to a specific photo can determine the identity information; on the other hand, the photo data can be retrieved in reverse from the identity information.
In some embodiments of the present invention, the dividing unit comprises a YOLO network, which receives the preprocessed image and divides it into regions. The YOLO algorithm is split into two parts: Darknet-53, which serves as the backbone network, and the detection-regression part.
Darknet-53 follows a classic convolutional structure: layers 0 through 74 comprise 53 convolutional layers in total, and the remaining layers are residual (res) layers.
The detection-regression part is built from the feature-interaction layers of the YOLO network, layers 75 through 105, and is divided into three scales; within each scale, local feature interaction is realized by means of convolution kernels. The function of this part is to regress the coordinates of the objects to be detected and classify their categories, based on the features extracted by the backbone network.
Based on the convolutional feature map of the backbone network, YOLO divides the input image into an S×S grid. If the center coordinate of an object's annotated bounding box falls into a grid cell, that cell's confidence is set to 1. Each grid cell predicts B bounding boxes with their confidences, as well as C class probabilities.
The training process employs an Adam optimizer. Adam (adaptive moment estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradients. The advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter updates relatively stable.
In some examples of the invention, the processing steps using CTPN are as follows. First, feature extraction is performed using VGG16 as the backbone network, yielding a convolutional feature map of size W×H×C. Second, features are extracted from that feature map with a 3×3 sliding window and used to predict multiple anchor boxes. The features are then input to a bidirectional LSTM, which outputs a result of size W×256, and this result is fed into a 512-dimensional fully connected layer.
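The tensor shapes flowing through those steps can be traced in a small sketch. The concrete sizes and the anchor count per position are assumptions for illustration; the text only fixes the 3×3 window, the 256-wide BLSTM output and the 512-d fully connected layer:

```python
def ctpn_shapes(W, H, C=512, anchors_per_pos=10):
    """Shape-level walk-through of the CTPN steps described above.
    C and anchors_per_pos are illustrative assumptions."""
    fmap = (H, W, C)                  # VGG16 convolutional feature map
    window = (H, W, 3 * 3 * C)        # 3x3 sliding-window features
    blstm = (H, W, 256)               # bidirectional LSTM output, W x 256 per row
    fc = (H, W, 512)                  # 512-dimensional fully connected layer
    anchors = H * W * anchors_per_pos # anchor boxes predicted over the map
    return fmap, window, blstm, fc, anchors
```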
In some embodiments of the invention, the recognition unit comprises a CRNN network, which receives text regions and performs text recognition. Assuming the input image has shape (32, 100, 3), the network consists, from bottom to top, of: a CNN (convolutional layers), which extracts features from the input image with a deep CNN to obtain a feature map; an RNN (recurrent layers), which predicts over the feature sequence with a bidirectional RNN (BLSTM), learning each feature vector in the sequence and outputting a distribution over the predicted labels (true values); and a CTC loss, which converts the series of label distributions obtained from the recurrent layers into the final label sequence.
The CRNN convolutional layers consist of the convolutional and max-pooling layers of a standard CNN model and automatically extract a feature sequence from the input image. Unlike an ordinary CNN, CRNN scales all input images to the same height before training (the image width remains unchanged), using a height of 32. The vectors in the extracted feature sequence are generated from the feature map in order from left to right; each feature vector represents a feature of the image over a certain width, here a width of 1, i.e. a single pixel column.
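The left-to-right column slicing described above can be sketched directly: each width position of the feature map becomes one vector of the sequence. The feature-map dimensions are illustrative assumptions:

```python
import numpy as np

def feature_map_to_sequence(fmap):
    """Turn a (channels, height, width) feature map into a left-to-right
    sequence of per-column feature vectors, one per width-1 slice."""
    c, h, w = fmap.shape
    return [fmap[:, :, t].reshape(-1) for t in range(w)]

# Assumed conv output for a height-32 input after pooling: 512 channels,
# height collapsed to 1, width 25 (values are illustrative only).
fmap = np.zeros((512, 1, 25))
seq = feature_map_to_sequence(fmap)
```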
The recurrent layers consist of a bidirectional LSTM recurrent neural network, which predicts the label distribution (a list of probabilities over the true results) of each feature vector in the feature sequence. The error of the recurrent layers is back-propagated, finally converted into a feature sequence, and fed back to the convolutional layers.
For a sequence of length T, each time step t outputs a softmax vector at the last layer of the RNN, representing the prediction probabilities of that time step (T is generally far greater than the length of the final label sequence). After the probabilities of all time steps are passed to the CTC model, the most probable labels are output, and the final sequence label is obtained by removing blanks and collapsing duplicates.
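The blank-removal and de-duplication step above corresponds to greedy CTC decoding: take the most probable label at each time step, collapse consecutive repeats, then drop blanks. A minimal sketch (label indices and the blank index 0 are assumptions):

```python
def ctc_greedy_decode(probs, blank=0):
    """probs: T lists of per-label probabilities (one list per time step).
    Returns the decoded label sequence after collapsing repeats and
    removing the blank label."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax path
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```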
In one or more of the above examples, the basic assumption of CTPN is that a single character is easier to detect than a text line, which has a higher degree of heterogeneity. Therefore, single characters are first detected, as in R-CNN, and a bidirectional LSTM is then added to the detection network so that the detection results form a sequence providing the contextual features of the text; multiple characters can then be combined to obtain text lines.
In some embodiments of the invention, the teletext file is an identity document. For example, resident identification cards may be processed using the principles described above.
In summary, the present invention provides an image-text processing method, system and computer-readable storage medium that implement region division, text detection and text recognition independently and, in some examples, perform each step with a neural network that has specific processing advantages, thereby improving the accuracy and efficiency of recognition. It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The above examples mainly illustrate the teletext processing method, system and computer-readable storage medium of the invention. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims (15)
1. An image-text processing method is characterized by comprising the following steps:
acquiring an image of a picture-text mixed file, and preprocessing the image;
dividing the preprocessed image into regions through a first neural network;
determining a text region in the divided image through a second neural network; and
and performing text recognition on the text region through a third neural network.
2. The method of claim 1, wherein a portrait area in the divided image is determined and cropped.
3. The method of claim 2, further comprising establishing a mapping of the portrait area to recognized text.
4. The method according to any of claims 1-3, characterized in that the first neural network is a YOLO network, the preprocessed image is input to the YOLO network, and the image is regionally partitioned with the YOLO network.
5. The method of claim 4, wherein the second neural network is a CTPN network, the divided image is input to the CTPN network, and the text region is determined using the CTPN network.
6. The method of claim 5, wherein the third neural network is a CRNN network, and wherein the text region is input to the CRNN network for text recognition using the CRNN network.
7. The method of claim 1, wherein the teletext file is an identity document.
8. A computer-readable storage medium having instructions stored therein, which when executed by a processor, cause the processor to perform the method of any one of claims 1-7.
9. An image-text processing system, the system comprising:
the image preprocessing unit is configured to acquire an image of the image-text mixed file and preprocess the image;
a dividing unit configured to divide the preprocessed image into regions by a first neural network therein;
a determination unit configured to determine a text region in the divided image through a second neural network therein; and
a recognition unit configured to perform text recognition on the text region through a third neural network therein.
10. The system of claim 9, wherein the determining unit is further configured to determine a portrait area in the divided image and crop it.
11. The system according to claim 10, further comprising a mapping unit configured to establish a mapping relationship of the portrait area to the recognized text.
12. The system according to any of claims 9-11, wherein the first neural network is a YOLO network, the YOLO network receiving the preprocessed image and partitioning it into regions.
13. The system of claim 12, wherein the second neural network is a CTPN network that receives the partitioned image and determines text regions therein.
14. The system of claim 13, wherein the third neural network is a CRNN network that receives the text region and performs text recognition.
15. The system of claim 9, wherein the teletext file is an identity document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010268468.4A CN111444876A (en) | 2020-04-08 | 2020-04-08 | Image-text processing method and system and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111444876A true CN111444876A (en) | 2020-07-24 |
Family
ID=71650124
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010268468.4A Pending CN111444876A (en) | 2020-04-08 | 2020-04-08 | Image-text processing method and system and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444876A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818949A (en) * | 2021-03-09 | 2021-05-18 | 浙江天派科技有限公司 | Method and system for identifying delivery certificate characters |
CN116994270A (en) * | 2023-08-28 | 2023-11-03 | 乐麦信息技术(杭州)有限公司 | Resume analysis method, device, equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015196084A1 (en) * | 2014-06-20 | 2015-12-23 | Theodore Kuklinski | A self-learning system and methods for automatic document recognition, authentication, and information extraction |
CN108595544A (en) * | 2018-04-09 | 2018-09-28 | 深源恒际科技有限公司 | A kind of document picture classification method |
CN109344815A (en) * | 2018-12-13 | 2019-02-15 | 深源恒际科技有限公司 | A kind of file and picture classification method |
CN109389121A (en) * | 2018-10-30 | 2019-02-26 | 金现代信息产业股份有限公司 | A kind of nameplate recognition methods and system based on deep learning |
CN110135446A (en) * | 2018-02-09 | 2019-08-16 | 北京世纪好未来教育科技有限公司 | Method for text detection and computer storage medium |
CN110443184A (en) * | 2019-07-31 | 2019-11-12 | 上海海事大学 | ID card information extracting method, device and computer storage medium |
CN110956171A (en) * | 2019-11-06 | 2020-04-03 | 广州供电局有限公司 | Automatic nameplate identification method and device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
余若男; 黄定江; 董启文: "Research Progress in Deep-Learning-Based Scene Text Detection" (基于深度学习的场景文字检测研究进展), no. 05 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6843086B2 (en) | Image processing systems, methods for performing multi-label semantic edge detection in images, and non-temporary computer-readable storage media | |
CA3027038C (en) | Document field detection and parsing | |
WO2019192397A1 (en) | End-to-end recognition method for scene text in any shape | |
US7236632B2 (en) | Automated techniques for comparing contents of images | |
US8594431B2 (en) | Adaptive partial character recognition | |
EP1598770B1 (en) | Low resolution optical character recognition for camera acquired documents | |
US9965695B1 (en) | Document image binarization method based on content type separation | |
CN111401372A (en) | Method for extracting and identifying image-text information of scanned document | |
JP5176763B2 (en) | Low quality character identification method and apparatus | |
US20100111375A1 (en) | Method for Determining Atributes of Faces in Images | |
US20030012438A1 (en) | Multiple size reductions for image segmentation | |
CN111444876A (en) | Image-text processing method and system and computer readable storage medium | |
Verma et al. | Removal of obstacles in Devanagari script for efficient optical character recognition | |
Natei et al. | Extracting text from image document and displaying its related information | |
Hoxha et al. | Remote sensing image captioning with SVM-based decoding | |
Karanje et al. | Survey on text detection, segmentation and recognition from a natural scene images | |
Ghoshal et al. | An improved scene text and document image binarization scheme | |
Kaur et al. | Page segmentation in OCR system-a review | |
JP2017084006A (en) | Image processor and method thereof | |
Valiente et al. | A process for text recognition of generic identification documents over cloud computing | |
Qin et al. | Laba: Logical layout analysis of book page images in arabic using multiple support vector machines | |
Sahota et al. | An empirical enhancement using scale invariant feature transform in text extraction from images | |
CN111353353A (en) | Cross-posture face recognition method and device | |
Shivani | Techniques of Text Detection and Recognition: A Survey | |
CN117649672B (en) | Font type visual detection method and system based on active learning and transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||