Method and system for positioning and segmenting overlapped text lines based on deep learning
Technical Field
The invention relates to the field of computer vision, in particular to a method and a system for positioning and segmenting overlapped text lines based on deep learning.
Background
In many application scenarios, there is a need to digitize document image content to generate structured data and complete automated entry. Such requirements can be addressed using OCR (Optical Character Recognition) techniques. Generally, OCR technology includes two major steps: text detection and text recognition. Conventional text detection methods typically employ Connected Component Analysis (CCA) or Sliding Window (SW) detection mechanisms. These methods usually require the manual design of a series of rules to extract low-level or mid-level image features, combined with complex pre-processing and post-processing procedures, to complete the text detection task. Limited by the weak feature representation capability of manually designed rules and by their complex processing flows, traditional methods struggle to achieve high performance, especially in difficult recognition scenes such as blurred characters, overlapped characters, and scene text with complex backgrounds.
In recent years, deep learning techniques have developed rapidly and been successfully applied to text detection and recognition tasks. In essence, deep learning belongs to the family of feature learning algorithms: it approximates a latent functional mapping from input to output by automatically learning and extracting features of the input objects (images, text, etc.) and fitting specific target output labels. A deep learning model is usually composed of a series of sequential operations that must be differentiable, so that end-to-end training optimization can be achieved using optimization methods such as gradient descent.
Although deep learning techniques have greatly improved the performance of document text detection algorithms, and even of difficult scene text detection tasks, it must be acknowledged that some especially difficult text detection tasks still pose great challenges, such as the detection of overlapped text lines, as shown in fig. 1. Such overlapping text lines appear in large numbers in images of tickets, forms, documents, and the like, often caused by offset, skewed, or even nested printing. If the detection and recognition of such text can be solved well, the performance of structured entry is greatly improved, so the method has great practical application value.
Disclosure of Invention
The invention relates to a method for positioning and segmenting overlapped text lines based on deep learning, which can well solve the problem of detecting overlapped text lines in various types of bill, form, and document images captured by scanners, high-speed document cameras, and mobile phones, provide more accurate text line region information for subsequent recognition tasks, improve the overall recognition accuracy, and thereby complete the automated entry of structured data with high quality.
According to a first aspect of the present invention, there is provided a method for positioning and segmenting overlapped text lines based on deep learning, the method comprising the following steps:
step 1, inputting an original image containing overlapped text lines, and preprocessing the original image;
step 2, training an instance segmentation fully convolutional neural network, inputting the preprocessed original image into the trained instance segmentation fully convolutional neural network, and outputting a non-overlapping text line region feature score map, an overlapping text line region feature score map, and a feature score map of link information among text line region pixels;
step 3, acquiring outlines of the non-overlapping text line region and the overlapping text line region by a connected domain analysis method based on the non-overlapping text line region feature score map, the overlapping text line region feature score map and the link information feature score map among the text line region pixels;
step 4, merging the non-overlapping text line regions and the overlapping text line regions according to the outlines of the non-overlapping text line regions and the overlapping text line regions;
and step 5, performing quadrilateral fitting on the merged text line region to obtain a circumscribed quadrilateral of the text line region, thereby realizing the positioning and segmentation of the overlapped text lines.
Further, the step 1 specifically includes: performing boundary completion on the input original image to a multiple of N units, and then performing 1/M down-sampling to obtain the preprocessed original image, wherein M and N are integers not less than 1, and N is an integral multiple of M.
Further, the step 2 specifically includes:
step 21: labeling each sample image in the training sample set by using quadrilaterals to represent the outlines of the text line regions, and generating a label file;
step 22: sending the label file and the sample images into the instance segmentation fully convolutional network for training, wherein, in order to complete supervised learning of the overlapped text lines, the network automatically calculates the outlines of the overlapped text line regions from the outlines of the text line regions in the label file, takes these outlines as the supervised learning targets for the overlapped text line regions, and completes the training process in combination with the outlines of the non-overlapping text line regions to form a preliminary training model;
step 23: testing the preliminary training model on a test sample set and evaluating the detection and segmentation accuracy for the non-overlapping and overlapping text line regions; if the accuracy requirement is met, terminating the training process and taking the preliminary training model as the trained instance segmentation fully convolutional neural network; if the accuracy requirement is not met, increasing the training sample size, adjusting the structure and training parameters of the instance segmentation fully convolutional network, and repeating the training process until a trained instance segmentation fully convolutional neural network meeting the accuracy requirement is obtained;
step 24: inputting the preprocessed original image into the trained instance segmentation fully convolutional neural network, and outputting a non-overlapping text line region feature score map, an overlapping text line region feature score map, and a feature score map of link information among text line region pixels.
Further, the step 3 specifically includes:
step 31: setting a first threshold for the non-overlapping text line region feature score map, a second threshold for the overlapping text line region feature score map, and a third threshold for the feature score map of link information among text line region pixels;
step 32: performing binarization on the non-overlapping text line region feature score map according to the first threshold, on the overlapping text line region feature score map according to the second threshold, and on the feature score map of link information among text line region pixels according to the third threshold, thereby obtaining non-overlapping text line region pixels and background pixels in the non-overlapping text line region feature score map, overlapping text line region pixels and background pixels in the overlapping text line region feature score map, and link state information and non-link state information in the feature score map of link information among text line region pixels;
step 33: combining the non-overlapping text line region pixels with the link state information to obtain non-overlapping text line pixel regions, combining the overlapping text line region pixels with the link state information to obtain overlapping text line pixel regions, and expressing the outlines of the pixel regions as connected domains.
Further, the value ranges of the first threshold, the second threshold, and the third threshold are all [0, 1].
Further, the step 4 specifically includes:
step 41: combining the non-overlapping text line pixel regions and the overlapping text line pixel regions;
step 42: judging the adjacency information between neighboring pixels in combination with the feature score map of link information among text line region pixels, and merging two pixels into one connected domain when the two pixels are adjacent and their link information is positive;
further, two pixel points being adjacent means that the two pixel points differ by 1 to 3 pixels along the X-direction or Y-direction pixel coordinate axis.
step 43: adopting a merging strategy based on a variable distance threshold, obtaining an optimal distance threshold on a distance threshold test set through dynamic distance threshold search with end-to-end detection accuracy as the criterion, and merging two connected domains if the distance between them is within the optimal distance threshold.
According to a second aspect of the present invention, there is provided an overlapped text line positioning and segmenting device based on deep learning, comprising the following components:
an original image input means for inputting an original image containing overlapping text lines, and preprocessing the original image;
the feature score map output component is used for inputting the preprocessed original image into a trained instance segmentation fully convolutional neural network and outputting a non-overlapping text line region feature score map, an overlapping text line region feature score map, and a feature score map of link information among text line region pixels;
the outline acquisition component is used for acquiring outlines of the non-overlapped text line region and the overlapped text line region based on the non-overlapped text line region feature score map, the overlapped text line region feature score map and the link information feature score map among the text line region pixels by a connected domain analysis method;
the region merging component is used for merging the non-overlapping text line regions and the overlapping text line regions according to the outlines of the non-overlapping text line regions and the overlapping text line regions;
and the result output component is used for performing quadrilateral fitting on the text line region to obtain an external quadrilateral of the text line region and realize the positioning segmentation of the overlapped text lines.
According to a third aspect of the present invention, there is provided a deep learning based overlapping text line location segmentation system, the system comprising:
a processor and a memory for storing executable instructions;
wherein the processor is configured to execute the executable instructions to perform the method of deep learning based overlapping text line location segmentation according to any of the preceding aspects.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program,
the computer program, when executed by a processor, implements a method of deep learning based overlapping text line localization segmentation as described in any of the preceding aspects.
The invention has the beneficial effects that:
1. the method, based on instance segmentation with a deep fully convolutional network, automatically extracts and learns image features, avoiding difficult manual rule design and complex pre-processing and post-processing flows;
2. the method can adapt to different types of document images and different styles of text line overlap, solving a problem that traditional methods cannot solve;
3. the designed fully convolutional network finally outputs score maps representing prediction confidences, and these confidences can effectively guide subsequent recognition and even structuring work;
4. the method is simple and efficient: the whole process consists of an FCN network and simple, efficient post-processing logic, meeting the requirements of practical application;
5. the labeling and training process is simple, and training can be completed efficiently without specially labeling the overlapped regions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
FIG. 1 illustrates common prior-art overlapped text lines in a marketplace scenario;
FIG. 2 illustrates a flow chart of a method for deep learning based segmentation of overlapping text line locations in accordance with the present invention;
FIG. 3 illustrates an example of overlapping text lines in the method for segmentation based on deep learning for locating overlapping text lines according to the present invention;
FIG. 4 is a graph illustrating feature scores of non-overlapping text line regions in the method for segmentation of overlapping text line positioning based on deep learning according to the present invention;
FIG. 5 is a diagram illustrating feature scores of overlapped text line regions in the method for positioning and segmenting overlapped text lines based on deep learning according to the present invention;
FIG. 6 is a schematic diagram illustrating an optimal threshold search process in the method for segmentation of overlapped text lines based on deep learning according to the present invention;
FIG. 7 is a schematic diagram illustrating an exemplary enclosing quadrilateral of an overlapped text line in the method for positioning and segmenting the overlapped text line based on deep learning according to the present invention;
FIG. 8 is a diagram illustrating the effect of the method for locating and segmenting the overlapped text lines based on deep learning according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The term "a plurality" means two or more.
It should be understood that the term "and/or" as used in this disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist simultaneously, or B exists alone.
The invention relates to a method for positioning and segmenting overlapped text lines based on deep learning, which comprises the following steps:
inputting an original image containing overlapped text lines, and preprocessing the original image;
inputting the preprocessed original image into a trained instance segmentation fully convolutional neural network, and outputting a non-overlapping text line region feature score map, an overlapping text line region feature score map, and a feature score map of link information among text line region pixels;
acquiring outlines of the non-overlapping text line region and the overlapping text line region by a connected domain analysis method based on the non-overlapping text line region feature score map, the overlapping text line region feature score map and a link information feature score map among text line region pixels;
merging the non-overlapping text line regions and the overlapping text line regions according to the outlines of the non-overlapping text line regions and the overlapping text line regions;
and performing quadrilateral fitting on the text line region to obtain an external quadrilateral of the text line region, and realizing the positioning segmentation of the overlapped text lines.
Examples
The invention relates to a precise method for positioning and segmenting overlapped text lines. Aiming at the problem of detecting overlapped text lines in bill, form, and document images, the team creatively adopts an instance segmentation fully convolutional neural network model, integrates the detection of text regions and the segmentation of different text line instance regions into one neural network, and adopts an innovative post-processing method that combines the text line segmentation map and the overlapped region segmentation map to generate complete and accurate text line instance outlines, so as to accurately determine the coordinates of the different text line regions.
The specific flow is shown in fig. 2:
the first step of image preprocessing: the important thing for preprocessing the input image is the boundary alignment, so that the width and height of the image can be downsampled without influence, and the value of the alignment boundary generally coincides with the value of downsampling. For example where the present embodiment downsamples are 1/16, the boundary alignment is 16 units or pixels, or an integer multiple of 16, such as 32,64, etc.
The second step, pixel-level text instance segmentation: the preprocessed image is fed into the trained instance segmentation fully convolutional neural network, which outputs several feature score maps (Score Maps) respectively representing the background, the text regions, the overlapped text regions, and the link information among text region pixels.
Training procedure for the instance segmentation fully convolutional network: first, each sample in the sample set is labeled, the labeled content mainly being the text line region outlines represented by quadrilaterals; no special labeling is needed for the overlapped text line regions, and a label file is thus generated. The label file and the image file are fed into the fully convolutional instance segmentation network for training. In order to complete supervised learning of the overlapped text line regions, during network preprocessing the contour information of the overlapped regions is automatically calculated from the contours of the text line instances in the label file; these contours are then used as the supervision targets for the overlapped regions, and the training task is completed jointly with the targets for the non-overlapping regions. After one round of training is completed, the accuracy of the overall text line detection and segmentation is evaluated on the test set with the trained model, where the accuracy evaluation also covers the overlapped regions. If the expected effect and accuracy index are achieved, the training process may be terminated and the model used for prediction; if not, the training sample size is increased, the structure and training parameters of the fully convolutional instance segmentation network are adjusted if necessary, and the training process is repeated until the evaluated performance meets the requirements.
Possible adjustments of the model structure generally come from two aspects: on one hand, adjustment of the model capacity, with the aim of improving the feature learning capability of the model, including the number of layers of the convolutional neural network, the number of filters in each convolutional layer, the feature map fusion mode, the style of the nonlinear activation function, and the like; on the other hand, adjustment of the generalization capability of the model, such as tuning the regularization term parameters in the network, with the goal of improving performance on the test set (i.e., unlearned samples). Possible adjustments of the training parameters generally include: on one hand, the hyperparameters of the training process, such as the learning rate decay strategy and initial value, the training batch size, and the total number of iterations; on the other hand, the training loss function, including its style and the hyperparameters involved in it.
Taking fig. 3 as an example, the following describes the feature score map of the non-overlapping text line region and the feature score map of the overlapping text line region.
1) The non-overlapping text line region score map, as shown in fig. 4, in which each pixel value represents the confidence that the pixel lies inside a text line region, normalized to the [0, 1] interval.
2) The overlapping text line region score map, as shown in fig. 5, in which each pixel value represents the confidence that the pixel lies inside an overlapped text region, also normalized to the [0, 1] interval.
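The later thresholding of these score maps (steps 31 and 32 of the claimed method) amounts to simple per-pixel binarization. A minimal sketch, in which the threshold value 0.5 and the nested-list representation are illustrative assumptions:

```python
def binarize(score_map, threshold):
    """Binarize a score map (nested lists of confidences in [0, 1]):
    1 marks a text / overlap / positive-link pixel, 0 marks background."""
    return [[1 if v >= threshold else 0 for v in row] for row in score_map]

# toy 2x2 text-region score map
text_scores = [[0.9, 0.2],
               [0.7, 0.1]]
print(binarize(text_scores, 0.5))  # [[1, 0], [1, 0]]
```

The same routine applies to all three score maps, with the first, second, and third thresholds respectively.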
The third step, extracting the text region and overlap region contours: based on connected domain analysis, the contour information of the non-overlapping and overlapping text line regions is obtained by innovatively combining the non-overlapping text line region score map, the overlapping text line region score map, and the score map of inter-pixel link information in the text region.
The fourth step, merging the outlines of the overlapped text line regions into the outlines of the non-overlapping text line regions: this step merges the outline map of each text line instance with the outline map of the overlapped regions, using connected domain analysis and multi-connected-domain merging to generate a complete outline for each text line region. In the connected domain analysis, the adjacency information between pixels is creatively combined with the link information between adjacent pixels predicted by the fully convolutional neural network; that is, two pixels are merged into one connected domain only when they are adjacent and the predicted link information between them is positive. In the multi-connected-domain merging, a merging strategy based on a variable distance threshold is innovatively adopted, and the optimal distance threshold is obtained on a test set by dynamically searching the distance threshold with end-to-end detection accuracy as the criterion. Once the optimal threshold is determined, the merge operation is performed whenever the distance between connected domains is within the threshold. As shown in fig. 6, the optimal threshold search process specifically includes:
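The link-gated connected domain analysis can be sketched with a union-find structure; the function names and the symmetric `max_gap` adjacency window are illustrative assumptions (the disclosure specifies adjacency as a 1-3 pixel difference along the X or Y axis):

```python
def connected_regions(mask, link, max_gap=1):
    """Group positive pixels of a binarized `mask` into connected regions.
    Two pixels are merged only when they are within `max_gap` of each other
    AND the binarized link map marks both as linked, mirroring the rule that
    adjacency alone is not sufficient for merging."""
    h, w = len(mask), len(mask[0])
    parent = {}

    def find(p):  # union-find root lookup with path halving
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    pixels = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    for p in pixels:
        parent[p] = p
    for (y, x) in pixels:
        if not link[y][x]:
            continue
        # visit each unordered neighbor pair once
        for dy in range(0, max_gap + 1):
            for dx in range(-max_gap, max_gap + 1):
                if dy == 0 and dx <= 0:
                    continue
                q = (y + dy, x + dx)
                if q in parent and link[q[0]][q[1]]:
                    union((y, x), q)

    groups = {}
    for p in pixels:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())
```

With `max_gap=1` this reduces to ordinary 8-connectivity gated by the link map; a positive mask pixel whose link prediction is negative stays a singleton region.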
and testing the full convolution example segmentation network through a variable distance threshold test set, setting a distance threshold search range interval, such as [0, 5], and performing optimal distance threshold search to obtain an optimal distance threshold.
The step of setting the distance threshold search range interval to perform the optimal distance threshold search comprises the following steps:
traversing the threshold interval with a step length of 1; applying each threshold to merge the text line outlines; calculating the overall detection and segmentation accuracy on the test set; and storing the maximum accuracy and the corresponding threshold, which is taken as the optimal distance threshold.
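The dynamic threshold search just described can be sketched as a simple sweep; `evaluate` is a hypothetical callback standing in for "merge the contours of the whole test set with threshold t and score end-to-end detection accuracy":

```python
def search_best_threshold(evaluate, lo=0, hi=5, step=1):
    """Sweep merge-distance thresholds over [lo, hi] and keep the one that
    maximizes the end-to-end accuracy reported by `evaluate(t)`."""
    best_t, best_acc = lo, float("-inf")
    t = lo
    while t <= hi:
        acc = evaluate(t)
        if acc > best_acc:
            best_t, best_acc = t, acc
        t += step
    return best_t, best_acc

# toy accuracy curve peaking at t = 2
best_t, best_acc = search_best_threshold(lambda t: 1.0 - abs(t - 2) * 0.1)
print(best_t)  # 2
```

Since the search range is small and the step is 1, exhaustive traversal is cheap; the cost is dominated by the test-set evaluation inside `evaluate`.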
Experiments show that, by combining these two innovative methods, the algorithm achieves high accuracy in merging the overlapped text outlines with the text outlines.
The fifth step, performing quadrilateral fitting on each text line instance region to obtain the circumscribed quadrilateral of the text line, as shown in fig. 7.
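A minimal sketch of the quadrilateral fitting; for simplicity it fits an axis-aligned circumscribing quadrilateral (an actual implementation might instead fit a rotated minimum-area rectangle to skewed lines, which this sketch does not attempt):

```python
def bounding_quad(points):
    """Fit an axis-aligned circumscribing quadrilateral to a region's pixel
    coordinates (x, y), returned clockwise from the top-left corner."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

print(bounding_quad([(3, 1), (7, 2), (5, 6)]))
# [(3, 1), (7, 1), (7, 6), (3, 6)]
```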
Fig. 8 shows some example results of positioning and segmenting overlapped text lines using the technical solution of the present invention. Experiments show that the method can effectively solve the problem of positioning and segmenting overlapped text lines and can complete tasks that traditional methods cannot. In addition, good algorithm performance can be achieved with only a small amount of training data, few training iterations, and simple post-processing; the method therefore has very high practical application value.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.