CN113642401A - Document line segmentation and classification method and system based on deep learning network - Google Patents

Document line segmentation and classification method and system based on deep learning network

Info

Publication number
CN113642401A
CN113642401A (application CN202110790181.2A)
Authority
CN
China
Prior art keywords
deep learning
learning network
network model
line segmentation
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110790181.2A
Other languages
Chinese (zh)
Inventor
汪昕 (Wang Xin)
郭骏 (Guo Jun)
闫科萍 (Yan Keping)
潘正颐 (Pan Zhengyi)
侯大为 (Hou Dawei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Weiyizhi Technology Co Ltd filed Critical Changzhou Weiyizhi Technology Co Ltd
Priority to CN202110790181.2A priority Critical patent/CN113642401A/en
Publication of CN113642401A publication Critical patent/CN113642401A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a document line segmentation and classification method and system based on a deep learning network, comprising the following steps: step S1: establishing a deep learning network model capable of segmenting text; step S2: training the deep learning network model with synthetic text pictures to obtain a trained deep learning network model; step S3: performing line segmentation and classification on documents with the trained deep learning network model. The method combines the sample prior probability with the A-Res algorithm to generate synthetic text, so that the deep learning network model completes its training on synthetic text; it achieves a better line segmentation effect than the histogram method, and its labor cost is lower than that of completing deep model training on manually labeled data.

Description

Document line segmentation and classification method and system based on deep learning network
Technical Field
The invention relates to the technical field of deep learning and computer vision, in particular to a document line segmentation and classification method and system based on a deep learning network, and more particularly to a deep learning network for line segmentation and classification of documents.
Background
Document line detection is an important sub-direction of the OCR field; its task is to locate the upper and lower boundaries of each document line and to label its category. Unlike the general object detection task, the input data has clear regularity. A complete algorithm application flow generally includes: collecting document pictures, annotating the pictures, training a model, and deploying the model.
In the current field of document line detection, a huge neural network structure is generally adopted in order to obtain higher line segmentation accuracy; the bottleneck is that its large number of parameters must be fitted with massive labeled samples. In addition, since the training set is always a subset of the real samples, under the independent and identically distributed assumption of machine learning, manpower must repeatedly be invested in labeling data in order to generalize the model to new unlabeled samples.
Patent document CN112257586A (application number: 202011135858.0) discloses a truth-box selection method, device, storage medium and apparatus in object detection, belonging to the technical field of image processing. The method comprises: acquiring a target feature map obtained after feature extraction on an image, the target feature map comprising a plurality of grids of a preset size; acquiring, in the target feature map, a plurality of detection boxes corresponding to each small target object in the image; for each detection box, calculating the centrality score of the grids whose predetermined points lie inside the detection box, the predetermined points being corner points and/or center points of the grids; and, for each small target object, determining the detection box with the maximum centrality score among its detection boxes as the truth box of that object. In contrast to that patent, the application object of the present invention is the line segmentation and classification of documents, whose particularity is that line segmentation only considers the split position in the vertical direction. This particularity means that the detection-box scheme of that patent cannot be transferred to the present invention, because a detection-box scheme considers split positions in both the horizontal and vertical directions. For document line segmentation and classification, the present invention is specially designed as follows: the square root is removed, and the horizontal-direction term is set to 1, which ensures that the centrality calculation is not affected by the different task.
Traditional modeling generally takes one of two forms: two-stage and one-stage. The two-stage approach first completes text line segmentation with a horizontal projection histogram algorithm and then completes text line classification with a statistical model or a deep model; its drawback is that the horizontal projection histogram is easily disturbed by noise, so the result is not robust. The one-stage approach labels the whole image and then uses a single deep model to complete segmentation and classification of the different text lines on one image; classical methods include Fast R-CNN, R-FCN, YOLO and FCOS.
For the document line segmentation and classification task, traditional two-stage modeling splits the task into two subtasks. Its weakness is that the horizontal projection histogram algorithm is easily disturbed by noise and produces low-confidence results. Fig. 1 shows the visualization result of the horizontal projection histogram algorithm on a line segmentation task.
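To make the noise sensitivity of that histogram baseline concrete, the following is a minimal Python sketch of a horizontal-projection line splitter; it is not part of the claimed invention, and the function name split_lines_by_projection and the noise_ratio threshold are illustrative assumptions.

```python
import numpy as np

def split_lines_by_projection(binary_page: np.ndarray, noise_ratio: float = 0.05):
    """Split a binarized page (text pixels = 1) into (top, bottom) line bands.
    Rows whose ink count stays below a noise threshold are treated as gaps;
    noise such as smudges or bleed-through shifts the profile and can merge
    two lines or split one line, which is the weakness described above.
    """
    profile = binary_page.sum(axis=1)            # horizontal projection histogram
    threshold = noise_ratio * profile.max()      # illustrative noise threshold
    is_text = profile > threshold

    bands, start = [], None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y                            # a text line starts here
        elif not flag and start is not None:
            bands.append((start, y))             # line ends at the first gap row
            start = None
    if start is not None:
        bands.append((start, len(is_text)))
    return bands
```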
Traditional one-stage modeling fully adopts deep model training and end-to-end modeling; the general process is: collecting data, labeling data, training the model, and running model prediction. However, a deep model contains a large number of parameters to fit, so a large amount of data must be annotated, and the cost of purely manual annotation is very high. In addition, machine learning assumes that data are independent and identically distributed, which means data of various distributions must be labeled to meet the generalization requirement; this again increases the labeling cost.
The problem to be solved here is how to ensure the accuracy of model prediction while minimizing the labeling cost. The approach adopted by the invention is to use synthetic text as the training set, so the problem becomes how to ensure the generalization performance of a model trained on synthetic text; the specific solution is introduced below.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a document line segmentation and classification method and system based on a deep learning network.
The invention provides a document line segmentation and classification method based on a deep learning network, which comprises the following steps:
step S1: establishing a deep learning network model capable of segmenting texts;
step S2: training a deep learning network model by utilizing the synthetic text picture to obtain a trained deep learning network model;
step S3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
Preferably, the deep learning network model comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
Preferably, in the deep learning network model: a picture of a preset size is dimension-reduced through a preset number of Convolution layers with ReLU activation functions and BN layers; the dimension-reduced feature map is input, with rows as time steps, into a preset number of BiLSTM layers, and the output of the last time step is retained; the output feature map is reshaped into a feature map of a preset size, features are extracted through a preset number of Convolution layers to obtain an extracted feature map, and the extracted feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs respectively.
Preferably, the centrality is taken as:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))
where t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
Preferably, generating the synthetic text picture comprises: obtaining, through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing; and generating, with the A-Res algorithm combined with these prior probabilities, synthetic text pictures whose distribution is consistent with that of real documents.
The invention provides a document line segmentation and classification system based on a deep learning network, which comprises the following modules:
module M1: establishing a deep learning network model capable of segmenting texts;
module M2: training a deep learning network model by utilizing the synthetic text picture to obtain a trained deep learning network model;
module M3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
Preferably, the deep learning network model comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
Preferably, in the deep learning network model: a picture of a preset size is dimension-reduced through a preset number of Convolution layers with ReLU activation functions and BN layers; the dimension-reduced feature map is input, with rows as time steps, into a preset number of BiLSTM layers, and the output of the last time step is retained; the output feature map is reshaped into a feature map of a preset size, features are extracted through a preset number of Convolution layers to obtain an extracted feature map, and the extracted feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs respectively.
Preferably, the centrality is taken as:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))
where t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
Preferably, generating the synthetic text picture comprises: obtaining, through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing; and generating, with the A-Res algorithm combined with these prior probabilities, synthetic text pictures whose distribution is consistent with that of real documents.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a novel deep learning network model structure that introduces the idea of centrality into the text line detection task and ensures segmentation accuracy;
2. Synthetic text is generated with the A-Res algorithm combined with the sample prior probability, so the deep learning network model completes its training on synthetic text; this achieves a better line segmentation effect than the histogram method and a lower labor cost than completing deep model training on manually labeled data;
3. By providing a lightweight, end-to-end deep text detection network, the invention achieves 3 FPS on a Tesla K80 GPU for 1024 × 724 picture input.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram illustrating the line segmentation effect of the horizontal projection histogram algorithm.
Fig. 2 is a schematic diagram of the line segmentation effect of the depth model according to the present invention.
Fig. 3 is a schematic structural diagram of a deep learning network model.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications obvious to those skilled in the art can be made without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
Example 1
The invention provides a document line segmentation and classification method based on a deep learning network, which comprises the following steps:
step S1: establishing a deep learning network model capable of segmenting text;
step S2: training the deep learning network model with synthetic text pictures to obtain a trained deep learning network model;
step S3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
Specifically, the deep learning network model in step S1 comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
Specifically, the deep learning network model in step S1 works as follows. As shown in Fig. 3, a grayscale picture of size 1024 × 724 is passed through two Convolution layers with ReLU activation functions and one BN layer; the purpose of this step is dimensionality reduction, because the original picture is too large to be fed into the BiLSTM layers directly. The resulting 4 × 256 × 181 feature map is then input into two BiLSTM layers with rows as time steps, and the output of the last time step is retained. The output 1024 × 1024 feature map is reshaped into a 16 × 256 feature map, features are extracted through 4 Convolution layers to obtain a 256 × 8 feature map, and this feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs.
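For illustration only, the following is a minimal PyTorch sketch of the topology just described. The patent's structural figure is not reproduced here, so the channel counts, BiLSTM hidden size, head shapes and the class name LineSegClassNet are assumptions rather than the exact claimed configuration.

```python
import torch
import torch.nn as nn

class LineSegClassNet(nn.Module):
    """Illustrative sketch: Conv+ReLU and BN for down-sampling, a 2-layer BiLSTM
    over image rows as time steps, Conv feature extraction, and three
    fully-connected heads for localization, classification and centrality."""

    def __init__(self, in_h=1024, in_w=724, channels=4, lstm_hidden=512, num_classes=4):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels),
        )
        feat_w = in_w // 4                                   # 724 -> 181 columns
        self.rows = in_h // 4                                # 1024 -> 256 rows
        # each row of the down-sampled feature map is one BiLSTM time step
        self.rnn = nn.LSTM(channels * feat_w, lstm_hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.conv_head = nn.Sequential(
            nn.Conv1d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        feat_dim = 64 * (2 * lstm_hidden // 4)               # after two stride-2 convs
        self.loc_head = nn.Linear(feat_dim, 2 * self.rows)   # top/bottom offset per row
        self.cls_head = nn.Linear(feat_dim, num_classes * self.rows)
        self.ctr_head = nn.Linear(feat_dim, self.rows)       # centrality per row

    def forward(self, x):                                    # x: (B, 1, 1024, 724)
        f = self.down(x)                                     # (B, C, 256, 181)
        b, c, h, w = f.shape
        seq = f.permute(0, 2, 1, 3).reshape(b, h, c * w)     # rows as time steps
        out, _ = self.rnn(seq)                               # (B, 256, 2 * hidden)
        last = out[:, -1, :].unsqueeze(1)                    # keep last time step only
        feat = self.conv_head(last).flatten(1)
        return self.loc_head(feat), self.cls_head(feat), self.ctr_head(feat)

# Usage sketch: loc, cls, ctr = LineSegClassNet()(torch.zeros(1, 1, 1024, 724))
```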
In particular, a common text detection model requires only two outputs, localization and classification; here a centrality branch is added. The idea of a centrality branch was first proposed in the FCOS paper: centrality reflects how far a pixel is from the center of the Ground Truth and is ultimately used as the weight of the localization loss. This greatly enhances the segmentation precision of the model. The original formula is as follows:
centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )    (1)
here, |*,t*,r*And b*The position of a target detection box is uniquely determined, and the text line segmentation task is introduced into the text line segmentation task of the invention, only proper simplification needs to be carried out, and a simplified centrality calculation formula is shown below.
centrality = min(t*, b*) / max(t*, b*)    (2)
Here t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
The application object of the invention is document line segmentation and classification, whose particularity is that line segmentation only considers the split position in the vertical direction. For this particularity, the square root is removed from the existing centrality calculation and the horizontal-direction term is set to 1, which ensures that the centrality calculation is not affected by the different task.
In experiments, formula (2) cannot distinguish the weights of different positions well, so, drawing on the idea of the softmax classifier, t* and b* in formula (2) are replaced by exp(t*) and exp(b*), which strengthens the influence of the central point on the final prediction. Experiments prove that this is effective in our scenario. The modified formula is as follows:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))    (3)
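As a small worked example of formula (3), the sketch below computes the centrality weight in Python; the helper name centrality is an assumption. Note that min(exp(t*), exp(b*)) / max(exp(t*), exp(b*)) simplifies to exp(-|t* - b*|), which avoids overflow of exp() for large pixel distances.

```python
import math

def centrality(t_star: float, b_star: float) -> float:
    """Centrality weight of a pixel from its distances to the top (t*) and
    bottom (b*) of the Ground Truth line box, per the modified formula (3):
        min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))
    computed here in its equivalent, overflow-free form exp(-|t* - b*|).
    """
    return math.exp(-abs(t_star - b_star))

# A pixel exactly between top and bottom (t* == b*) gets weight 1.0;
# a pixel 3 px from the top and 5 px from the bottom gets exp(-2), about 0.135.
```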
Specifically, synthesizing the text pictures in step S2 proceeds as follows. The synthesized text must be as close to real text as possible. The font, character size, paragraph spacing and line spacing may differ between documents, and the occurrence of these document elements follows certain probability distributions. Through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing, is obtained; the A-Res algorithm is then used in combination with these prior probabilities to generate synthetic text pictures whose distribution is consistent with that of real documents, and the later experimental results prove the method effective.
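For illustration, the sketch below shows how A-Res (the weighted reservoir sampling algorithm of Efraimidis and Spirakis) can draw document elements according to such measured priors; the element names and prior values in the example are hypothetical and are not statistics disclosed by the patent.

```python
import heapq
import random

def a_res_sample(weighted_items, k):
    """A-Res weighted reservoir sampling: each item with weight w draws a key
    u ** (1 / w) with u ~ Uniform(0, 1); the k items with the largest keys are
    kept, so heavier items are proportionally more likely to be selected."""
    heap = []  # min-heap holding the k largest (key, item) pairs seen so far
    for item, weight in weighted_items:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Hypothetical prior over fonts measured offline; analogous priors would be
# sampled for character size, paragraph spacing and line spacing.
font_prior = [("SimSun", 0.55), ("SimHei", 0.25), ("KaiTi", 0.15), ("FangSong", 0.05)]
fonts_for_page = a_res_sample(font_prior, k=2)
```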
The invention provides a document line segmentation and classification system based on a deep learning network, which comprises the following modules:
module M1: establishing a deep learning network model capable of segmenting texts;
module M2: training a deep learning network model by utilizing the synthetic text picture to obtain a trained deep learning network model;
module M3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
Specifically, the deep learning network model in module M1 comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
Specifically, the deep learning network model in module M1 works as follows. As shown in Fig. 3, a grayscale picture of size 1024 × 724 is passed through two Convolution layers with ReLU activation functions and one BN layer; the purpose of this step is dimensionality reduction, because the original picture is too large to be fed into the BiLSTM layers directly. The resulting 4 × 256 × 181 feature map is then input into two BiLSTM layers with rows as time steps, and the output of the last time step is retained. The output 1024 × 1024 feature map is reshaped into a 16 × 256 feature map, features are extracted through 4 Convolution layers to obtain a 256 × 8 feature map, and this feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs.
In particular, a common text detection model requires only two outputs, localization and classification; here a centrality branch is added. The idea of a centrality branch was first proposed in the FCOS paper: centrality reflects how far a pixel is from the center of the Ground Truth and is ultimately used as the weight of the localization loss. This greatly enhances the segmentation precision of the model. The original formula is as follows:
centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )    (1)
here, |*,t*,r*And b*The position of a target detection box is uniquely determined, and the text line segmentation task is introduced into the text line segmentation task of the invention, only proper simplification needs to be carried out, and a simplified centrality calculation formula is shown below.
centrality = min(t*, b*) / max(t*, b*)    (2)
Here t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
The application object of the invention is document line segmentation and classification, whose particularity is that line segmentation only considers the split position in the vertical direction. For this particularity, the square root is removed from the existing centrality calculation and the horizontal-direction term is set to 1, which ensures that the centrality calculation is not affected by the different task.
In experiments, formula (2) cannot distinguish the weights of different positions well, so, drawing on the idea of the softmax classifier, t* and b* in formula (2) are replaced by exp(t*) and exp(b*), which strengthens the influence of the central point on the final prediction. Experiments prove that this is effective in our scenario. The modified formula is as follows:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))    (3)
Specifically, synthesizing the text pictures in module M2 proceeds as follows. The synthesized text must be as close to real text as possible. The font, character size, paragraph spacing and line spacing may differ between documents, and the occurrence of these document elements follows certain probability distributions. Through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing, is obtained; the A-Res algorithm is then used in combination with these prior probabilities to generate synthetic text pictures whose distribution is consistent with that of real documents, and the later experimental results prove the method effective.
In the experiments, two English novels are used as real samples. As shown in Fig. 2 and Tables 1 and 2, the trained deep learning network model achieves 98.1% line segmentation accuracy and 99.5% classification accuracy on Pride and Prejudice, and 98.5% line segmentation accuracy and 99.7% classification accuracy on The Secret of Plato's Atlantis.
Table 1: comparison of precision on Pride and Prejudge by different methods
(Table 1 appears only as an image in the original publication; its content is not reproduced here.)
Table 2: comparison Of The precision Of The different methods on The Secret Of Plato's Atlantis
(Table 2 appears only as an image in the original publication; its content is not reproduced here.)
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A document line segmentation and classification method based on a deep learning network is characterized by comprising the following steps:
step S1: establishing a deep learning network model capable of segmenting texts;
step S2: training a deep learning network model by utilizing the synthetic text picture to obtain a trained deep learning network model;
step S3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
2. The deep learning network-based document line segmentation and classification method according to claim 1, wherein the deep learning network model comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
3. The deep learning network-based document line segmentation and classification method according to claim 2, wherein in the deep learning network model: a picture of a preset size is dimension-reduced through a preset number of Convolution layers with ReLU activation functions and BN layers; the dimension-reduced feature map is input, with rows as time steps, into a preset number of BiLSTM layers, and the output of the last time step is retained; the output feature map is reshaped into a feature map of a preset size, features are extracted through a preset number of Convolution layers to obtain an extracted feature map, and the extracted feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs respectively.
4. The deep learning network-based document line segmentation and classification method according to claim 3, wherein the centrality is as follows:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))
where t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
5. The deep learning network-based document line segmentation and classification method according to claim 1, wherein generating the synthetic text picture comprises: obtaining, through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing; and generating, with the A-Res algorithm combined with these prior probabilities, synthetic text pictures whose distribution is consistent with that of real documents.
6. A system for document line segmentation and classification based on a deep learning network, comprising:
module M1: establishing a deep learning network model capable of segmenting texts;
module M2: training a deep learning network model by utilizing the synthetic text picture to obtain a trained deep learning network model;
module M3: and performing line segmentation and classification on the documents by using the trained deep learning network model.
7. The deep learning network-based document line segmentation and classification system of claim 6, wherein the deep learning network model comprises: a preset number of Convolution layers with ReLU activation functions, a preset number of BN layers, a preset number of BiLSTM layers, a preset number of Convolution layers, and a preset number of fully-connected layer branches.
8. The deep learning network-based document line segmentation and classification system of claim 7, wherein in the deep learning network model: a picture of a preset size is dimension-reduced through a preset number of Convolution layers with ReLU activation functions and BN layers; the dimension-reduced feature map is input, with rows as time steps, into a preset number of BiLSTM layers, and the output of the last time step is retained; the output feature map is reshaped into a feature map of a preset size, features are extracted through a preset number of Convolution layers to obtain an extracted feature map, and the extracted feature map is passed through three fully-connected layer branches to generate the localization, classification and centrality outputs respectively.
9. The deep learning network-based document line segmentation and classification system according to claim 8, wherein the centrality is computed as:
centrality = min(exp(t*), exp(b*)) / max(exp(t*), exp(b*))
where t* represents the distance from the pixel point to the top of the Ground Truth box, and b* represents the distance from the pixel point to the bottom of the Ground Truth box.
10. The deep learning network-based document line segmentation and classification system of claim 6, wherein generating the synthetic text picture comprises: obtaining, through offline statistics, the probability distribution of each element in different documents, including font, character size, paragraph spacing and line spacing; and generating, with the A-Res algorithm combined with these prior probabilities, synthetic text pictures whose distribution is consistent with that of real documents.
CN202110790181.2A 2021-07-13 2021-07-13 Document line segmentation and classification method and system based on deep learning network Pending CN113642401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110790181.2A CN113642401A (en) 2021-07-13 2021-07-13 Document line segmentation and classification method and system based on deep learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110790181.2A CN113642401A (en) 2021-07-13 2021-07-13 Document line segmentation and classification method and system based on deep learning network

Publications (1)

Publication Number Publication Date
CN113642401A true CN113642401A (en) 2021-11-12

Family

ID=78417193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110790181.2A Pending CN113642401A (en) 2021-07-13 2021-07-13 Document line segmentation and classification method and system based on deep learning network

Country Status (1)

Country Link
CN (1) CN113642401A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016547A (en) * 2020-08-20 2020-12-01 上海天壤智能科技有限公司 Image character recognition method, system and medium based on deep learning
CN113065396A (en) * 2021-03-02 2021-07-02 国网湖北省电力有限公司 Automatic filing processing system and method for scanned archive image based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016547A (en) * 2020-08-20 2020-12-01 上海天壤智能科技有限公司 Image character recognition method, system and medium based on deep learning
CN113065396A (en) * 2021-03-02 2021-07-02 国网湖北省电力有限公司 Automatic filing processing system and method for scanned archive image based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIN WANG ET AL: "A Model for Text Line Segmentation and Classification in Printed Documents", ICMLC 2021: 2021 13th International Conference on Machine Learning and Computing *
LI XIAOSHUANG ET AL: "Multimodal Music Emotion Classification Based on Optimized Residual Network", Computer and Modernization *
GUO XIAOYONG: "Research on News Recommendation Algorithm Based on Deep Learning", China Master's Theses Full-text Database (Information Science and Technology) *

Similar Documents

Publication Publication Date Title
Dong et al. PGA-Net: Pyramid feature fusion and global context attention network for automated surface defect detection
Zhang et al. Research on face detection technology based on MTCNN
CN107358262B (en) High-resolution image classification method and classification device
US8379994B2 (en) Digital image analysis utilizing multiple human labels
CN111401371A (en) Text detection and identification method and system and computer equipment
CN108764242A (en) Off-line Chinese Character discrimination body recognition methods based on deep layer convolutional neural networks
CN111832573B (en) Image emotion classification method based on class activation mapping and visual saliency
CN111967313A (en) Unmanned aerial vehicle image annotation method assisted by deep learning target detection algorithm
CN109886159B (en) Face detection method under non-limited condition
CN111951154B (en) Picture generation method and device containing background and medium
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113689436A (en) Image semantic segmentation method, device, equipment and storage medium
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
CN112508000B (en) Method and equipment for generating OCR image recognition model training data
CN107918759A (en) Automatic segmentation recognition method, electronic equipment and the storage medium of indoor object
CN116778164A (en) Semantic segmentation method for improving deep V < 3+ > network based on multi-scale structure
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN113642401A (en) Document line segmentation and classification method and system based on deep learning network
CN108898188A (en) A kind of image data set aid mark system and method
CN116912872A (en) Drawing identification method, device, equipment and readable storage medium
CN114926851A (en) Method, system and storage medium for identifying table structure in table picture
CN113191942A (en) Method for generating image, method for training human detection model, program, and device
CN112926670A (en) Garbage classification system and method based on transfer learning
CN112395834A (en) Brain graph generation method, device and equipment based on picture input and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20211112