CN111881768A - Document layout analysis method - Google Patents

Document layout analysis method

Info

Publication number
CN111881768A
CN111881768A (application CN202010637093.4A)
Authority
CN
China
Prior art keywords
features
resolution
image
layer
layout
Prior art date
Legal status
Pending
Application number
CN202010637093.4A
Other languages
Chinese (zh)
Inventor
王波
张百灵
周炬
朱华柏
Current Assignee
Auntec Co ltd
Original Assignee
Auntec Co ltd
Priority date
Filing date
Publication date
Application filed by Auntec Co ltd filed Critical Auntec Co ltd
Priority to CN202010637093.4A priority Critical patent/CN111881768A/en
Publication of CN111881768A publication Critical patent/CN111881768A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention discloses a document layout analysis method comprising: scaling an input layout image into images of 3 scales; extracting and fusing features from the images of all scales; sending the fused image features into a segmentation network backbone to extract semantic information features; up-sampling the high-level, low-resolution features rich in semantic information and fusing them with the low-level, high-resolution features rich in spatial detail information; and setting corresponding segmentation network branches for segmentation and recognition according to the attributes of different layout elements, while restoring the output feature map to a pre-specified resolution to complete the document layout analysis. With this technical scheme, multi-scale input images can be fused, which increases the adaptability of the segmentation network to input images of different scales and reduces the influence of input-image scaling on the model; adding different segmentation network branches for different attributes of layout elements reduces the mutual interference of different layout elements.

Description

Document layout analysis method
Technical Field
The invention relates to the technical field of optical character recognition, in particular to a document layout analysis method.
Background
Layout analysis is one of the basic steps of an optical character recognition (OCR) system: it is the process of analyzing, recognizing, and understanding the image, text, and table features in a document's layout and their positional relationships. The quality of the layout analysis result directly affects the performance of downstream OCR modules, and with the development of deep learning, deep-learning-based document layout analysis systems have gradually become the mainstream approach.
Image semantic segmentation offers pixel-level recognition and localization, which makes it well suited to the document layout analysis task. Characters, however, are sparse, non-rigid structures with large scale variation, complex shapes, a wide variety of types, and extremely rich semantic information. Compared with image processing for general objects, a document layout is therefore more sensitive to image scaling: improper scaling seriously deforms and blurs characters and can even destroy the semantic information they carry. For these reasons, semantic-segmentation-based document layout analysis requires a relatively high resolution for both the input image and the output feature map to guarantee accuracy. High-resolution document-image layout analysis, however, increases not only the complexity of the deep neural network model but also its computational load and GPU memory requirements.
On the other hand, document layout structure is very complex, and in most documents different layout elements are nested and cross-overlapped: a complex image serves as the page background for text, a table contains images, handwritten and printed fonts are mixed, and pages contain faint watermarks, seals, and character icons. Yet text data is mostly labeled following the conventions of general object detection, with large rectangular block annotations. Although this labeling is simple and cheap, it is ill-suited to data for image semantic segmentation and lowers the accuracy of model training. The usual alternative of annotating semantic-segmentation data with polygons greatly increases labeling cost, and a pixel can still carry only one label, so the problem of cross-overlapping layout elements remains unsolved. All of these issues ultimately cause layout elements to interfere with one another, yielding low accuracy and cluttered, fragmented, irregular segmentation of the layout.
Disclosure of Invention
To solve the problems in the related art, embodiments of the present invention provide a document layout analysis method that can fuse multi-scale input images, increase the adaptability of a segmentation network to input images of different scales, reduce the influence of input-image scaling on the model, add different segmentation network branches for different attributes of layout elements, and reduce the mutual interference of different layout elements.
The embodiment of the invention provides a document layout analysis method, which comprises the following steps:
scaling the input layout image into images of 3 scales;
extracting and fusing the features of the images of all scales;
sending the fused image features to a segmentation network backbone for extracting semantic information features;
up-sampling the high-level, low-resolution features rich in semantic information, and fusing the up-sampled features with the low-level, high-resolution features rich in spatial detail information;
and setting corresponding segmentation network branches for segmentation and recognition according to the attributes of different layout elements, while restoring the output feature map to a pre-specified resolution to complete the document layout analysis.
The scaling of the input layout image into images of 3 scales further comprises the following step:
the input layout image is subjected to scaling operations of 2 times and 0.5 times, yielding images of 3 scales.
The extracting and fusing of the features of the multi-scale layout images further comprises the following steps:
the 2×-scale layout image is down-sampled by a 3×3 convolutional layer with 16 output feature channels and stride 2;
feature-vector splicing is performed with the 3×3 convolutional features (32 output feature channels, stride 1) of the original-scale layout image;
a first feature fusion is performed with one 3×3 convolutional layer with 64 output feature channels and stride 1;
down-sampling is performed with one 3×3 convolutional layer with 64 output feature channels and stride 2;
feature-vector splicing is performed with the 3×3 convolutional features (16 output feature channels, stride 1) of the 0.5×-scale layout image;
a second feature fusion is performed with one 3×3 convolutional layer with 64 output feature channels and stride 1;
down-sampling is performed with one 3×3 convolutional layer with 64 output feature channels and stride 2.
Further, when the fused image features are sent to the segmentation network backbone, the resolution is 1/4 of the original-scale layout image resolution and the number of output feature channels is 64.
Further, the backbone of the segmentation network is a residual network; a dense atrous spatial pyramid pooling (DenseASPP) module is used at the top of the residual network to extract the convolutional features of the multi-scale layout image, and after extraction the number of output feature channels is 256 and the resolution is 1/32 of the original-scale layout image resolution.
The up-sampling of the high-level, low-resolution features rich in semantic information, and their fusion with the low-level, high-resolution features rich in spatial detail information, further comprises the following steps:
8× bilinear-interpolation up-sampling is applied to the high-level, low-resolution features rich in semantic information, while the low-level, high-resolution features are smoothed and channel-reduced by a 1×1 convolutional layer with 32 output feature channels and stride 1;
the up-sampled high-level features and the low-level features rich in spatial detail information are fused by feature-vector splicing followed by one 3×3 convolutional layer; after fusion the number of output feature channels is 320 and the resolution is 1/4 of the original-scale layout image resolution;
3 convolutional layers with 64 output feature channels and stride 1 are then used as the heads of 3 different segmentation network branches to extract features belonging to different object attributes;
bilinear interpolation then up-samples the features to the pre-specified resolution;
finally, one convolutional layer with 64 output feature channels and stride 1, and one convolutional layer with stride 1 whose channel number equals the number of segmentation recognition classes of the branch, are used as the top recognition structure of the segmentation network.
Further, every convolutional layer is followed by a batch-normalization (BN) layer and a ReLU activation layer.
Further, the high-level features have the same resolution as the low-level features after being upsampled.
Further, each segmentation network branch uses 1 convolutional layer for feature extraction and channel dimension reduction, up-samples to a pre-specified resolution using bilinear interpolation, and uses one 3×3 convolutional layer and one 1×1 convolutional layer as the top recognition structure of the segmentation network.
Further, the number of segmentation recognition classes of each of the three segmentation network branches is 2.
The technical scheme provided by the embodiments of the invention has the following beneficial effects: fusing input images of multiple scales increases the adaptability of the segmentation network to input images of different scales and reduces the influence of input-image scaling on the model; adding different segmentation network branches for different attributes of layout elements reduces the mutual interference of different elements, makes cross-overlapping elements easier to segment, and gives the network the ability to recognize elements with multiple class labels; it also facilitates post-processing of the segmentation results.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart of document layout analysis in an embodiment of the present invention.
FIG. 2 is a flowchart of feature extraction and fusion for an image according to an embodiment of the present invention.
FIG. 3 is a flow chart of the fusion of high-level features with low-level features in an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the invention, as recited in the appended claims.
The technical scheme of the invention addresses three problems: complex document layouts are very sensitive to image scaling; higher-resolution input images and output features are needed to retain more detail information; and the labeling conventions of layout data cause serious mutual interference between different layout elements as well as cluttered, fragmented segmentation of the layout. To this end, it proposes MLSNet, a multi-task layout segmentation network for multi-scale input images.
FIG. 1 is a flowchart of document layout analysis in an embodiment of the present invention. As shown in fig. 1, the document layout analysis process includes the following steps:
step 10, firstly, the same input layout image is zoomed into images with 3 scales.
Specifically, the step is to specify the sizes of an input layout image and an output feature image, and then perform scaling operations of 2 times and 0.5 times on the input layout image respectively. For example, the input RGB image has 3 dimensions such as 1536 × 2048, 768 × 1024, 384 × 512, and the output feature image has a size of 1024 × 1536.
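As a quick sanity check (an illustrative sketch, not part of the patent; the helper name is hypothetical), the three scales in the example follow directly from applying the 2× and 0.5× scaling operations to a 768 × 1024 original:

```python
# Hypothetical helper reproducing the example's 3 input scales: the original
# scale is 768x1024; scaling by 2 and 0.5 yields 1536x2048 and 384x512.

def multiscale_sizes(width, height, factors=(2.0, 1.0, 0.5)):
    """Return the (width, height) of each scaled copy of the layout image."""
    return [(int(width * f), int(height * f)) for f in factors]

print(multiscale_sizes(768, 1024))
# [(1536, 2048), (768, 1024), (384, 512)]
```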
And 11, extracting and fusing the features of the images of all scales.
As shown in fig. 2, the present step further comprises the steps of:
step 111, the input 1536 × 2048-scale layout image is first down-sampled by a 3 × 3 convolution layer with an output characteristic channel number of 16 and a step size of 2(stride 2).
And step 112, performing feature vector splicing with the input 3 × 3 convolution features of the layout image with 768 × 1024 scales, wherein the output feature channel number is 32 and the step length is 1(stride is 1).
Step 113 is followed by performing a first feature fusion using 13 × 3 convolutional layer with an output feature channel number of 64 and a step size of 1(stride 1).
Step 114, the downsampling is performed again using 13 × 3 convolutional layer with the output eigen channel number of 64 and the step size of 2(stride 2).
And step 115, performing feature vector splicing with the input 3 × 3 convolution features of the layout image with 384 × 512 scales, wherein the number of output feature channels of the layout image is 16 and the step size is 1(stride is 1).
And step 116, finally, performing second-time feature fusion by using 13 × 3 convolutional layer with the output feature channel number of 64 and the step size of 1(stride 1).
And step 117, performing downsampling by using 13 × 3 convolutional layer with the output characteristic channel number of 64 and the step size of 2(stride is 2).
After this feature extraction and fusion, when the image features are sent into the segmentation network backbone, their resolution is 1/4 of the original-scale layout image resolution (768 × 1024) and the number of output feature channels is 64, so a relatively high resolution is retained.
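The stated geometry of steps 111-117 can be checked with simple shape arithmetic (an illustrative sketch; the function and variable names are not from the patent). Each stride-2 3×3 convolution with padding 1 halves the spatial size, and each concatenation sums channel counts before a fusion convolution resets them to 64:

```python
# Shape bookkeeping for the multi-scale fusion stem (steps 111-117).

def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution layer (floor-division formula)."""
    return (size + 2 * padding - kernel) // stride + 1

w, h = 1536, 2048                                      # 2x-scale input
w, h = conv_out(w, stride=2), conv_out(h, stride=2)    # step 111 -> 768x1024
ch = 16 + 32        # step 112: concat 16-ch features with 32-ch features
ch = 64             # step 113: first fusion conv outputs 64 channels
w, h = conv_out(w, stride=2), conv_out(h, stride=2)    # step 114 -> 384x512
ch = 64 + 16        # step 115: concat with 0.5x-scale features (16 ch)
ch = 64             # step 116: second fusion conv outputs 64 channels
w, h = conv_out(w, stride=2), conv_out(h, stride=2)    # step 117 -> 192x256
print(w, h, ch)     # 192 256 64: 1/4 of the original 768x1024, as stated
```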
Step 12, the fused image features are sent into the segmentation network backbone to extract semantic information features.
In this embodiment, the backbone of the segmentation network is a residual network (ResNet-50); a dense atrous spatial pyramid pooling module (DenseASPP) is used at the top of the residual network to extract the convolutional features of the multi-scale layout image, and after extraction the number of output feature channels is 256 and the resolution is 1/32 of the original-scale layout image resolution (768 × 1024).
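Continuing the shape bookkeeping (again an illustrative sketch, not the patent's implementation), the backbone output geometry follows from a further 8× downsampling of the 1/4-resolution stem output, which is how ResNet-style backbones typically behave:

```python
# From 1/4 resolution (stem output) to 1/32 resolution (backbone output):
# an additional 8x spatial reduction inside the ResNet-50 backbone.

stem_w, stem_h = 768 // 4, 1024 // 4           # (192, 256), 64 channels in
backbone_w, backbone_h = stem_w // 8, stem_h // 8
backbone_channels = 256                         # stated DenseASPP output width
print(backbone_w, backbone_h)  # 24 32, i.e. 768/32 x 1024/32
```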
Step 13, the high-level features rich in semantic information are up-sampled so that, after up-sampling, they have the same resolution as the low-level features; the up-sampled high-level features are then fused with the low-level features rich in spatial detail information by feature-vector splicing followed by one 3×3 convolutional layer.
As shown in fig. 3, the fusion process includes the following steps:
step 131, 8 times of bilinear interpolation upsampling is performed on the high-layer low-resolution feature of the high semantic information, and meanwhile, feature smoothing and channel dimensionality reduction are performed on the low-layer high-resolution feature through a 1 × 1 convolutional layer with an output feature channel number of 32 and a step length of 1(stride ═ 1).
And step 132, in the process of fusing with the low-level high-resolution features with rich space detail information, fusing the up-sampled high-level features and the low-level features by using a feature vector splicing mode and 13 × 3 convolutional layer, wherein the number of output feature channels after fusion is 320, and the resolution is 1/4 of the resolution (768 × 1024) of the original-scale layout image.
Step 133 then extracts features belonging to different object attributes using 3 × 3 or 5 × 5 convolutional layers with an output feature channel number of 64 and a step size of 1(stride 1), respectively, as headers of 3 different split network branches.
Step 134, next, sample bilinear interpolation upsamples the resolution of the feature to a pre-specified resolution (1024 x 1536).
Step 135, finally, using 1 convolution layer with output characteristic channel number of 64 and step size of 1(stride 1) and 1 channel number as the division identification category number of the division network branch and step size of 1
The 1 × 1 convolution layer of (stride 1) is used as the top identification structure of the split network.
Every convolutional layer is followed by a batch-normalization (BN) layer and a ReLU activation layer.
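Bilinear interpolation, used throughout the decoder for up-sampling, can be sketched minimally as follows (a NumPy illustration assuming the common align_corners=False convention; this is not the patent's implementation):

```python
import numpy as np

# Minimal 2D bilinear up-sampling by an integer factor, standing in for the
# 8x interpolation that brings 1/32-resolution features up to 1/4 resolution.

def bilinear_upsample(x, factor):
    """Upsample a 2D array by an integer factor with bilinear interpolation."""
    h, w = x.shape
    ys = (np.arange(h * factor) + 0.5) / factor - 0.5   # source row coords
    xs = (np.arange(w * factor) + 0.5) / factor - 0.5   # source col coords
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]                # row blend weights
    wx = np.clip(xs - x0, 0, 1)[None, :]                # column blend weights
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

feat = np.arange(12.0).reshape(3, 4)   # toy 3x4 "feature map"
up = bilinear_upsample(feat, 8)
print(up.shape)  # (24, 32)
```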
Step 14, finally, corresponding segmentation network branches are set up for segmentation and recognition according to the attributes of different layout elements, and in the process the output feature map is restored to the pre-specified resolution (1024 × 1536), completing the document layout analysis.
To reduce GPU memory consumption, each segmentation network branch uses 1 convolutional layer for feature extraction and channel dimension reduction, then up-samples to the pre-specified resolution (1024 × 1536) by bilinear interpolation, and uses one 3×3 convolutional layer and one 1×1 convolutional layer as the top structure of the segmentation network. Owing to the limits of the labeled data categories, the number of segmentation recognition classes of each of the three branches is 2 (C1 = C2 = C3 = 2).
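The benefit of per-branch binary (2-class) outputs over a single multi-class softmax can be illustrated with a toy example: separate branches let one pixel carry several layout labels at once, which is exactly the cross-overlap case described in the background. The branch names below are illustrative, not from the patent:

```python
import numpy as np

# Each branch predicts its own binary foreground mask, so one pixel can be
# "on" in several branches at once (e.g. text printed over an image), which
# a single multi-class softmax (one label per pixel) cannot express.

h, w = 4, 4
masks = {
    "text":  np.zeros((h, w), dtype=bool),
    "image": np.zeros((h, w), dtype=bool),
    "table": np.zeros((h, w), dtype=bool),
}
masks["image"][0:3, 0:3] = True   # image region
masks["text"][1:4, 1:4] = True    # text region overlapping the image

overlap = masks["text"] & masks["image"]
print(int(overlap.sum()))  # 4 pixels legitimately carry both labels
```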
With this embodiment of the invention, fusing input images of multiple scales increases the adaptability of the segmentation network to input images of different scales and reduces the influence of input-image scaling on the model. In addition, different segmentation network branches are added for different attributes of layout elements, which reduces the mutual interference of different elements, makes cross-overlapping elements easier to segment, and gives the network the ability to recognize elements with multiple class labels; it also facilitates post-processing of the segmentation results.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A document layout analysis method is characterized by comprising the following steps:
scaling the input layout image into images of 3 scales;
extracting and fusing the features of the images of all scales;
sending the fused image features to a segmentation network backbone for extracting semantic information features;
up-sampling the high-level, low-resolution features rich in semantic information, and fusing the up-sampled features with the low-level, high-resolution features rich in spatial detail information;
and setting corresponding segmentation network branches for segmentation and recognition according to the attributes of different layout elements, while restoring the output feature map to a pre-specified resolution to complete the document layout analysis.
2. The document layout analysis method of claim 1, wherein the scaling of the input layout image into images of 3 scales further comprises the following step:
the input layout image is subjected to scaling operations of 2 times and 0.5 times, yielding images of 3 scales.
3. The document layout analysis method of claim 2, wherein the extracting and fusing of the features of the multi-scale layout images further comprises the following steps:
the 2×-scale layout image is down-sampled by a 3×3 convolutional layer with 16 output feature channels and stride 2;
feature-vector splicing is performed with the 3×3 convolutional features (32 output feature channels, stride 1) of the original-scale layout image;
a first feature fusion is performed with one 3×3 convolutional layer with 64 output feature channels and stride 1;
down-sampling is performed with one 3×3 convolutional layer with 64 output feature channels and stride 2;
feature-vector splicing is performed with the 3×3 convolutional features (16 output feature channels, stride 1) of the 0.5×-scale layout image;
a second feature fusion is performed with one 3×3 convolutional layer with 64 output feature channels and stride 1;
down-sampling is performed with one 3×3 convolutional layer with 64 output feature channels and stride 2.
4. The document layout analysis method of claim 3, wherein when the fused image features are sent to the segmentation network backbone, the resolution is 1/4 of the original-scale layout image resolution and the number of output feature channels is 64.
5. The document layout analysis method according to any one of claims 1 to 4, wherein the backbone of the segmentation network is a residual network; a dense atrous spatial pyramid pooling (DenseASPP) module is used at the top of the residual network to extract the convolutional features of the multi-scale layout image; and after extraction the number of output feature channels is 256 and the resolution is 1/32 of the original-scale layout image resolution.
6. The document layout analysis method of claim 1, wherein the up-sampling of the high-level, low-resolution features rich in semantic information, and their fusion with the low-level, high-resolution features rich in spatial detail information, further comprises the following steps:
8× bilinear-interpolation up-sampling is applied to the high-level, low-resolution features rich in semantic information, while the low-level, high-resolution features are smoothed and channel-reduced by a 1×1 convolutional layer with 32 output feature channels and stride 1;
the up-sampled high-level features and the low-level features rich in spatial detail information are fused by feature-vector splicing followed by one 3×3 convolutional layer; after fusion the number of output feature channels is 320 and the resolution is 1/4 of the original-scale layout image resolution;
3 convolutional layers with 64 output feature channels and stride 1 are then used as the heads of 3 different segmentation network branches to extract features belonging to different object attributes;
bilinear interpolation then up-samples the features to the pre-specified resolution;
finally, one convolutional layer with 64 output feature channels and stride 1, and one convolutional layer with stride 1 whose channel number equals the number of segmentation recognition classes of the branch, are used as the top recognition structure of the segmentation network.
7. The document layout analysis method of claim 6, wherein every convolutional layer is followed by a batch-normalization (BN) layer and a ReLU activation layer.
8. The document layout analysis method of claim 6 wherein the high-level features are upsampled to the same resolution as the low-level features.
9. The document layout analysis method of claim 1, wherein each segmentation network branch uses 1 convolutional layer for feature extraction and channel dimension reduction, up-samples to a pre-specified resolution using bilinear interpolation, and uses one 3×3 convolutional layer and one 1×1 convolutional layer as the top recognition structure of the segmentation network.
10. The document layout analysis method of claim 1, wherein the number of segmentation recognition classes of each of the three segmentation network branches is 2.
CN202010637093.4A 2020-07-03 2020-07-03 Document layout analysis method Pending CN111881768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010637093.4A CN111881768A (en) 2020-07-03 2020-07-03 Document layout analysis method


Publications (1)

Publication Number Publication Date
CN111881768A (en) 2020-11-03

Family

ID=73151736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010637093.4A Pending CN111881768A (en) 2020-07-03 2020-07-03 Document layout analysis method

Country Status (1)

Country Link
CN (1) CN111881768A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100183225A1 (en) * 2009-01-09 2010-07-22 Rochester Institute Of Technology Methods for adaptive and progressive gradient-based multi-resolution color image segmentation and systems thereof
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN110032998A (en) * 2019-03-18 2019-07-19 华南师范大学 Character detecting method, system, device and the storage medium of natural scene picture
CN110837811A (en) * 2019-11-12 2020-02-25 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN110895695A (en) * 2019-07-31 2020-03-20 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Wen; Shi Tianyun; Li Ping; Ma Xiaoning: "Foreign object detection in EMU operation safety images based on deep learning", Journal of Transport Information and Safety, no. 06, 28 December 2019 (2019-12-28) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966691A (en) * 2021-04-14 2021-06-15 重庆邮电大学 Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN113361247A (en) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 Document layout analysis method, model training method, device and equipment
CN113420669A (en) * 2021-06-24 2021-09-21 武汉工程大学 Document layout analysis method and system based on multi-scale training and cascade detection
CN113420669B (en) * 2021-06-24 2022-05-10 武汉工程大学 Document layout analysis method and system based on multi-scale training and cascade detection
CN115294412A (en) * 2022-10-10 2022-11-04 临沂大学 Real-time coal rock segmentation network generation method based on deep learning
CN116129456A (en) * 2023-02-09 2023-05-16 广西壮族自治区自然资源遥感院 Method and system for identifying and inputting property rights and interests information
CN116129456B (en) * 2023-02-09 2023-07-25 广西壮族自治区自然资源遥感院 Method and system for identifying and inputting property rights and interests information

Similar Documents

Publication Publication Date Title
CN111881768A (en) Document layout analysis method
WO2019201035A1 (en) Method and device for identifying object node in image, terminal and computer readable storage medium
CN110782420A (en) Small target feature representation enhancement method based on deep learning
Huang et al. Rd-gan: Few/zero-shot chinese character style transfer via radical decomposition and rendering
US20110052062A1 (en) System and method for identifying pictures in documents
CN113569865B (en) Single sample image segmentation method based on class prototype learning
WO2022257578A1 (en) Method for recognizing text, and apparatus
CN110555433A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111080660A (en) Image segmentation method and device, terminal equipment and storage medium
CN114283430A (en) Cross-modal image-text matching training method and device, storage medium and electronic equipment
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113674146A (en) Image super-resolution
CN113903022A (en) Text detection method and system based on feature pyramid and attention fusion
CN111353544A (en) Target detection method based on improved Mixed Pooling-YOLOv3
CN112766409A (en) Feature fusion method for remote sensing image target detection
CN115311454A (en) Image segmentation method based on residual error feature optimization and attention mechanism
CN111898608B (en) Natural scene multi-language character detection method based on boundary prediction
CN112364709A (en) Cabinet intelligent asset checking method based on code identification
CN115909378A (en) Document text detection model training method and document text detection method
CN115810152A (en) Remote sensing image change detection method and device based on graph convolution and computer equipment
Baloun et al. ChronSeg: Novel Dataset for Segmentation of Handwritten Historical Chronicles.
CN112257708A (en) Character-level text detection method and device, computer equipment and storage medium
CN113610032A (en) Building identification method and device based on remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination