CN115578741A - Mask R-cnn algorithm and type segmentation based scanned file layout analysis method - Google Patents

Mask R-cnn algorithm and type segmentation based scanned file layout analysis method Download PDF

Info

Publication number
CN115578741A
Authority
CN
China
Prior art keywords
image
mask
title
model
rois
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211119268.8A
Other languages
Chinese (zh)
Inventor
赵卫东
张晓明
李旭健
肖智勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN202211119268.8A priority Critical patent/CN115578741A/en
Publication of CN115578741A publication Critical patent/CN115578741A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/147 Determination of region of interest
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18105 Extraction of features or characteristics of the image related to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19113 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/20 Combination of acquisition, preprocessing or recognition functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Abstract

The invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation, and belongs to the field of deep learning. The method adopts a type-segmentation scheme built on the Mask R-cnn algorithm, with the aim of improving the accuracy of layout analysis. Specifically, a colour image of the scanned document is taken as input; the tables are first identified and segmented, the image obtained after table segmentation is then used as input to identify and segment the illustrations, the red seals are removed next, and finally the titles are identified in the image from which the tables, illustrations and red seals have been removed; every part that remains unidentified is output as body text. For scanned documents with complex layouts, the method addresses the low accuracy of top-down, bottom-up and hybrid approaches, so that technologies such as image classification, text processing and OCR can be further optimised, and the analysis accuracy of the scanned-document layout is ultimately improved.

Description

Scanning file layout analysis method based on Mask R-cnn algorithm and type segmentation
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a scanning file layout analysis method based on Mask R-cnn algorithm and type segmentation.
Background
Analysing the layout of scanned documents is of great significance. First, scanned documents are portable, easy to store, and readable but not arbitrarily modifiable, so they are widely used. Second, digitising scanned documents preserves the material well, and layout analysis of the scanned document is an important step in that digitisation; analysing the layout correctly and effectively largely guarantees the accuracy of the result. Meanwhile, with the continuous emergence of technologies such as image classification and text processing, layout analysis becomes even more important: if the text and illustrations in a scanned document are to be processed intelligently, layout analysis is a necessary step.
Layout analysis methods fall into three categories: top-down, bottom-up and hybrid. Top-down methods start from the whole document image and segment the different regions out of the full page with a dedicated algorithm; they suit document images with standard, concise layouts. Bottom-up methods gradually merge local information, for example connected components, into document regions; they suit more complex layouts, but they are often inefficient and it is difficult to formulate a uniform merging rule.
Building on the strengths and weaknesses of these two families, the hybrid approach fuses the top-down and bottom-up algorithms and is the most common method in layout analysis. Yang proposed a texture-based analysis method that analyses a document image according to the line spacing and line-texture features of the tables it contains. Tian et al. proposed a combined bottom-up and top-down hybrid that merges regions according to the distance between connected components, the horizontal and vertical arrangement of their sizes, reference lines and similar characteristics. However, for scanned documents with relatively complex pages it is difficult to extract elements such as red seals and irregular tables; these areas are hard to segment and the accuracy is low. A method that identifies and classifies scanned documents more effectively is therefore needed.
Mask R-cnn is an algorithm for detecting objects in an image while generating a high-quality segmentation mask for each instance. The method, known as the masked region convolutional neural network, extends Faster R-cnn by adding a branch that predicts the target mask in parallel with the existing branches. However, when Mask R-cnn is used to handle layout analysis as a multi-class problem and all tables, illustrations, titles and text are identified at the same time, the accuracy is low, and certain problems remain when Mask R-cnn is used for multi-target detection.
Disclosure of Invention
Aiming at the low accuracy and high loss rate of scanned documents in layout analysis, the invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation. Its core is a new approach: the layout of the scanned document is analysed with a type-segmentation method built on the Mask R-cnn algorithm, and the tables, illustrations, titles, body text and red-seal parts are each identified accurately.
To achieve this purpose, the invention adopts the following technical scheme:
A scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation takes the scanned original colour image as input. It first identifies and segments the tables in the image; the image with the tables removed is then used as input to identify and segment the illustrations; the red seals in the image are removed next; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and every unidentified part is output as body text. The method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training;
step 2, acquiring an original color image of a scanned file in real time and preprocessing the original color image;
step 3, carrying out classification and identification based on the trained model, firstly identifying a table area in the color image based on the table identification model, extracting a table, and outputting the table and the image 1 after the table is extracted;
step 4, identifying an illustration area in the image 1, extracting an illustration, and outputting the illustration and the image 2 after the illustration is extracted;
step 5, removing the red seal in the image 2, and outputting to obtain an image 3;
step 6, finally, the titles in image 3 are identified and marked, the unmarked parts being body text, and the image with the titles marked is output.
Further, the specific process of step 1 is as follows:
step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation; finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors;
step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is also defined, and the Mask R-cnn model is obtained with a helper function;
step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved, so that training yields three models: a table recognition model, an illustration recognition model and a title recognition model; each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn;
an Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which is obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$;
wherein the learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs;
a multi-task loss function L is defined for each RoI during model training, and the learning rate is continuously updated by calling the optimizer; after training, the trained model parameters are saved;
L is obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
Further, in step 2, the original color image of the scanned file is subjected to image enhancement, and the enhanced image is converted into an RGB three-channel format.
Further, the specific process of step 3 is as follows:
step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps; that is, the scanned-document image is input into ResNet101 to extract tables. ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output;
step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution (interpolation) operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image;
step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, the feature map is then put in correspondence with the fixed features, and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation; bilinear interpolation is linear interpolation in two directions, linear interpolation being first carried out in the x direction, given by formulas (3) and (4), and interpolation then being carried out in the y direction, given by formula (5),

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2);
step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions;
step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
Further, the specific process of step 4 is as follows:
step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and the positions of the real illustration regions are determined from the coordinates of these four fixed points;
step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
Further, the specific process of step 5 is as follows:
step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value is set to 0, so that cv2.threshold() automatically finds the optimal threshold thresh;
step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
Further, the specific process of step 6 is as follows:
step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition;
step 6.2, the image is input into the title recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions;
step 6.3, the titles are recognised and boxed, the title regions are marked with the label "title", and the unrecognised parts are output as body text; finally, the image with the titles marked is output.
The invention has the following beneficial technical effects:
1. For the complex layouts of scanned documents, the method adopts a mask region convolutional neural network for layout analysis.
2. For the low accuracy of Mask R-cnn when tables, illustrations, titles and text are recognised at the same time, a type-segmentation method is provided: the scanned image is taken as input, the tables are first recognised and segmented, the image obtained after the tables are segmented is used as input to recognise and segment the illustrations, and finally the titles are recognised in the image with the tables and illustrations removed; unrecognised parts are output as body text, and the accuracy is improved by 10%-15%.
3. For the problem that the red seals present in some scanned documents lower the accuracy of title and text recognition, the red seal is removed, which further improves the recognition accuracy.
Drawings
FIG. 1 is a flow chart of a method for analyzing the layout of a scanned document based on the Mask R-cnn algorithm and type segmentation in accordance with the present invention;
FIG. 2 is a schematic diagram of the training process of three models in the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
As shown in FIG. 1, the invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation. An original colour image is obtained from the scanned document; the tables in the image are first identified and segmented; the image with the tables removed is used as input to identify and segment the illustrations; the red seals in the image are then removed; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and the unidentified parts are output as body text. The method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training. The specific process is as follows:
Step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation. Finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors.
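A minimal PyTorch Dataset sketch of this loading interface is given below; the in-memory data layout and the mask_to_boxes helper are assumptions made for illustration, not details taken from the invention.

```python
import torch
from torch.utils.data import Dataset


class ScanDocDataset(Dataset):
    """Loading interface of step 1.1 (sketch with an assumed data layout)."""

    def __init__(self, images, masks, transforms=None):
        # images: list of PIL images; masks: list of HxW binary instance masks
        self.images, self.masks, self.transforms = images, masks, transforms

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img, mask = self.images[idx], self.masks[idx]
        boxes = mask_to_boxes(mask)  # hypothetical helper: mask -> [x1, y1, x2, y2] per instance
        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32),
            "labels": torch.ones((len(boxes),), dtype=torch.int64),  # one foreground class
            "masks": torch.as_tensor(mask, dtype=torch.uint8).unsqueeze(0),
        }
        if self.transforms is not None:
            img = self.transforms(img)  # e.g. torchvision.transforms.ToTensor()
        return img, target
```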
Step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is defined, and the Mask R-cnn model is obtained with a helper function.
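Such a helper function might look like the sketch below, which follows the common TorchVision fine-tuning recipe; torchvision's maskrcnn_resnet50_fpn uses a ResNet-50-FPN backbone, whereas the invention specifies ResNet101, so in practice the backbone would be swapped (for example via torchvision.models.detection.backbone_utils.resnet_fpn_backbone). num_classes=2 corresponds to the two output classes (background plus one target type) described above.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_segmentation_model(num_classes=2):
    # COCO-pretrained Mask R-CNN; the ResNet101 backbone of the invention would be swapped in here.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # Replace the box predictor so it outputs num_classes classes.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask predictor for the same classes.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```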
Step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved. As shown in fig. 2, training yields three models: a table recognition model, an illustration recognition model and a title recognition model. Each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn.
An Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which can be obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$ avoids division by zero.
The learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs.
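In PyTorch the optimizer and this schedule can be set up as follows; only eps = 1e-8, the step size of 3 epochs and the factor of 10 come from the text, while the initial learning rate of 1e-4 is an assumed value.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)
# Reduce the learning rate by a factor of 10 every 3 epochs, as stated above.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
```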
During model training a multi-task loss function L is defined for each RoI, and the learning rate is continuously updated by calling the optimizer; after the training is finished, the trained model parameters are saved.
L can be obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
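A minimal training-loop sketch: in training mode torchvision's Mask R-CNN returns a dictionary of losses whose sum plays the role of the multi-task loss L of formula (2) (torchvision additionally includes the RPN objectness and box losses in this dictionary); the number of epochs and the file name are illustrative assumptions.

```python
import torch

num_epochs = 12  # illustrative value, not specified in the text
model.train()
for epoch in range(num_epochs):
    for images, targets in data_loader:      # loader from step 1.2
        loss_dict = model(images, targets)   # classification, box, mask (and RPN) losses
        loss = sum(loss_dict.values())       # multi-task loss, cf. formula (2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    lr_scheduler.step()                      # 10x decay every 3 epochs via StepLR
torch.save(model.state_dict(), "table_model.pth")
```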
Step 2, the original colour image of the scanned document is acquired in real time and preprocessed; specifically, image enhancement is applied to the original colour image of the scanned document and the enhanced image is converted into the RGB three-channel format.
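A possible preprocessing sketch with PIL; the contrast-enhancement factor is an assumption, since the text only requires enhancement followed by conversion to three-channel RGB.

```python
from PIL import Image, ImageEnhance


def preprocess(path):
    img = Image.open(path).convert("RGB")          # force the RGB three-channel format
    img = ImageEnhance.Contrast(img).enhance(1.5)  # enhancement factor assumed for illustration
    return img
```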
Step 3, classification and recognition are carried out with the trained models: the table regions in the colour image are first identified with the table recognition model, the tables are extracted, and the tables and image 1 (the image after table extraction) are output. The specific process is as follows:
Step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps; that is, the scanned-document image is input into ResNet101 to extract tables. ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output (these regions still contain recognition errors, which have to be removed in the following steps);
Step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
Step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification (foreground or background) and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution, i.e. interpolation, operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image.
Step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, and the feature map is then put in correspondence with the fixed features; the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, which is essentially linear interpolation in two directions. Linear interpolation is first carried out in the x direction, given by formulas (3) and (4); interpolation is then carried out in the y direction, given by formula (5), so that the discrete operation becomes continuous and the error when mapping back to the original image is smaller.

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2).
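The ROIAlign operation with bilinear sampling described in step 3.4 is available directly in torchvision; the toy values below (feature-map size, ROI coordinates, output size) are arbitrary and only illustrate the call.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)               # toy feature map (N, C, H, W)
rois = torch.tensor([[0.0, 4.3, 7.1, 20.6, 32.9]])   # (batch index, x1, y1, x2, y2)
pooled = roi_align(features, rois, output_size=(7, 7), sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```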
Step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions.
Step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
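Steps 3.5 and 3.6 could be sketched as follows; the 0.85 cutoff is an assumption equivalent to the "score greater than 85" above, since torchvision returns scores in [0, 1].

```python
import torch
import torchvision.transforms.functional as F


def extract_regions(model, pil_img, score_thresh=0.85, device="cpu"):
    """Run one recognition model and crop out the regions it is confident about."""
    model.eval()
    with torch.no_grad():
        pred = model([F.to_tensor(pil_img).to(device)])[0]
    crops, boxes = [], []
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score >= score_thresh:
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crops.append(pil_img.crop((x1, y1, x2, y2)))  # crop() as in step 3.6
            boxes.append((x1, y1, x2, y2))
    return crops, boxes
```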
Step 4, the illustration regions in image 1 are identified, the illustrations are extracted, and the illustrations and image 2 (the image after illustration extraction) are output. The specific process is as follows:
Step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
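For example, step 4.1 can be realised by painting each cut-out box with the background colour; looping pixel by pixel as described above gives the same result, and the white background colour is an assumption.

```python
from PIL import ImageDraw


def fill_with_background(pil_img, boxes, bg_color=(255, 255, 255)):
    """Blank the regions that were cut out so the next model ignores them."""
    draw = ImageDraw.Draw(pil_img)
    for box in boxes:
        draw.rectangle(box, fill=bg_color)  # equivalent to setting every pixel in the box
    return pil_img
```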
Step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and at this point the positions of the real illustration regions are determined from the coordinates of these four fixed points.
Step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
Step 5, removing the red seal in the image 2, and outputting to obtain an image 3; the specific process is as follows:
Step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value thresh is set to 0, so that cv2.threshold() automatically finds the optimal threshold.
Step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is then carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the corresponding red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
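A hedged OpenCV sketch of steps 5.1-5.2; thresholding the red channel is one way to read the description above (the seal is bright in the red channel, so Otsu binarisation turns it white), and the exact channel handling may differ from the original implementation.

```python
import cv2


def remove_red_seal(bgr_img):
    b, g, r = cv2.split(bgr_img)  # OpenCV stores colour images as B, G, R
    # Initial threshold 0 plus THRESH_OTSU lets cv2.threshold() pick the optimal value itself.
    thresh, binary = cv2.threshold(r, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Pixels above the optimal threshold become white (seal removed), the rest black.
    return binary
```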
Step 6, finally, the titles in image 3 are identified and marked, and the unmarked parts are output as body text. The specific process is as follows:
Step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition.
Step 6.2, the image is input into the title recognition model; first, the image with the illustrations removed is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions.
Step 6.3, the titles are recognised and boxed; the title regions are marked with the label "title", the unrecognised parts are body text, and the image with the titles marked is output.
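Boxing the titles and writing the label as in step 6.3 might look like this with OpenCV; the colour and font settings are arbitrary choices.

```python
import cv2


def draw_title_boxes(bgr_img, boxes, label="title"):
    for (x1, y1, x2, y2) in boxes:
        cv2.rectangle(bgr_img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(bgr_img, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return bgr_img
```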
Mask R-cnn can perform multi-class object detection, but its recognition performance is poor when it is applied directly to the layout analysis of scanned documents, so the multi-class task is converted into several binary-classification tasks that work in series: the tables are recognised first, then the illustrations, the red seal is removed next, and the titles are recognised last, with the unrecognised parts output as body text. The purpose is to improve the accuracy of layout analysis.
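Putting the serial binary tasks together, the overall flow could be sketched as below, reusing the helper sketches above; the three *_model arguments are the separately trained table, illustration and title models from step 1.3.

```python
import cv2
import numpy as np
from PIL import Image


def analyse_layout(pil_img, table_model, figure_model, title_model):
    tables, table_boxes = extract_regions(table_model, pil_img)     # stage 1: tables
    img1 = fill_with_background(pil_img.copy(), table_boxes)
    figures, figure_boxes = extract_regions(figure_model, img1)     # stage 2: illustrations
    img2 = fill_with_background(img1.copy(), figure_boxes)
    bgr = cv2.cvtColor(np.array(img2), cv2.COLOR_RGB2BGR)
    img3 = remove_red_seal(bgr)                                     # stage 3: red seal
    _, title_boxes = extract_regions(title_model, Image.fromarray(img3).convert("RGB"))
    marked = draw_title_boxes(cv2.cvtColor(img3, cv2.COLOR_GRAY2BGR), title_boxes)  # stage 4: titles
    return tables, figures, marked  # everything left unmarked is treated as body text
```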
To demonstrate the feasibility and superiority of the method, verification and comparison experiments were carried out. To guard against chance results, twenty images were selected in total: nine pages each containing only a table, an illustration or a title; nine pages each containing a table and an illustration, a table and a title, or an illustration and a title; and three pages each containing a table, an illustration and a title. The twenty images were input one by one; the following are the specific steps of the verification experiment:
and (3) experimental environment configuration: windows10 operating System, AMD Ryzen 3600XCPU@4.4GHz、16GB RAM、python3.8、PyTorch1.1.1
Different data sets were made for recognising the different types in the experiment; the specific information is shown in Table 1:
Table 1 Data set information
(Table 1 is reproduced as an image in the original publication; its contents are not available as text.)
Input: image X.
Output 1: the tables extracted from image X; X' is image X with the tables removed.
Output 2: the illustrations extracted from image X'; X'' is X' with the illustrations removed.
Output 3: image X'' with the titles boxed.
Step 1: the scanned document is input into the pre-trained ResNet101 to obtain the corresponding table regions;
Step 2: a predetermined number of ROIs is set for each point in these table regions, thereby obtaining a number of candidate ROIs;
Step 3: the candidate ROIs are fed into the RPN for binary classification (foreground or background) and BB regression, and part of the candidate ROIs is filtered out;
Step 4: a ROIAlign operation is performed on the remaining ROIs, i.e. the original image is first put in correspondence with the pixels of the feature map, and the feature map is then put in correspondence with the fixed features;
Step 5: the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, and the coordinates and labels whose score is greater than 85 are output as the result.
Step 6: the tables are segmented and saved;
Step 7: the cut-out parts of the cropped scanned image are set to the background colour and used as the input for illustration recognition;
Step 8: steps 1-7 are repeated to recognise the illustrations, and the scanned document with the illustrations removed is used as the input for title recognition;
Step 9: steps 1-7 are repeated to recognise the titles; the titles are boxed and marked with the label "title", and the unrecognised parts are output as text.
Output 1: the tables in image X;
Output 2: the illustrations in image X;
Output 3: image Y, i.e. image X with the titles marked after the tables, illustrations and red seal have been removed.
To test the merits of the invention, the method was compared with the PP-Structure method from PaddlePaddle, the AlexNet method proposed by Wang et al. and the multi-class Mask R-cnn method; the results are shown in Table 2.
Table 2 Comparison of the different methods
(Table 2 is reproduced as an image in the original publication; its contents are not available as text.)
As can be seen from Table 2, on the same data set the accuracy of the proposed method on scanned documents is 10%-15% higher than that of the multi-class Mask R-cnn method and 4.7% higher than that of the AlexNet method. The table recognition rate of PP-Structure is higher, mainly because its table recognition relies on the attention-based recognition model RARE, but its recognition rate for illustrations and titles is lower; compared with PP-Structure, the overall recognition rate of the method is about 1.2% higher. The method provided by the invention therefore achieves a better result in the layout analysis and classification accuracy of scanned documents.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (7)

1. A scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation, characterised in that the scanned original colour image is taken as input; the tables in the image are first identified and segmented; the image with the tables removed is used as input to identify and segment the illustrations; the red seals in the image are then removed; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and the unidentified parts are output as body text; the method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training;
step 2, acquiring an original color image of a scanned file in real time, and preprocessing the original color image;
step 3, carrying out classification and identification based on the trained model, firstly identifying a table area in the color image based on the table identification model, extracting a table, and outputting the table and the image 1 after the table is extracted;
step 4, identifying an illustration area in the image 1, extracting an illustration, and outputting the illustration and the image 2 after the illustration is extracted;
step 5, removing the red seal in the image 2, and outputting to obtain an image 3;
step 6, finally, the titles in image 3 are identified and marked, the unmarked parts being body text, and the image with the titles marked is output.
2. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation according to claim 1, wherein the specific process of the step 1 is as follows:
step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation; finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors;
step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is also defined, and the Mask R-cnn model is obtained with a helper function;
step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved, at which point three models are obtained by training: a table recognition model, an illustration recognition model and a title recognition model; each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn;
an Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which is obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$;
wherein the learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs;
a multi-task loss function L is defined for each RoI during model training, and the learning rate is continuously updated by calling the optimizer; after training, the trained model parameters are saved;
L is obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
3. The method for analyzing the layout of a scan file based on Mask R-cnn algorithm and type segmentation as claimed in claim 1, wherein in step 2, the original color image of the scan file is image-enhanced, and the enhanced image is converted into RGB three-channel format.
4. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and the type segmentation according to claim 1, wherein the specific process of the step 3 is as follows:
step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps, i.e. the scanned-document image is input into ResNet101 to extract tables; ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output;
step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution (interpolation) operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image;
step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, the feature map is then put in correspondence with the fixed features, and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation; bilinear interpolation is linear interpolation in two directions, linear interpolation being first carried out in the x direction, given by formulas (3) and (4), and interpolation then being carried out in the y direction, given by formula (5),

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2);
step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions;
step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
5. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and the type segmentation according to claim 1, wherein the specific process of the step 4 is as follows:
step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over once and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and the positions of the real illustration regions are determined from the coordinates of these four fixed points;
step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
6. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation as claimed in claim 1, wherein the specific process of the step 5 is as follows:
step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value is set to 0, so that cv2.threshold() automatically finds the optimal threshold thresh;
step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
7. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation according to claim 1, wherein the specific process of the step 6 is as follows:
step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition;
step 6.2, the image is input into the title recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions;
step 6.3, the titles are recognised and boxed, the title regions are marked with the label "title", and the unrecognised parts are output as body text; finally, the image with the titles marked is output.
CN202211119268.8A 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method Pending CN115578741A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211119268.8A CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211119268.8A CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Publications (1)

Publication Number Publication Date
CN115578741A true CN115578741A (en) 2023-01-06

Family

ID=84580340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211119268.8A Pending CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Country Status (1)

Country Link
CN (1) CN115578741A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994282A (en) * 2023-09-25 2023-11-03 安徽省交通规划设计研究总院股份有限公司 Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN116994282B (en) * 2023-09-25 2023-12-15 安徽省交通规划设计研究总院股份有限公司 Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN117473980A (en) * 2023-11-10 2024-01-30 中国医学科学院医学信息研究所 Structured analysis method of portable document format file and related products

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN111401372B (en) Method for extracting and identifying image-text information of scanned document
CN105528614B (en) A kind of recognition methods of the cartoon image space of a whole page and automatic recognition system
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN110766017B (en) Mobile terminal text recognition method and system based on deep learning
CN112036231B (en) Vehicle-mounted video-based lane line and pavement indication mark detection and identification method
CN110647795A (en) Form recognition method
CN115331245B (en) Table structure identification method based on image instance segmentation
CN112883795B (en) Rapid and automatic table extraction method based on deep neural network
CN113065396A (en) Automatic filing processing system and method for scanned archive image based on deep learning
CN113160185A (en) Method for guiding cervical cell segmentation by using generated boundary position
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN110414517B (en) Rapid high-precision identity card text recognition algorithm used for being matched with photographing scene
WO2021034841A1 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN115082776A (en) Electric energy meter automatic detection system and method based on image recognition
JP2011248702A (en) Image processing device, image processing method, image processing program, and program storage medium
JP2002207964A (en) Device and method for image processing for generating binary image from multi-valued image
KR101571681B1 (en) Method for analysing structure of document using homogeneous region
CN111444903B (en) Method, device and equipment for positioning characters in cartoon bubbles and readable storage medium
CN111612802A (en) Re-optimization training method based on existing image semantic segmentation model and application
CN114511862B (en) Form identification method and device and electronic equipment
CN115588208A (en) Full-line table structure identification method based on digital image processing technology
CN114332866A (en) Document curve separation and coordinate information extraction method based on image processing
CN112581487A (en) Method for automatically extracting detection area and positioning kernel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination