CN115578741A - Mask R-cnn algorithm and type segmentation based scanned file layout analysis method - Google Patents

Mask R-cnn algorithm and type segmentation based scanned file layout analysis method Download PDF

Info

Publication number
CN115578741A
Authority
CN
China
Prior art keywords
image
mask
title
model
rois
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211119268.8A
Other languages
Chinese (zh)
Inventor
赵卫东
张晓明
李旭健
肖智勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN202211119268.8A priority Critical patent/CN115578741A/en
Publication of CN115578741A publication Critical patent/CN115578741A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/147 Determination of region of interest
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18105 Extraction of features or characteristics of the image related to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19113 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/20 Combination of acquisition, preprocessing or recognition functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Abstract

The invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation, and belongs to the field of deep learning. The method adopts a type-segmentation scheme built on the Mask R-cnn algorithm, with the aim of improving the accuracy of layout analysis. Specifically, a colour image of the scanned document is taken as input; the tables are first identified and segmented, the image obtained after table segmentation is then used as input to identify and segment the illustrations, the red seals are removed next, and finally the titles are identified in the image from which the tables, illustrations and red seals have been removed; every part that remains unidentified is output as body text. For scanned documents with complex layouts, the method addresses the low accuracy of top-down, bottom-up and hybrid approaches, so that technologies such as image classification, text processing and OCR can be further optimised, and the analysis accuracy of the scanned-document layout is ultimately improved.

Description

Scanning file layout analysis method based on Mask R-cnn algorithm and type segmentation
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a scanning file layout analysis method based on Mask R-cnn algorithm and type segmentation.
Background
Analysing the layout of scanned documents is of great significance. First, scanned documents are portable, easy to store, and readable but not arbitrarily modifiable, so they are widely used. Second, digitising scanned documents preserves the material well, and layout analysis of the scanned document is an important step in that digitisation; analysing the layout correctly and effectively largely guarantees the accuracy of the result. Meanwhile, with the continuous emergence of technologies such as image classification and text processing, layout analysis becomes even more important: if the text and illustrations in a scanned document are to be processed intelligently, layout analysis is a necessary step.
Layout analysis methods fall into three categories: top-down, bottom-up and hybrid. Top-down methods start from the whole document image and segment the different regions out of the full page with a dedicated algorithm; they suit document images with standard, concise layouts. Bottom-up methods gradually merge local information, for example connected components, into document regions; they suit more complex layouts, but they are often inefficient and it is difficult to formulate a uniform merging rule.
Building on the strengths and weaknesses of these two families, the hybrid approach fuses the top-down and bottom-up algorithms and is the most common method in layout analysis. Yang proposed a texture-based analysis method that analyses a document image according to the line spacing and line-texture features of the tables it contains. Tian et al. proposed a combined bottom-up and top-down hybrid that merges regions according to the distance between connected components, the horizontal and vertical arrangement of their sizes, reference lines and similar characteristics. However, for scanned documents with relatively complex pages it is difficult to extract elements such as red seals and irregular tables; these areas are hard to segment and the accuracy is low. A method that identifies and classifies scanned documents more effectively is therefore needed.
Mask R-cnn is an algorithm for detecting objects in an image while generating a high-quality segmentation mask for each instance. The method, known as the masked region convolutional neural network, extends Faster R-cnn by adding a branch that predicts the target mask in parallel with the existing branches. However, when Mask R-cnn is used to handle layout analysis as a multi-class problem and all tables, illustrations, titles and text are identified at the same time, the accuracy is low, and certain problems remain when Mask R-cnn is used for multi-target detection.
Disclosure of Invention
Aiming at the low accuracy and high loss rate of scanned documents in layout analysis, the invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation. Its core is a new approach: the layout of the scanned document is analysed with a type-segmentation method built on the Mask R-cnn algorithm, and the tables, illustrations, titles, body text and red-seal parts are each identified accurately.
To achieve this purpose, the invention adopts the following technical scheme:
A scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation takes the scanned original colour image as input. It first identifies and segments the tables in the image; the image with the tables removed is then used as input to identify and segment the illustrations; the red seals in the image are removed next; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and every unidentified part is output as body text. The method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training;
step 2, acquiring an original color image of a scanned file in real time and preprocessing the original color image;
step 3, carrying out classification and identification based on the trained model, firstly identifying a table area in the color image based on the table identification model, extracting a table, and outputting the table and the image 1 after the table is extracted;
step 4, identifying an illustration area in the image 1, extracting an illustration, and outputting the illustration and the image 2 after the illustration is extracted;
step 5, removing the red seal in the image 2, and outputting to obtain an image 3;
step 6, finally, the titles in image 3 are identified and marked, the unmarked parts being body text, and the image with the titles marked is output.
Further, the specific process of step 1 is as follows:
step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation; finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors;
step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is also defined, and the Mask R-cnn model is obtained with a helper function;
step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved, so that training yields three models: a table recognition model, an illustration recognition model and a title recognition model; each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn;
an Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which is obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$;
wherein the learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs;
a multi-task loss function L is defined for each RoI during model training, and the learning rate is continuously updated by calling the optimizer; after training, the trained model parameters are saved;
L is obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
Further, in step 2, the original color image of the scanned file is subjected to image enhancement, and the enhanced image is converted into an RGB three-channel format.
Further, the specific process of step 3 is as follows:
step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps; that is, the scanned-document image is input into ResNet101 to extract tables. ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output;
step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution (interpolation) operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image;
step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, the feature map is then put in correspondence with the fixed features, and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation; bilinear interpolation is linear interpolation in two directions, linear interpolation being first carried out in the x direction, given by formulas (3) and (4), and interpolation then being carried out in the y direction, given by formula (5),

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2);
step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions;
step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
Further, the specific process of step 4 is as follows:
step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and the positions of the real illustration regions are determined from the coordinates of these four fixed points;
step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
Further, the specific process of step 5 is as follows:
step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value is set to 0, so that cv2.threshold() automatically finds the optimal threshold thresh;
step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
Further, the specific process of step 6 is as follows:
step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition;
step 6.2, the image is input into the title recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions;
step 6.3, the titles are recognised and boxed, the title regions are marked with the label "title", and the unrecognised parts are output as body text; finally, the image with the titles marked is output.
The invention has the following beneficial technical effects:
1. For the complex layouts of scanned documents, the method adopts a mask region convolutional neural network for layout analysis.
2. For the low accuracy of Mask R-cnn when tables, illustrations, titles and text are recognised at the same time, a type-segmentation method is provided: the scanned image is taken as input, the tables are first recognised and segmented, the image obtained after the tables are segmented is used as input to recognise and segment the illustrations, and finally the titles are recognised in the image with the tables and illustrations removed; unrecognised parts are output as body text, and the accuracy is improved by 10%-15%.
3. For the problem that the red seals present in some scanned documents lower the accuracy of title and text recognition, the red seal is removed, which further improves the recognition accuracy.
Drawings
FIG. 1 is a flow chart of a method for analyzing the layout of a scanned document based on the Mask R-cnn algorithm and type segmentation in accordance with the present invention;
FIG. 2 is a schematic diagram of the training process of three models in the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
As shown in FIG. 1, the invention provides a scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation. An original colour image is obtained from the scanned document; the tables in the image are first identified and segmented; the image with the tables removed is used as input to identify and segment the illustrations; the red seals in the image are then removed; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and the unidentified parts are output as body text. The method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training. The specific process is as follows:
Step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation. Finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors.
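A minimal PyTorch Dataset sketch of this loading interface is given below; the in-memory data layout and the mask_to_boxes helper are assumptions made for illustration, not details taken from the invention.

```python
import torch
from torch.utils.data import Dataset


class ScanDocDataset(Dataset):
    """Loading interface of step 1.1 (sketch with an assumed data layout)."""

    def __init__(self, images, masks, transforms=None):
        # images: list of PIL images; masks: list of HxW binary instance masks
        self.images, self.masks, self.transforms = images, masks, transforms

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img, mask = self.images[idx], self.masks[idx]
        boxes = mask_to_boxes(mask)  # hypothetical helper: mask -> [x1, y1, x2, y2] per instance
        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32),
            "labels": torch.ones((len(boxes),), dtype=torch.int64),  # one foreground class
            "masks": torch.as_tensor(mask, dtype=torch.uint8).unsqueeze(0),
        }
        if self.transforms is not None:
            img = self.transforms(img)  # e.g. torchvision.transforms.ToTensor()
        return img, target
```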
Step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is defined, and the Mask R-cnn model is obtained with a helper function.
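Such a helper function might look like the sketch below, which follows the common TorchVision fine-tuning recipe; torchvision's maskrcnn_resnet50_fpn uses a ResNet-50-FPN backbone, whereas the invention specifies ResNet101, so in practice the backbone would be swapped (for example via torchvision.models.detection.backbone_utils.resnet_fpn_backbone). num_classes=2 corresponds to the two output classes (background plus one target type) described above.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_segmentation_model(num_classes=2):
    # COCO-pretrained Mask R-CNN; the ResNet101 backbone of the invention would be swapped in here.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # Replace the box predictor so it outputs num_classes classes.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask predictor for the same classes.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```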
Step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved. As shown in fig. 2, training yields three models: a table recognition model, an illustration recognition model and a title recognition model. Each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn.
An Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which can be obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$ avoids division by zero.
The learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs.
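In PyTorch the optimizer and this schedule can be set up as follows; only eps = 1e-8, the step size of 3 epochs and the factor of 10 come from the text, while the initial learning rate of 1e-4 is an assumed value.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)
# Reduce the learning rate by a factor of 10 every 3 epochs, as stated above.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
```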
During model training a multi-task loss function L is defined for each RoI, and the learning rate is continuously updated by calling the optimizer; after the training is finished, the trained model parameters are saved.
L can be obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
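A minimal training-loop sketch: in training mode torchvision's Mask R-CNN returns a dictionary of losses whose sum plays the role of the multi-task loss L of formula (2) (torchvision additionally includes the RPN objectness and box losses in this dictionary); the number of epochs and the file name are illustrative assumptions.

```python
import torch

num_epochs = 12  # illustrative value, not specified in the text
model.train()
for epoch in range(num_epochs):
    for images, targets in data_loader:      # loader from step 1.2
        loss_dict = model(images, targets)   # classification, box, mask (and RPN) losses
        loss = sum(loss_dict.values())       # multi-task loss, cf. formula (2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    lr_scheduler.step()                      # 10x decay every 3 epochs via StepLR
torch.save(model.state_dict(), "table_model.pth")
```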
Step 2, the original colour image of the scanned document is acquired in real time and preprocessed; specifically, image enhancement is applied to the original colour image of the scanned document and the enhanced image is converted into the RGB three-channel format.
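A possible preprocessing sketch with PIL; the contrast-enhancement factor is an assumption, since the text only requires enhancement followed by conversion to three-channel RGB.

```python
from PIL import Image, ImageEnhance


def preprocess(path):
    img = Image.open(path).convert("RGB")          # force the RGB three-channel format
    img = ImageEnhance.Contrast(img).enhance(1.5)  # enhancement factor assumed for illustration
    return img
```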
Step 3, classification and recognition are carried out with the trained models: the table regions in the colour image are first identified with the table recognition model, the tables are extracted, and the tables and image 1 (the image after table extraction) are output. The specific process is as follows:
Step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps; that is, the scanned-document image is input into ResNet101 to extract tables. ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output (these regions still contain recognition errors, which have to be removed in the following steps);
Step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
Step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification (foreground or background) and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution, i.e. interpolation, operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image.
Step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, and the feature map is then put in correspondence with the fixed features; the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, which is essentially linear interpolation in two directions. Linear interpolation is first carried out in the x direction, given by formulas (3) and (4); interpolation is then carried out in the y direction, given by formula (5), so that the discrete operation becomes continuous and the error when mapping back to the original image is smaller.

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2).
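The ROIAlign operation with bilinear sampling described in step 3.4 is available directly in torchvision; the toy values below (feature-map size, ROI coordinates, output size) are arbitrary and only illustrate the call.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)               # toy feature map (N, C, H, W)
rois = torch.tensor([[0.0, 4.3, 7.1, 20.6, 32.9]])   # (batch index, x1, y1, x2, y2)
pooled = roi_align(features, rois, output_size=(7, 7), sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```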
Step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions.
Step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
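Steps 3.5 and 3.6 could be sketched as follows; the 0.85 cutoff is an assumption equivalent to the "score greater than 85" above, since torchvision returns scores in [0, 1].

```python
import torch
import torchvision.transforms.functional as F


def extract_regions(model, pil_img, score_thresh=0.85, device="cpu"):
    """Run one recognition model and crop out the regions it is confident about."""
    model.eval()
    with torch.no_grad():
        pred = model([F.to_tensor(pil_img).to(device)])[0]
    crops, boxes = [], []
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score >= score_thresh:
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crops.append(pil_img.crop((x1, y1, x2, y2)))  # crop() as in step 3.6
            boxes.append((x1, y1, x2, y2))
    return crops, boxes
```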
Step 4, the illustration regions in image 1 are identified, the illustrations are extracted, and the illustrations and image 2 (the image after illustration extraction) are output. The specific process is as follows:
Step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
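For example, step 4.1 can be realised by painting each cut-out box with the background colour; looping pixel by pixel as described above gives the same result, and the white background colour is an assumption.

```python
from PIL import ImageDraw


def fill_with_background(pil_img, boxes, bg_color=(255, 255, 255)):
    """Blank the regions that were cut out so the next model ignores them."""
    draw = ImageDraw.Draw(pil_img)
    for box in boxes:
        draw.rectangle(box, fill=bg_color)  # equivalent to setting every pixel in the box
    return pil_img
```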
Step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and at this point the positions of the real illustration regions are determined from the coordinates of these four fixed points.
Step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
Step 5, removing the red seal in the image 2, and outputting to obtain an image 3; the specific process is as follows:
Step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value thresh is set to 0, so that cv2.threshold() automatically finds the optimal threshold.
Step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is then carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the corresponding red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
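A hedged OpenCV sketch of steps 5.1-5.2; thresholding the red channel is one way to read the description above (the seal is bright in the red channel, so Otsu binarisation turns it white), and the exact channel handling may differ from the original implementation.

```python
import cv2


def remove_red_seal(bgr_img):
    b, g, r = cv2.split(bgr_img)  # OpenCV stores colour images as B, G, R
    # Initial threshold 0 plus THRESH_OTSU lets cv2.threshold() pick the optimal value itself.
    thresh, binary = cv2.threshold(r, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Pixels above the optimal threshold become white (seal removed), the rest black.
    return binary
```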
Step 6, finally, the titles in image 3 are identified and marked, and the unmarked parts are output as body text. The specific process is as follows:
Step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition.
Step 6.2, the image is input into the title recognition model; first, the image with the illustrations removed is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions.
Step 6.3, the titles are recognised and boxed; the title regions are marked with the label "title", the unrecognised parts are body text, and the image with the titles marked is output.
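Boxing the titles and writing the label as in step 6.3 might look like this with OpenCV; the colour and font settings are arbitrary choices.

```python
import cv2


def draw_title_boxes(bgr_img, boxes, label="title"):
    for (x1, y1, x2, y2) in boxes:
        cv2.rectangle(bgr_img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(bgr_img, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return bgr_img
```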
Mask R-cnn can perform multi-class object detection, but its recognition performance is poor when it is applied directly to the layout analysis of scanned documents, so the multi-class task is converted into several binary-classification tasks that work in series: the tables are recognised first, then the illustrations, the red seal is removed next, and the titles are recognised last, with the unrecognised parts output as body text. The purpose is to improve the accuracy of layout analysis.
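Putting the serial binary tasks together, the overall flow could be sketched as below, reusing the helper sketches above; the three *_model arguments are the separately trained table, illustration and title models from step 1.3.

```python
import cv2
import numpy as np
from PIL import Image


def analyse_layout(pil_img, table_model, figure_model, title_model):
    tables, table_boxes = extract_regions(table_model, pil_img)     # stage 1: tables
    img1 = fill_with_background(pil_img.copy(), table_boxes)
    figures, figure_boxes = extract_regions(figure_model, img1)     # stage 2: illustrations
    img2 = fill_with_background(img1.copy(), figure_boxes)
    bgr = cv2.cvtColor(np.array(img2), cv2.COLOR_RGB2BGR)
    img3 = remove_red_seal(bgr)                                     # stage 3: red seal
    _, title_boxes = extract_regions(title_model, Image.fromarray(img3).convert("RGB"))
    marked = draw_title_boxes(cv2.cvtColor(img3, cv2.COLOR_GRAY2BGR), title_boxes)  # stage 4: titles
    return tables, figures, marked  # everything left unmarked is treated as body text
```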
To demonstrate the feasibility and superiority of the method, verification and comparison experiments were carried out. To guard against chance results, twenty images were selected in total: nine pages each containing only a table, an illustration or a title; nine pages each containing a table and an illustration, a table and a title, or an illustration and a title; and three pages each containing a table, an illustration and a title. The twenty images were input one by one; the following are the specific steps of the verification experiment:
and (3) experimental environment configuration: windows10 operating System, AMD Ryzen 3600XCPU@4.4GHz、16GB RAM、python3.8、PyTorch1.1.1
Different data sets were made for recognising the different types in the experiment; the specific information is shown in Table 1:
Table 1 Data set information
(Table 1 is reproduced as an image in the original publication; its contents are not available as text.)
Input: image X.
Output 1: the tables extracted from image X; X' is image X with the tables removed.
Output 2: the illustrations extracted from image X'; X'' is X' with the illustrations removed.
Output 3: image X'' with the titles boxed.
Step 1: the scanned document is input into the pre-trained ResNet101 to obtain the corresponding table regions;
Step 2: a predetermined number of ROIs is set for each point in these table regions, thereby obtaining a number of candidate ROIs;
Step 3: the candidate ROIs are fed into the RPN for binary classification (foreground or background) and BB regression, and part of the candidate ROIs is filtered out;
Step 4: a ROIAlign operation is performed on the remaining ROIs, i.e. the original image is first put in correspondence with the pixels of the feature map, and the feature map is then put in correspondence with the fixed features;
Step 5: the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, and the coordinates and labels whose score is greater than 85 are output as the result.
Step 6: the tables are segmented and saved;
Step 7: the cut-out parts of the cropped scanned image are set to the background colour and used as the input for illustration recognition;
Step 8: steps 1-7 are repeated to recognise the illustrations, and the scanned document with the illustrations removed is used as the input for title recognition;
Step 9: steps 1-7 are repeated to recognise the titles; the titles are boxed and marked with the label "title", and the unrecognised parts are output as text.
Output 1: the tables in image X;
Output 2: the illustrations in image X;
Output 3: image Y, i.e. image X with the titles marked after the tables, illustrations and red seal have been removed.
To test the merits of the invention, the method was compared with the PP-Structure method from PaddlePaddle, the AlexNet method proposed by Wang et al. and the multi-class Mask R-cnn method; the results are shown in Table 2.
Table 2 Comparison of the different methods
(Table 2 is reproduced as an image in the original publication; its contents are not available as text.)
As can be seen from Table 2, on the same data set the accuracy of the proposed method on scanned documents is 10%-15% higher than that of the multi-class Mask R-cnn method and 4.7% higher than that of the AlexNet method. The table recognition rate of PP-Structure is higher, mainly because its table recognition relies on the attention-based recognition model RARE, but its recognition rate for illustrations and titles is lower; compared with PP-Structure, the overall recognition rate of the method is about 1.2% higher. The method provided by the invention therefore achieves a better result in the layout analysis and classification accuracy of scanned documents.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (7)

1. A scanned-document layout analysis method based on the Mask R-cnn algorithm and type segmentation, characterised in that the scanned original colour image is taken as input; the tables in the image are first identified and segmented; the image with the tables removed is used as input to identify and segment the illustrations; the red seals in the image are then removed; finally, title recognition is performed on the image from which the tables, illustrations and red seals have been removed, and the unidentified parts are output as body text; the method specifically comprises the following steps:
step 1, defining a loading interface and a Mask R-cnn model, and carrying out model training;
step 2, acquiring an original color image of a scanned file in real time, and preprocessing the original color image;
step 3, carrying out classification and identification based on the trained model, firstly identifying a table area in the color image based on the table identification model, extracting a table, and outputting the table and the image 1 after the table is extracted;
step 4, identifying an illustration area in the image 1, extracting an illustration, and outputting the illustration and the image 2 after the illustration is extracted;
step 5, removing the red seal in the image 2, and outputting to obtain an image 3;
step 6, finally, the titles in image 3 are identified and marked, the unmarked parts being body text, and the image with the titles marked is output.
2. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation according to claim 1, wherein the specific process of the step 1 is as follows:
step 1.1, to define the loading interface: first, the acquired historical data set is rewritten, shuffled and split into a training set and a test set; second, the mapping relation between each image and its mask is obtained, the mask is converted into a Tensor, and the coordinates and label of each mask are obtained for image segmentation; finally, the images in the data set are enhanced and the original images are converted into PyTorch tensors;
step 1.2, to define the Mask R-cnn model: the model architecture and the number of input features of the model are obtained, and the number of feature classes and mask classes of the output model is modified to 2; a training and validation data loader is also defined, and the Mask R-cnn model is obtained with a helper function;
step 1.3, different data sets are used for recognising tables, illustrations, titles and text, and separate models are trained for recognising tables, illustrations and titles; the trained model parameters are finally saved, at which point three models are obtained by training: a table recognition model, an illustration recognition model and a title recognition model; each recognition model is trained with Mask R-cnn, and the residual network ResNet101 is selected as the backbone feature extraction network of Mask R-cnn;
an Adam optimizer is used in model training, so as to adjust the model's update weights and the bias parameter $\theta_t$, which is obtained from formula (1),

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{1}$$

where $\theta_{t-1}$ is the bias parameter obtained from the previous iteration, $\alpha$ is the default learning rate, $\hat{m}_t$ is the mean of the gradient, $\hat{v}_t$ is the exponential moving average of the squared gradient, and $\epsilon = 10^{-8}$;
wherein the learning-rate schedule is set to reduce the learning rate by a factor of 10 every 3 epochs;
a multi-task loss function L is defined for each RoI during model training, and the learning rate is continuously updated by calling the optimizer; after training, the trained model parameters are saved;
L is obtained from formula (2),

$$L = L_{CLS} + L_{BOX} + L_{mask} \tag{2}$$

where $L_{CLS}$ is the classification loss, $L_{BOX}$ is the bounding-box regression loss, and $L_{mask}$ is the newly added mask loss.
3. The method for analyzing the layout of a scan file based on Mask R-cnn algorithm and type segmentation as claimed in claim 1, wherein in step 2, the original color image of the scan file is image-enhanced, and the enhanced image is converted into RGB three-channel format.
4. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and the type segmentation according to claim 1, wherein the specific process of the step 3 is as follows:
step 3.1, the preprocessed scanned-document image is fed into the pre-trained residual network ResNet101 with a feature pyramid network (FPN) to obtain the corresponding feature maps, i.e. the scanned-document image is input into ResNet101 to extract tables; ResNet101 is divided into 5 stages, with C1-C5 as the output of each stage, and at this point all possible table regions of the scanned document are output;
step 3.2, a candidate region of interest (ROI) is set for each point in all possible table regions, thereby obtaining a number of candidate ROIs;
step 3.3, the candidate ROIs are fed into the Region Proposal Network (RPN) for binary classification and bounding-box (BB) regression, so as to filter out part of the candidate ROIs; the specific process is as follows: first, convolution and pooling are applied to the candidate ROIs, continuously reducing the size of their feature maps; then a deconvolution (interpolation) operation is performed, continuously enlarging the feature map, and finally each pixel value is classified, thereby achieving accurate segmentation of the input image;
step 3.4, a ROIAlign operation is performed on the remaining candidate ROIs: the pixels of the original image are first put in correspondence with the feature map, the feature map is then put in correspondence with the fixed features, and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation; bilinear interpolation is linear interpolation in two directions, linear interpolation being first carried out in the x direction, given by formulas (3) and (4), and interpolation then being carried out in the y direction, given by formula (5),

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{3}$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{4}$$

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{5}$$

where f(P) denotes the value of the function f at the point P = (x, y), with R_1 = (x, y_1) and R_2 = (x, y_2), and the function is assumed to be known at the four points Q_11 = (x_1, y_1), Q_21 = (x_2, y_1), Q_12 = (x_1, y_2) and Q_22 = (x_2, y_2);
step 3.5, the candidate ROIs are classified, bounding-box regression and mask prediction are performed, and a prediction score, coordinates and a label are returned as the result; the recognition accuracy is judged from the score, the coordinates and labels whose score is greater than 85 are output as the result, and the positions given by the output coordinates are the real table regions;
step 3.6, the real table regions are segmented and saved; after the coordinates of the real tables are obtained, they are cut out and saved with the crop() method of PIL, and the saved tables and image 1 are output.
5. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and the type segmentation according to claim 1, wherein the specific process of the step 4 is as follows:
step 4.1, the cut-out parts of image 1 are set to the background colour: all pixel points of the cut-out parts are looped over once and each pixel is set to the background colour, and the result is used as the input for illustration recognition;
step 4.2, the image is input into the illustration recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the illustration regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs, the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, and the positions of the real illustration regions are determined from the coordinates of these four fixed points;
step 4.3, the illustration regions are segmented and saved; after the coordinate positions of the illustration regions are obtained, they are cut out and saved with the crop() method of PIL, and the saved illustrations and image 2 are output.
6. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation as claimed in claim 1, wherein the specific process of the step 5 is as follows:
step 5.1, set the threshold: the parameter cv2.THRESH_OTSU is passed to the cv2.threshold() function and the initial threshold value is set to 0, so that cv2.threshold() automatically finds the optimal threshold thresh;
step 5.2, remove the red seal: the three channels of the RGB image are split to obtain the grey-scale images of the blue, green and red channels, and binarisation is carried out with the cv2.threshold() method of OpenCV; the pixels of the grey image whose grey value is greater than the optimal threshold thresh are turned white and those below the optimal threshold are turned black, so the red-seal part is turned white and the red seal is considered removed; after processing, image 3, with the red seal removed, is output.
7. The method for analyzing the layout of the scan file based on the Mask R-cnn algorithm and type segmentation according to claim 1, wherein the specific process of the step 6 is as follows:
step 6.1, the cut-out parts of image 3 are set to the background colour: all pixel points of the cut-out parts are looped over and each pixel is set to the background colour, and the result is used as the input for title recognition;
step 6.2, the image is input into the title recognition model; first, the image is fed into the pre-trained residual network ResNet101 to obtain the corresponding feature map; second, a candidate ROI is set for each point in the title regions, giving a number of candidate ROIs; the candidate ROIs are then fed into the Region Proposal Network (RPN) to filter out part of them; finally, a ROIAlign operation is performed on the remaining ROIs and the pixel values at the coordinates of the four fixed points are obtained by bilinear interpolation, the coordinates of these four fixed points being the positions of the real title regions;
step 6.3, the titles are recognised and boxed, the title regions are marked with the label "title", and the unrecognised parts are output as body text; finally, the image with the titles marked is output.
CN202211119268.8A 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method Pending CN115578741A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211119268.8A CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211119268.8A CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Publications (1)

Publication Number Publication Date
CN115578741A true CN115578741A (en) 2023-01-06

Family

ID=84580340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211119268.8A Pending CN115578741A (en) 2022-09-14 2022-09-14 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method

Country Status (1)

Country Link
CN (1) CN115578741A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994282A (en) * 2023-09-25 2023-11-03 安徽省交通规划设计研究总院股份有限公司 Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN116994282B (en) * 2023-09-25 2023-12-15 安徽省交通规划设计研究总院股份有限公司 Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN117473980A (en) * 2023-11-10 2024-01-30 中国医学科学院医学信息研究所 Structured analysis method of portable document format file and related products

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN111401372B (en) Method for extracting and identifying image-text information of scanned document
CN105528614B (en) A kind of recognition methods of the cartoon image space of a whole page and automatic recognition system
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN110766017B (en) Mobile terminal text recognition method and system based on deep learning
CN112036231B (en) Vehicle-mounted video-based lane line and pavement indication mark detection and identification method
CN110647795A (en) Form recognition method
CN115331245B (en) Table structure identification method based on image instance segmentation
CN112883795B (en) Rapid and automatic table extraction method based on deep neural network
CN113065396A (en) Automatic filing processing system and method for scanned archive image based on deep learning
CN113160185A (en) Method for guiding cervical cell segmentation by using generated boundary position
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN110414517B (en) Rapid high-precision identity card text recognition algorithm used for being matched with photographing scene
WO2021034841A1 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN115082776A (en) Electric energy meter automatic detection system and method based on image recognition
JP2011248702A (en) Image processing device, image processing method, image processing program, and program storage medium
JP2002207964A (en) Device and method for image processing for generating binary image from multi-valued image
KR101571681B1 (en) Method for analysing structure of document using homogeneous region
CN111444903B (en) Method, device and equipment for positioning characters in cartoon bubbles and readable storage medium
CN111612802A (en) Re-optimization training method based on existing image semantic segmentation model and application
CN114511862B (en) Form identification method and device and electronic equipment
CN115588208A (en) Full-line table structure identification method based on digital image processing technology
CN114332866A (en) Document curve separation and coordinate information extraction method based on image processing
CN112581487A (en) Method for automatically extracting detection area and positioning kernel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination