CN116740720A - Photographing document bending correction method and device based on key point guidance


Info

Publication number
CN116740720A
Authority
CN
China
Legal status
Granted
Application number
CN202311017033.2A
Other languages
Chinese (zh)
Other versions
CN116740720B (en)
Inventor
王秋锋
张伟光
Current Assignee
Xian Jiaotong Liverpool University
Original Assignee
Xian Jiaotong Liverpool University
Application filed by Xian Jiaotong Liverpool University
Priority to CN202311017033.2A
Publication of CN116740720A
Application granted
Publication of CN116740720B
Status: Active

Classifications

    • G06V30/12 Detection or correction of errors, e.g. by rescanning the pattern
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/19107 Clustering techniques
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion


Abstract

The application relates to the technical field of document recognition, and in particular to a photographed-document bending correction method and device based on key point guidance. The method comprises the following steps: inputting a curved document picture; constructing a reference sparse control point prediction module to predict reference control points; constructing a dense 3D shape prediction module and extracting key points based on deformation gradients; constructing a bottom-up text line clustering module and extracting content key points based on adjacent text lines; aligning the obtained deformation key points and content key points into the 2D reference control point grid to realize multi-level control point fusion; and correcting the multi-level control points under the guidance of the local key points and outputting the corrected document. The application flexibly solves the problem of correcting curved documents.

Description

Photographing document bending correction method and device based on key point guidance
Technical Field
The application relates to the technical field of document identification, in particular to a photographing document bending correction method and device based on key point guidance.
Background
With the increasing popularity of smartphones in recent years, people have become accustomed to digitizing paper documents (such as book pages, notes, contracts, and leaflets) by photographing them with a phone rather than using conventional devices such as scanners. Although photographed documents greatly facilitate daily life, they suffer from lower quality, in particular from various geometric bending modes (mainly angle inclination, curling, folding, wrinkling, and the like). Compared with scanned documents, these bending modes not only harm readability but also severely reduce the accuracy of recognition algorithms designed for flat documents (e.g., PDF documents, scanned documents), presenting new challenges to document analysis research. The automatic bending correction task for photographed documents aims to eliminate angle inclination, surface bending, curling, and wrinkling in the document image and to recover a flat, regular, high-quality document image from it.
Under the current wave of deep learning, the photographed-document bending correction task is generally formulated as a pixel-level control point mapping regression paradigm, where the control point mapping is also called backward mapping or forward mapping regression. Typical works include DocUNet, DewarpNet, DDCP, and Marior. This paradigm describes the geometric deformation of a curved document by learning a set of 2D deformation control point maps from the curved document to the flat document, and finally uses the predicted control point maps for pixel sampling to achieve bending correction. Viewed from the source of the problem, this paradigm still leaves much room for improvement in two respects. On the one hand, from the functional point of view, a photographed document is an important container for storing and propagating information, and the final goal of correction is to improve its readability for both humans and machines (machine readability generally refers to the recognition rate of OCR algorithms). How to explicitly construct a module in the neural network structure that senses and corrects document layout elements therefore becomes a problem to be solved. On the other hand, from the viewpoint of geometric bending characteristics, paper documents inevitably acquire various geometric bending modes through use and propagation, making this essentially a geometric deformation correction problem; yet the current 2D deformation control point mapping paradigm still handles some complex bending modes (e.g., localized folds, high curvature) poorly. How to explicitly design a module that senses different degrees of bending, so as to remedy the current paradigm's inability to handle complex bending modes, is another question worth investigating.
Disclosure of Invention
The application provides a photographing document bending correction method and device based on key point guidance, which address two problems: how to explicitly construct a module for sensing and correcting document layout elements in the neural network structure, and how to explicitly design a module that senses different degrees of bending so as to remedy the current paradigm's inability to handle complex bending modes. The application provides the following technical scheme:
in a first aspect, the present application provides a method for correcting a curvature of a photographed document based on key point guidance, the method comprising:
inputting a curved document picture;
constructing a reference sparse control point prediction module based on the ViT structure, and preliminarily encoding an input image into sequence features suitable for ViT processing by using an image slicer; providing absolute position coding information for the serialized image features by using cosine position coding after serialization, and finally realizing prediction of a reference control point based on a self-attention mechanism;
constructing a dense 3D shape prediction module based on the UNet structure, and extracting key points based on deformation gradients;
constructing a bottom-up text line clustering module based on a MaX-DeepLab network, and extracting content key points based on adjacent text lines;
based on the obtained deformation key points and content key points, aligning the key points into a 2D reference control point grid, and finally realizing multi-level control point fusion;
and correcting the multi-level control points under the guidance of the local key points, and outputting corrected documents.
In a specific embodiment, the input curved document picture includes:
a photographed document image I₀ of arbitrary size (H, W, 3) is adjusted to a document image I of uniform size (H′, W′, 3);
wherein 3 in the third dimension of the image size refers to the R, G, B three channels of the image; if the photographed document image I₀ is a single-channel grayscale image, the 3 is changed into 1, or the grayscale channel is repeated three times.
In a specific embodiment, the predicting the reference control point based on the self-attention mechanism includes:
regarding each sequence unit as a token, and obtaining features that include global token-relation modeling by calculating self-attention or mutual attention between the respective tokens in the encoder and decoder; in the encoder section, several encoder blocks are stacked in cascade; in each encoder block only a self-attention mechanism is used, and query, key, and value vectors, denoted Q (Query), K (Key), and V (Value), are computed for the token of each image block;
performing further nonlinear transformation through the full connection layer to obtain the output of a single encoder block;
through cascade coding of a plurality of encoder blocks, K (Key) and V (Value) output by an encoder are input as one part of a decoder, and a randomly initialized learnable query is used as the other part of the decoder;
in each decoder block, firstly, obtaining and extracting relation features among the learnable queries by using a self-attention mechanism, secondly, solving mutual attention between K (Key) and V (Value) obtained in an encoder and the learnable queries extracted by the self-attention relation, and finally, similarly, obtaining single decoder output through a full-connection layer;
finally, the prediction of the reference sparse control point can be realized through the dimension transformation of a plurality of layers at the output head of the decoder.
In a specific embodiment, the extracting the keypoints based on deformation gradient includes:
the gradient of the 3D shape along the horizontal direction is obtained, and the part with complex bending mode is extracted according to the set threshold value and is set as the deformation key point.
In a specific embodiment, the extracting content keypoints based on adjacent text lines includes:
panoramic segmentation is carried out on the photographed document, and masks of different text lines on the curved image and the flat image are obtained;
meanwhile, a plurality of learnable queries are added to the input part of the network and used for calculating an affinity matrix with masks of different text lines, and finally the text lines are clustered into respective paragraph classes;
and extracting the central points of the obtained text lines and the masks of the section outline to form key points of the layout elements of the multi-level document.
In a specific embodiment, the aligning the keypoints into the 2D reference control point grid based on the obtained deformation keypoints and the content keypoints, and finally realizing multi-level control point fusion includes:
for the deformation key points, firstly projecting the sparse sampled deformation points mapped to the 3D space to the corresponding positions on the 2D reference sparse control point grids;
for the content key points, the content key points are directly mapped to the space of the reference sparse control points.
In a specific embodiment, the modifying the multi-level control point under the guidance of the local key point includes:
firstly, the deformation control points obtained under the guidance of the deformation key points are used to apply a first correction to the severely bent parts of the reference sparse control points;
and secondly, the content control points obtained under the guidance of the layout content key points are used to apply a second correction to the content parts of the reference sparse control points.
In a specific embodiment, the outputting the corrected document further includes:
based on the obtained multi-level control point set, firstly adopting thin plate spline interpolation to obtain preliminary backward mapping;
and then, performing bilinear interpolation to obtain reverse mapping with the same resolution as the original image, and then performing pixel-by-pixel sampling according to the reverse mapping by a pixel sampling method to output a final corrected flat document.
In a second aspect, the present application provides an electronic device comprising a processor and a memory; the memory stores therein a program loaded and executed by the processor to implement a shot document curvature correction method based on the key point guidance as set forth in any one of claims 1 to 8.
In a third aspect, the present application provides a computer-readable storage medium having stored therein a program for implementing a shot document curvature correction method based on key point guidance as claimed in any one of claims 1 to 8 when executed by a processor.
In summary, the beneficial effects of the present application at least include:
(1) Compared with the prior method for representing the curved document target by using only uniform control point mapping, the method introduces the learnable key point guidance for the first time to flexibly solve the problem of correcting the curved document. The learnable key points are not limited by the feature that the reference control points must be matched with the flat document, so that the parts, such as document layout elements, complex bending modes and the like, of the curved document, which cannot be accurately represented by the reference control points can be flexibly described.
(2) Compared with existing curved document correction methods based on document layout elements, the text line key point extraction proposed by the application is not limited to simply introducing a text line prediction module; it takes the multi-level document structure into account and forms a bottom-up clustering module from text lines to paragraphs. At the same time, such a module design makes it easier to introduce richer constraints, such as inter-line relationships and inter-paragraph relationships.
(3) Compared with traditional curved document correction methods that add a 3D shape prediction branch as an auxiliary task, the 3D shape here is not merely used to help the backbone network learn better features; instead, exploiting the fact that complex bending modes are relatively easy to identify from the 3D shape, deformation key points are extracted from it and mapped into the 2D sparse control point space. This design enables the model to adaptively sense how difficult the bending mode of a curved document is, reasonably allocate the density of the required control points, and save model prediction parameters.
By supplementing a learnable key point extraction module on the basis of a reference control point, a flexible joint solution is provided for multi-level correction and complex bending mode correction of a photographed document.
The foregoing description is only an overview of the present application, and is intended to provide a better understanding of the present application, as it is embodied in the following description, with reference to the preferred embodiments of the present application and the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a method for correcting a curvature of a photographed document based on key point guidance in an embodiment of the present application.
FIG. 2 is a block diagram of an electronic device for correction of curvature of a photographed document based on keypoint guidance in an embodiment of the application.
Detailed Description
The following describes in further detail the embodiments of the present application with reference to the drawings and examples. The following examples are illustrative of the application and are not intended to limit the scope of the application.
Optionally, the photographing document bending correction method provided by each embodiment of the present application is illustrated on an electronic device, where the electronic device is a terminal or a server; the terminal may be a mobile phone, a computer, a tablet computer, a scanner, an electronic eye, a monitoring camera, etc. The embodiment does not limit the type of the electronic device.
The application aims to fully consider the flexibility brought by the learnable key points in the correction of the bending of the photographed document, thereby improving the effect of correcting the bending of the photographed document and providing high-quality flat document pictures for downstream tasks of document analysis. Therefore, two parallel key point extraction modules are constructed in the network designed by the application and are respectively used for extracting the key points of the document layout elements and the key points of the complex modes. And then, the key points are used as guidance, so that the fusion of the control points is realized, the global correction of the reference control points and the local correction of the multi-level control points are sequentially completed, and finally, the corrected flat document is output through post-processed pixel sampling. The basic process of the method is shown in fig. 1, wherein the dotted line frame is the key part of the application: multi-level control point prediction based on key point guidance.
Step 101, a curved document picture is input.
In practice, a photographed document image I₀ of arbitrary size (H, W, 3) is first adjusted to a document image I of uniform size (H′, W′, 3), thereby facilitating the processing of the subsequent neural network model. Here H and W are the pixel height and width of the photographed document image I₀, and H′ and W′ are the pixel height and width of the uniform-size document image I. Note that 3 in the third dimension of the image size refers to the R, G, B three channels of the image. If the photographed document image I₀ is a single-channel grayscale image, the 3 is changed into 1, or the grayscale channel is repeated three times.
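A minimal sketch of this preprocessing step follows. The target size of 448×448 and the nearest-neighbour resampling are illustrative assumptions, not values fixed by the application; any interpolation method would serve.

```python
import numpy as np

def preprocess(img, size=(448, 448)):
    """Resize an arbitrary-size photographed document image to a uniform
    size and ensure three channels. Target size and nearest-neighbour
    resampling are assumptions made for this sketch."""
    img = np.asarray(img)
    if img.ndim == 2:                       # single-channel grayscale:
        img = np.stack([img] * 3, axis=-1)  # repeat it three times
    H, W = img.shape[:2]
    Hp, Wp = size
    rows = np.arange(Hp) * H // Hp          # nearest source row per output row
    cols = np.arange(Wp) * W // Wp          # nearest source column per output column
    return img[rows][:, cols]
```

Whatever the chosen interpolation, the output always has the fixed shape (H′, W′, 3) expected by the network.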
Step 102, predicting the reference control points.
In this step the application constructs a reference sparse control point prediction module, referencing the Vision Transformer (ViT) structure from the field of computer vision. First, an image slicer is used to slice the input image I and preliminarily encode it into sequence features suitable for ViT processing.
Specifically, the input three-dimensional tensor I is first cut into image patches of size p×p, preliminary feature extraction is performed with several convolutional neural network (CNN) layers, and a tensor of size (h, w, D) is obtained. The 2D length and width of the first two dimensions of this tensor are then flattened, giving a serialized representation of the curved document object of size (N, D), where N = h·w is the sequence length and D is the feature dimension.
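The slicing-and-flattening step can be sketched as follows. Raw-pixel flattening stands in for the CNN feature extractor here, and the patch size p = 32 is an assumption chosen only so that a 992×992 input yields the 961 tokens quoted in the worked example later in this description.

```python
import numpy as np

def serialize(img, p=32):
    """Slice an (H, W, C) image into p×p patches and flatten each patch
    into a token vector, giving an (N, p*p*C) sequence with N = h*w.
    The patent's CNN feature extraction is replaced by raw-pixel
    flattening for brevity."""
    H, W, C = img.shape
    h, w = H // p, W // p
    patches = img[:h * p, :w * p].reshape(h, p, w, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (h, w, p, p, C)
    return patches.reshape(h * w, p * p * C)     # flatten the 2D grid to a sequence
```

With p = 32 and a 992×992 RGB input this produces a (961, 3072) sequence; the real module would project each token down to the feature dimension D.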
After serialization, each sequence element is considered a token. Cosine position coding is then used to provide absolute position coding information for the serialized image features: sine and cosine functions give the coding values for the even and odd dimensions respectively, and the coding value is superimposed on each token, so that the network explicitly obtains the absolute position of each token. Further, features that include global token-relation modeling are obtained by calculating self-attention or mutual attention between the respective tokens in the encoder and decoder. Specifically, in the encoder section several encoder blocks are stacked in cascade; in each encoder block only a self-attention mechanism is used, and query, key, and value vectors Q (Query), K (Key), and V (Value) are computed for the token of each image block. The attention calculation follows the conventional form, namely Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where d_k is the dimension of the key vectors. A further nonlinear transformation is then performed through the fully connected layer, yielding the output of a single encoder block. After cascade encoding through several encoder blocks, the K (Key) and V (Value) output by the encoder are input as one part of the decoder, and a randomly initialized learnable query is used as the other part. The decoder likewise uses several cascaded decoder blocks: in each decoder block, a self-attention mechanism first extracts the relation features among the learnable queries; mutual attention is then computed between the K (Key) and V (Value) obtained from the encoder and the learnable queries extracted by self-attention; finally, as in the encoder, the single decoder output is obtained through a fully connected layer. At the output head of the decoder, the prediction of the reference sparse control points P_ref is realized through the dimension transformation of several layers.
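The scaled dot-product attention used in each encoder block can be sketched in a dependency-free form (shapes only; the real module adds multi-head projections and the fully connected layer):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q Kᵀ / sqrt(d_k)) V, the conventional scaled dot-product form."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows of w sum to 1
    return w @ V
```

In the encoder Q, K, and V all come from the same token sequence (self-attention); in the decoder's mutual attention, Q comes from the learnable queries while K and V come from the encoder output.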
Step 103, constructing a dense 3D shape prediction module to extract key points based on deformation gradients.
Since the bending mode is essentially a physical deformation derived from 3D space, the dense 3D shape prediction module employs a network of UNet structures for predicting dense document object 3D coordinate information.
Specifically, the predicted 3D space C has the same spatial scale as the model input I, but its channel dimension contains the coordinate information of the three dimensions x, y, and z. On this basis, the application obtains the gradient of the 3D shape along the horizontal direction and, according to a set threshold δ, extracts the parts in which the bending mode is more complex, setting them as deformation key points. The predicted dense 3D space is then sampled down to a sparse size with the same length and width as the reference control point grid, laying a foundation for the subsequent multi-level control point fusion.
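The gradient-threshold extraction can be sketched as below. A simple finite difference stands in for the network's gradient computation, and the threshold value 0.7 is an arbitrary choice inside the 0.5–0.85 range quoted in the example later in this description.

```python
import numpy as np

def deformation_keypoints(shape3d, delta=0.7):
    """shape3d: (H, W, 3) predicted XYZ coordinates. Pixels whose horizontal
    gradient magnitude exceeds delta are flagged as deformation key points."""
    grad = np.diff(shape3d, axis=1)       # finite difference along the horizontal axis
    mag = np.linalg.norm(grad, axis=-1)   # per-pixel gradient magnitude over x, y, z
    ys, xs = np.nonzero(mag > delta)      # locations of complex bending
    return list(zip(ys.tolist(), xs.tolist()))
```

A sharp fold shows up as a column of large horizontal differences, so every row flags the fold location as a key point.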
Step 104, constructing a bottom-up text line clustering module to extract content key points based on adjacent text lines.
This step aims to explicitly achieve the perception of document layout elements. The bottom-up text line clustering module uses MaX-DeepLab as the backbone network to perform panoramic segmentation on the photographed document, obtaining masks of the different text lines on the curved image and the flat image. Meanwhile, several learnable queries are added to the input part of the network and used to compute an affinity matrix with the masks of the different text lines, realizing clustering among adjacent text lines and finally grouping the text lines into their respective paragraph classes. On this basis, the center points of the obtained text line masks and the paragraph outlines are extracted to form the key points of the multi-level document layout elements. These key points contain two levels of information: the inner level is obtained from the text line mask center points, and the outer level is sampled from the paragraph outlines. The key points of different paragraphs are mutually independent, laying a foundation for subsequent correction based on layout elements.
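A toy stand-in for the affinity-based grouping: here the learnable-query affinity matrix is simply given, and plain union-find performs the bottom-up merging of text lines into paragraph classes; the threshold of 0.5 is an assumption for this sketch.

```python
def cluster_lines(affinity, threshold=0.5):
    """Greedy bottom-up grouping: line i joins line j's paragraph when
    affinity[i][j] > threshold. Union-find replaces the patent's learned
    clustering for illustration."""
    n = len(affinity)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two paragraphs

    labels = [find(i) for i in range(n)]
    remap = {}                             # renumber paragraph ids consecutively
    return [remap.setdefault(l, len(remap)) for l in labels]
```

For three lines where only the first two are mutually affine, this returns paragraph labels such as [0, 0, 1].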
Optionally, because the bottom-up text line clustering module produces multi-level prediction results, the predicted text lines can also be used to form richer constraints from inter-line and inter-paragraph relationships, helping text line prediction and paragraph relative-position prediction obtain more accurate results.
Step 105, realizing multi-level control point fusion.
Based on the deformation key points and the content key points obtained in the step 103 and the step 104, aligning the key points into a 2D reference control point grid, and finally realizing multi-level control point fusion.
For the deformation key points, the sparsely sampled deformation points in the 3D space obtained in step 103 are first projected to the corresponding positions on the 2D reference sparse control point grid, and the control point set at these positions is recorded as P_d ∈ R^(N_d × 2), where N_d represents the number of deformation key points and 2 represents the XY coordinate space domain.
For the content key points, since they already lie in 2D space, they can be directly mapped to the space of the reference sparse control points, and the control point set at these positions is recorded as P_c ∈ R^(N_c × 4), where N_c represents the number of content key points and 4 represents the paired XY coordinates on the curved and flat documents. A reference control point space in which the deformation key points and the content key points are superimposed is thus obtained.
In the final correction process, the correspondence of paired control points is still relied upon. Therefore, the three types of control points P_ref, P_d, and P_c are fused at multiple levels along three dimensions. Specifically, in the first dimension, P_ref remains unchanged. In the second dimension, if the original sparse reference control points were used exclusively, severe bending occurring between the sparse points, or small bending modes, could not be predicted well; therefore, for each point in the point set P_d, the region bounded by the four nearest reference control points is extracted and divided into a finer control point grid, realizing fine control point prediction for locally complex bending modes. For the third dimension, guided by P_c, the paired regions of each paragraph on the curved document and the flat document are extracted, and additional grid division is carried out according to the key points on the text lines, forming a new paired control point relation.
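Locating the four nearest reference control points around a deformation key point, the first step of the second-dimension fusion, can be sketched as follows; the key point coordinates are assumed normalized to [0, 1], and the grid size s is a parameter (e.g., 31 for a 31×31 grid).

```python
def enclosing_cell(kp, s):
    """Map a key point (x, y) with coordinates in [0, 1] onto the s×s
    reference control point grid and return the four surrounding grid
    points as (row, col) indices."""
    x, y = kp
    i = min(int(y * (s - 1)), s - 2)  # top row of the cell, clamped at the edge
    j = min(int(x * (s - 1)), s - 2)  # left column of the cell, clamped at the edge
    return [(i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1)]
```

The region spanned by these four grid points is then subdivided into the finer control point grid described above.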
Step 106, carrying out multi-level control point correction under the guidance of local key points.
Based on the multi-level control point information provided in step 105, the reference sparse control points P_ref have already been obtained; the key of this step is to map the multi-level information of step 105 uniformly onto the same dimension.
First, the deformation control points derived under the guidance of the deformation key points P_d are used to apply a first correction to the more severely bent parts of P_ref (i.e., the regions whose deformation gradient exceeds the threshold δ); the first correction means stacking the finer control point grids divided in step 105 onto the sparse P_ref. Second, the content control points derived under the guidance of the layout content key points P_c are used to apply a second correction to the content parts of P_ref; the second correction means dividing additional grids according to the text line key points obtained in step 105 and mapping these new grids onto P_ref. Through these two corrections, control point prediction for the curved document with explicit content and complex-deformation sensing capability is realized.
Step 107, processing the multi-level control points and outputting the corrected document.
Based on the multi-level control point set obtained in step 106, a preliminary backward map is first obtained by thin plate spline (TPS) interpolation, where the channel dimension 2 represents the XY coordinate space. Bilinear interpolation is then applied to obtain the backward map at full resolution. Finally, according to this backward map, pixels are sampled one by one with a pixel sampling method, and the final corrected flat document is output.
The following is an example:
A photographed curved document picture of a specific size is input, and the picture is adjusted to a document image of uniform size.
In the reference sparse control point prediction module, the image is first sliced: it is divided into 31×31 image blocks, features are extracted through a convolutional neural network, and the features are then serialized into a tensor. The sequence comprises 961 tokens, the dimension of each token being 512. Absolute position information is added to each token by cosine position coding, with the size remaining unchanged. Next, the 961 tokens are input into a Vision Transformer for global relational modeling. This model contains six cascaded encoders and six cascaded decoders, and a learnable query is used as the query input of the decoders. The features obtained from the decoder, which contain global dependency information between the tokens, retain the same size. Finally, a two-layer convolutional network changes the dimension from 512 to 2, and the sparse reference control points are obtained.
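The cosine position coding mentioned above is not spelled out in the text; assuming it is the standard sinusoidal absolute position encoding (sine on even channels, cosine on odd channels, geometrically increasing wavelengths), it can be sketched as follows for the stated sizes of 961 tokens and dimension 512:

```python
import math

def cosine_position_encoding(n_tokens=961, dim=512):
    """Standard sinusoidal absolute position encoding (an assumption:
    the application names 'cosine position coding' without a formula).
    Even channels carry sin(pos / 10000^(i/dim)), odd channels the
    matching cosine, so each position gets a unique 512-d code."""
    pe = [[0.0] * dim for _ in range(n_tokens)]
    for pos in range(n_tokens):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = cosine_position_encoding()
```

The encoding is added element-wise to each token, so the sequence keeps its (961, 512) shape, as the text states.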
In the 3D shape prediction module, a UNet network structure is adopted, where the input channel dimension 3 represents the RGB pixel space and the output dimension 3 represents the XYZ coordinate space. The deformation key points are then extracted: the local deformation gradient is computed and a threshold, approximately in the range 0.5 to 0.85, is set; parts above the threshold are considered to have more complex bending modes and are determined to be deformation key points. Finally, the predicted 3D shape space is sparsely sampled and projected from 3D to 2D, yielding a point set whose first dimension is the number of deformation key points and whose second dimension, 2, represents the XY coordinate space.
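The gradient-thresholding step can be sketched on a toy depth map. This is a minimal illustration under assumptions: the depth values are invented, the gradient is a simple forward difference along the horizontal direction, and 0.6 is just one value inside the stated 0.5–0.85 range.

```python
# Hypothetical sketch: mark positions whose horizontal depth gradient
# exceeds a threshold as deformation keypoints. `depth` is a flat,
# row-major list of Z values; values and threshold are illustrative.

def deformation_keypoints(depth, width, threshold=0.6):
    """Return (x, y) positions where |Z(x+1, y) - Z(x, y)| > threshold,
    i.e. where the bending mode is considered complex."""
    keypoints = []
    height = len(depth) // width
    for y in range(height):
        for x in range(width - 1):
            g = abs(depth[y * width + x + 1] - depth[y * width + x])
            if g > threshold:
                keypoints.append((x, y))
    return keypoints

# A 2x4 toy depth map with one sharp crease in the first row.
kps = deformation_keypoints([0.0, 0.0, 1.0, 1.0,
                             0.0, 0.0, 0.0, 0.0], width=4)
```

Only the crease between columns 1 and 2 of the first row crosses the threshold, so a single deformation keypoint is produced there.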
In the text line clustering module, a panorama segmentation network based on MaX-DeepLab is adopted as a whole. The output is likewise a point set, whose first dimension is the number of content key points and whose second dimension, 4, represents the curved and flat paired XY coordinate spaces. The text lines are output in the form of masks; to extract the content key points, the horizontal midline of each mask is taken and key points are sampled on the midline. For text line or paragraph outline key points, the application samples uniformly from left to right at intervals of approximately 10 to 15 pixels.
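The midline sampling can be sketched as follows. For simplicity the text-line mask is approximated here by its bounding box (a hypothetical simplification; the application works on the actual mask), and the stride of 12 pixels is one value inside the stated 10–15 pixel range.

```python
# Hypothetical sketch: sample content keypoints along the horizontal
# midline of a text-line mask, approximated by its bounding box
# (x0, y0, x1, y1). Stride and box coordinates are illustrative.

def sample_midline(mask_bbox, stride=12):
    """Sample (x, mid_y) keypoints left to right every `stride` pixels
    along the horizontal midline of the box."""
    x0, y0, x1, y1 = mask_bbox
    mid_y = (y0 + y1) / 2.0
    return [(x, mid_y) for x in range(x0, x1 + 1, stride)]

pts = sample_midline((0, 10, 36, 14))
```

A 37-pixel-wide line sampled every 12 pixels yields four keypoints on the midline y = 12.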
In the control point fusion part, for each obtained deformation key point, its four nearest reference control points are first determined, followed by finer sampling within that reference grid region, thereby obtaining locally refined control points for solving locally complex bending modes such as wrinkles. The content key points obtained through step 104 are already in the form of control point pairs and can thus be mapped directly onto the reference control points. Since the sampled text key points can be used directly, a new grid can be formed by connecting text key points that are aligned horizontally or vertically, and no densified sampling is needed as for the deformation key points.
In the multi-level control point correction part, corrections are performed according to the provided multi-level control point information. First, the refined grid obtained from the deformation key points is used to apply a first correction to the deformation control points. In the first correction, the finer control-point grid cut out locally is directly superimposed onto the globally uniform and sparse reference control points. If the original reference grid has only 16 control points and 25 refined deformation control points are added, a composite set of 16 + 25 = 41 control points is obtained on the same 2D plane. Second, the content control points obtained under the guidance of the layout content key points are used to apply a second correction. Suppose the paired coordinates of the key points of a certain text line are [(4, 6, 1, 6), (16, 6, 11, 6), (22, 7, 21, 6)]; these control points can then be directly superimposed onto the composite control-point space. This correction adds 3 control points, consistent with the number of key points of the text line.
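Each 4-tuple above pairs a point on the curved image with its counterpart on the flat image (the text says the dimension 4 "represents the curved and flat paired XY coordinate space"). A minimal sketch of splitting these tuples into the two control-point sides, assuming the ordering (curved_x, curved_y, flat_x, flat_y), which the application does not state explicitly:

```python
# Hypothetical sketch: split paired text-line keypoints into the
# curved-side and flat-side control points that get superimposed onto
# the composite control-point set. The tuple ordering is an assumption.

def split_pairs(paired):
    """Split (curved_x, curved_y, flat_x, flat_y) tuples into two
    aligned lists of 2D control points."""
    curved = [(cx, cy) for cx, cy, fx, fy in paired]
    flat = [(fx, fy) for cx, cy, fx, fy in paired]
    return curved, flat

pairs = [(4, 6, 1, 6), (16, 6, 11, 6), (22, 7, 21, 6)]
curved, flat = split_pairs(pairs)
```

The three pairs yield three source/target control-point correspondences, matching the 3 control points this correction adds.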
Finally, the control points of the three parts are superimposed and input uniformly into the post-processing module, where thin plate spline (TPS) interpolation is performed to achieve preliminary upsampling of the backward map. Bilinear interpolation is then applied for further upsampling, giving the backward mapping tensor B with the same resolution as the original image. For example, if the original photographed document size is (1600, 900, 3), the high-resolution backward map obtained by bilinear interpolation should be (1600, 900, 2). Finally, according to the obtained backward mapping B, the grid_sample function in the PyTorch framework is called to sample the input image pixel by pixel, thereby outputting the final corrected flat document.
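The bilinear upsampling step can be sketched on one channel of the coarse backward map. This is a minimal dependency-free illustration of the interpolation itself (in practice PyTorch's interpolate/grid_sample would be used); the 2×2 grid values are invented for the example.

```python
# Hypothetical sketch: bilinearly interpolate one channel of the coarse
# backward map at fractional coordinates (u, v). `grid` is a list of
# rows; values are illustrative, not from the application.

def bilinear_sample(grid, u, v):
    """Interpolate grid at (u, v) from its four surrounding cells,
    clamping at the border (a simple edge-padding choice)."""
    x0, y0 = int(u), int(v)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    y1 = min(y0 + 1, len(grid) - 1)
    fx, fy = u - x0, v - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bot = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bot * fy

val = bilinear_sample([[0.0, 2.0],
                       [4.0, 6.0]], 0.5, 0.5)
```

Sampling at the cell center blends all four corners equally, giving 3.0 here; evaluating such a map densely is what produces the (1600, 900, 2) high-resolution backward map in the example.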
In summary, the application introduces the concept of key points to overcome two shortcomings of the existing paradigm. In the 2D deformation control point mapping learned by the existing paradigm, the control points on the flat document paired with the curved document are distributed as a uniform reference grid in planar space, which prevents the model from processing the layout elements or locally complex bending modes of the document in a targeted manner. The application therefore supplements the reference control points with a learnable key point extraction module, providing a flexible joint solution for multi-level correction and complex bending mode correction of photographed documents. Based on these two improvements, the application makes full use of the flexible modeling of document layout elements and locally complex bending modes by learnable key points, improves the existing control point mapping regression paradigm for curved documents, achieves a better curvature correction effect for photographed documents, and provides high-quality document images for downstream document analysis algorithms, such as document recognition.
Compared with prior methods that represent the curved document target using only a uniform control point mapping, the application introduces learnable key point guidance for the first time to flexibly solve the curved document correction problem. The learnable key points are not constrained by the requirement that reference control points must match the flat document, so they can flexibly describe the parts of the curved document that the reference control points cannot accurately represent, such as document layout elements and complex bending modes. Compared with existing curved document correction methods based on document layout elements, the text line key point extraction method provided by the application is not limited to simply introducing a text line prediction module; instead, it takes the multi-level document structure into account and forms a bottom-up clustering module from text lines to paragraphs. At the same time, such a module design makes it easier to introduce richer constraints, such as inter-line and inter-paragraph relations. Compared with traditional curved document correction methods that add a 3D shape prediction branch as an auxiliary task, the 3D shape here is not used merely as an auxiliary task to help the backbone network learn better features; rather, exploiting the fact that complex bending modes are relatively easy to obtain from the 3D shape, deformation key points are extracted from it and mapped into the 2D sparse control point space. This design enables the model to adaptively perceive how difficult the bending mode of the curved document is, reasonably allocate the density of the required control points, and save the number of predicted model parameters.
For relevant details reference is made to the method embodiments described above.
Fig. 2 is a block diagram of an electronic device provided in one embodiment of the application. The device comprises at least a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 401 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor; the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 401 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the keypoint-based guide photographed document warp correction method provided by the method embodiments of the present application.
In some embodiments, the electronic device may further optionally include: a peripheral interface and at least one peripheral. The processor 401, memory 402, and peripheral interfaces may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface via buses, signal lines or circuit boards. Illustratively, peripheral devices include, but are not limited to: radio frequency circuitry, touch display screens, audio circuitry, and power supplies, among others.
Of course, the electronic device may also include fewer or more components, as the present embodiment is not limited in this regard.
Optionally, the present application further provides a computer readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the method for correcting bending of a photographed document based on key point guidance according to the above method embodiment.
Optionally, the present application further provides a computer product, where the computer product includes a computer readable storage medium, where a program is stored, and the program is loaded and executed by a processor to implement the method for correcting bending of a photographed document based on key point guidance according to the above method embodiment.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method for correcting a curvature of a photographed document based on key point guidance, the method comprising:
inputting a curved document picture;
constructing a reference sparse control point prediction module based on the ViT structure, and preliminarily encoding an input image into sequence features suitable for ViT processing by using an image slicer; providing absolute position coding information for the serialized image features by using cosine position coding after serialization, and finally realizing prediction of a reference control point based on a self-attention mechanism;
constructing a dense 3D shape prediction module based on the UNet structure, and extracting key points based on deformation gradients;
constructing a bottom-up text line clustering module based on a MaX-DeepLab network, and extracting content key points based on adjacent text lines;
based on the obtained deformation key points and content key points, aligning the key points into a 2D reference control point grid, and finally realizing multi-level control point fusion;
and correcting the multi-level control points under the guidance of the local key points, and outputting corrected documents.
2. The method for correcting curvature of a photographed document based on key point guidance according to claim 1, wherein the inputting the curved document picture comprises:
photographing a document image of arbitrary size (H, W, 3) and adjusting it to a document image of uniform size (H', W', 3), which is recorded as the input image; wherein H is the pixel height of the photographed document image, W is the pixel width of the photographed document image, H' is the pixel height of the uniform-size document image, and W' is the pixel width of the uniform-size document image;
wherein the 3 in the third dimension of the image size refers to the three channels R, G and B of the image; if the photographed document image is a single-channel gray-scale image, the 3 is changed to 1, or the gray-scale image is repeated three times.
3. The method for correcting curvature of a photographed document based on key point guidance according to claim 1, wherein the predicting of the reference control point based on the self-attention mechanism is finally achieved includes:
regarding the sequence units as tokens, and obtaining features comprising global token relation modeling by calculating self-attention or mutual attention among the tokens in an encoder and a decoder; in the encoder section, several encoder blocks are stacked in cascade, and in each encoder block only the self-attention mechanism is used, with Q, K and V computed respectively for the token of each image block;
performing further nonlinear transformation through the full connection layer to obtain the output of a single encoder block;
through cascade coding of a plurality of encoder blocks, K and V output by the encoder are input as one part of the decoder, and a randomly initialized learnable query is used as the other part of the decoder;
in each decoder block, a self-attention mechanism is first used to extract relation features among the learnable queries; next, mutual attention is computed between the K and V obtained in the encoder and the learnable queries refined by self-attention; finally, as in the encoder, the output of a single decoder block is obtained through a fully connected layer;
finally, the prediction of the reference sparse control point can be realized through the dimension transformation of a plurality of layers at the output head of the decoder.
4. The method for correcting curvature of a photographed document based on key point guidance according to claim 1, wherein the extracting key points based on deformation gradient comprises:
the gradient of the 3D shape along the horizontal direction is obtained, and the part with complex bending mode is extracted according to the set threshold value and is set as the deformation key point.
5. The keypoint-guide-based shot document warp correction method as claimed in claim 1, wherein said extracting content keypoints based on adjacent text lines comprises:
panoramic segmentation is carried out on the photographed document, and masks of different text lines on the curved image and the flat image are obtained;
meanwhile, a plurality of learnable queries are added to the input part of the network and used for calculating an affinity matrix with masks of different text lines, and finally the text lines are clustered into respective paragraph classes;
and extracting the central points of the obtained text lines and the masks of the section outline to form key points of the layout elements of the multi-level document.
6. The method for correcting curvature of photographed documents based on key point guidance according to claim 1, wherein aligning the key points to a 2D reference control point grid based on the obtained deformed key points and content key points, and finally realizing multi-level control point fusion comprises:
for the deformation key points, firstly projecting the sparse sampled deformation points mapped to the 3D space to the corresponding positions on the 2D reference sparse control point grids;
for the content key points, the content key points are directly mapped to the space of the reference sparse control points.
7. The method for correcting curvature of a photographed document based on key point guidance according to claim 1, wherein correcting the multi-level control point under the local key point guidance comprises:
the deformation control points obtained by guiding are utilized to carry out first correction on the parts with serious bending modes on the reference sparse control points;
and secondly, performing secondary correction on the content part on the reference sparse control point by using the content control point obtained by the guidance of the layout content key point.
8. The method for correcting curvature of a photographed document based on key point guidance according to claim 1, wherein the outputting the corrected document is preceded by:
based on the obtained multi-level control point set, firstly adopting thin plate spline interpolation to obtain preliminary backward mapping;
and then, performing bilinear interpolation to obtain reverse mapping with the same resolution as the original image, and then performing pixel-by-pixel sampling according to the reverse mapping by a pixel sampling method to output a final corrected flat document.
9. An electronic device comprising a processor and a memory; the memory stores therein a program loaded and executed by the processor to implement a shot document curvature correction method based on the key point guidance as set forth in any one of claims 1 to 8.
10. A computer-readable storage medium, wherein a program is stored in the storage medium, which when executed by a processor is configured to implement a method for correcting curvature of a photographed document based on key point guidance as set forth in any one of claims 1 to 8.
CN202311017033.2A 2023-08-14 2023-08-14 Photographing document bending correction method and device based on key point guidance Active CN116740720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311017033.2A CN116740720B (en) 2023-08-14 2023-08-14 Photographing document bending correction method and device based on key point guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311017033.2A CN116740720B (en) 2023-08-14 2023-08-14 Photographing document bending correction method and device based on key point guidance

Publications (2)

Publication Number Publication Date
CN116740720A true CN116740720A (en) 2023-09-12
CN116740720B CN116740720B (en) 2023-10-27

Family

ID=87911822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311017033.2A Active CN116740720B (en) 2023-08-14 2023-08-14 Photographing document bending correction method and device based on key point guidance

Country Status (1)

Country Link
CN (1) CN116740720B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767270A (en) * 2021-01-19 2021-05-07 中国科学技术大学 Fold document image correction system
CN113822276A (en) * 2021-09-30 2021-12-21 中国平安人寿保险股份有限公司 Image correction method, device, equipment and medium based on neural network
CN114202648A (en) * 2021-12-08 2022-03-18 北京百度网讯科技有限公司 Text image correction method, training method, device, electronic device and medium
CN115082935A (en) * 2022-07-04 2022-09-20 网易有道信息技术(北京)有限公司 Method, apparatus and storage medium for correcting document image


Also Published As

Publication number Publication date
CN116740720B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN106780512B (en) Method, application and computing device for segmenting image
CN109829930B (en) Face image processing method and device, computer equipment and readable storage medium
CN112560980B (en) Training method and device of target detection model and terminal equipment
CN112819947A (en) Three-dimensional face reconstruction method and device, electronic equipment and storage medium
CN111161269B (en) Image segmentation method, computer device, and readable storage medium
KR20120021149A (en) Image correcting apparatus, correction image generating method, correction table generating apparatus, correction table generating method, computer readable recording medium storing a correction table generating program, computer readable recording medium storing a correction image generating program
CN113674146A (en) Image super-resolution
CN112233125A (en) Image segmentation method and device, electronic equipment and computer readable storage medium
CN111862124A (en) Image processing method, device, equipment and computer readable storage medium
WO2023202283A1 (en) Image generation model training method and apparatus, image generation method and apparatus, and device
CN113256529A (en) Image processing method, image processing device, computer equipment and storage medium
CN111612068A (en) Image annotation method and device, computer equipment and storage medium
CN114511449A (en) Image enhancement method, device and computer readable storage medium
CN114298900A (en) Image super-resolution method and electronic equipment
CN115082935A (en) Method, apparatus and storage medium for correcting document image
CN112597940B (en) Certificate image recognition method and device and storage medium
CN110222741A (en) Prediction technique, model, device, equipment and the storage medium of medical image
CN116740720B (en) Photographing document bending correction method and device based on key point guidance
Huang et al. Fast hole filling for view synthesis in free viewpoint video
CN108241861A (en) A kind of data visualization method and equipment
CN112348008A (en) Certificate information identification method and device, terminal equipment and storage medium
Jin et al. Light field reconstruction via deep adaptive fusion of hybrid lenses
US11145037B1 (en) Book scanning using machine-trained model
CN113298702B (en) Reordering and segmentation method based on large-size image pixel points
CN111583168A (en) Image synthesis method, image synthesis device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant