CN116912865A - Form image recognition method, device, equipment and medium - Google Patents

Form image recognition method, device, equipment and medium

Info

Publication number
CN116912865A
Authority
CN
China
Prior art keywords
character
determining
position information
semantic segmentation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211528361.4A
Other languages
Chinese (zh)
Inventor
郑婕
张晓川
张湛梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Guangdong Co Ltd
Priority to CN202211528361.4A
Publication of CN116912865A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/413: Classification of content, e.g. text, photographs or tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/15: Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/16: Image preprocessing
    • G06V30/162: Quantising the image signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Input (AREA)

Abstract

The application provides a form image recognition method, apparatus, device and medium. The method comprises the following steps: performing table-line pixel labeling on a form image to be recognized to obtain a semantic segmentation binarization map; sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines; determining the start and end rows of each cell based on the position coordinates of the table grid lines; performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character; and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file. The application addresses the insufficient accuracy of form detection and localization in the prior art and improves that accuracy.

Description

Form image recognition method, device, equipment and medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a medium for recognizing a form image.
Background
Forms are a common means of presenting formatted data and are widely used in fields such as scientific research and data analysis. As the level of informatization rises, locating and extracting desired content from non-editable forms, such as paper documents, has become a pain point in need of a solution.
Among approaches to making such forms editable, the most widely applied pipeline consists of form detection, form cell detection and form structure recognition. Form detection locates the region of a file, such as an image, in which the form lies; form cell detection identifies all cells appearing in the form, including merged cells; form structure recognition parses the form region, extracts the text content and structural information in the form, and obtains the row-column distribution and the logical structure between cells, which is also referred to as form document reconstruction.
As a general data organization structure, forms follow no fixed specification, so their styles vary widely, which greatly increases the difficulty of the recognition task. In addition, interference from fills of different colors, different text types in each column, different image types (screenshots, PDFs, photographs) and other factors has long made form recognition a research difficulty in the field of document recognition.
The first type of recognition method is traditional image processing: table lines are extracted by means of image morphological transformation, texture extraction, edge detection and the like, and row, column and cell information is then derived from those lines. However, this method depends too heavily on conventional image processing algorithms, whose thresholds must be tuned per dataset, so a parameter set that is valid for one batch of data is invalid for another; it is therefore not robust and is inaccurate for tables without visible lines.
The second type of recognition method derives row, column and cell information from the position information of OCR text detection boxes and generates a spreadsheet from it. However, this method relies heavily on OCR detection results and manually designed rules, and its form detection and localization are not accurate enough.
The third type of method is end-to-end neural network learning, which uses image-to-text-sequence techniques to convert the form image into a structured language such as HTML tags. However, this approach requires that the output sequence not be too long, so it performs poorly on forms with dense cells, and its form detection and localization are likewise not accurate enough.
Therefore, there is a need for a form image recognition method that improves the accuracy of form detection and localization.
Disclosure of Invention
The application provides a form image recognition method, apparatus, device and medium, to address the insufficient accuracy of form detection and localization in the prior art and to improve that accuracy.
The application provides a form image recognition method, comprising the following steps:
performing table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map;
sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines;
determining the start and end rows of each cell based on the position coordinates of the table grid lines;
performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character;
and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
According to the form image recognition method provided by the application, performing table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines comprises:
extracting connected regions from the semantic segmentation binarization map;
fitting frame lines to the connected regions to determine the table grid lines to be recognized;
and determining the position coordinates of the table grid lines based on the minimum circumscribed rectangle corresponding to each table line to be recognized.
According to the form image recognition method provided by the application, performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character, comprises:
inputting the form image to be recognized into a convolutional recurrent neural network to obtain a probability matrix of sequence-vector length by number of character classes output by the network;
using the probability matrix to determine character segmentation positions and thereby obtain the character position information; if a character is a Chinese character, determining the index of a first blank sequence vector as the start index of the Chinese character, the start index being used to determine the start coordinate of the Chinese character; if a character is a character other than a Chinese character, determining the index of a second blank sequence vector as the start index of that character, that start index being used to determine its start coordinate; the start coordinates are used to determine the character segmentation positions.
According to the form image recognition method provided by the application, aggregating the cells with their corresponding text content based on the start and end rows and the character position information to generate the table file comprises:
inputting the start and end rows and the character position information into an aggregation model, and aggregating the character boxes with the cell boxes to generate the table file;
the aggregation model is used to match text content with its corresponding cell, and the algorithms applied by the aggregation model include a center rule and a maximum-IOU rule.
According to the form image recognition method provided by the application, performing image correction on the semantic segmentation binarization map comprises:
inputting the semantic segmentation binarization map into a rotated-object detection model with a corner regression algorithm, and determining the table region and table corner points of the semantic segmentation binarization map;
the table region and the table corner points are used to perform image correction on the semantic segmentation binarization map.
According to the form image recognition method provided by the application, the rotated-object detection model is an improved network based on the YOLOv5 network, whose structure adds an angle classification branch and a keypoint regression branch to the YOLOv5 structure.
The application also provides a form image recognition apparatus, comprising:
a semantic segmentation module, configured to perform table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map;
a table-line position determining module, configured to sequentially perform image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines;
a cell determining module, configured to determine the start and end rows of each cell based on the position coordinates of the table grid lines;
a character segmentation module, configured to perform single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character;
and an aggregation module, configured to aggregate the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
The application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the form image recognition methods described above when executing the program.
The application also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing any of the form image recognition methods described above.
The application also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the form image recognition methods described above.
According to the form image recognition method, apparatus, device and medium provided by the application, the image to be recognized is semantically segmented to obtain a semantic segmentation binarization map; image correction, table-line extraction and cell position determination are completed on the basis of that map; table structure recognition and recognition of the text inside the table are then performed; and finally the text content is aggregated with the cells to obtain a table file. Because table structure recognition and text position recognition are performed separately after semantic segmentation, the accuracy of form detection and localization is improved; moreover, efficient conversion from form image to table file is achieved, which tangibly improves users' office efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the application or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the application; for a person skilled in the art, other drawings can be obtained from them without inventive effort.
FIG. 1 is a first schematic flowchart of the form image recognition method provided by the application;
FIG. 2 is a second schematic flowchart of the form image recognition method provided by the application;
FIG. 3 is a third schematic flowchart of the form image recognition method provided by the application;
FIG. 4 is a fourth schematic flowchart of the form image recognition method provided by the application;
FIG. 5 is a fifth schematic flowchart of the form image recognition method provided by the application;
FIG. 6 is a schematic structural diagram of the electronic device provided by the application.
Detailed Description
To make the objects, technical solutions and advantages of the application clearer, the technical solutions of the application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously some, not all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
The form image recognition method of the application is described below with reference to FIG. 1 to FIG. 5.
Referring to FIG. 1 and FIG. 2, the form image recognition method according to the application includes:
Step 10, performing table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map.
In the embodiment of the application, the form image to be recognized is first obtained. It serves as the input to form image recognition and supports common picture formats such as PNG, JPG and JPEG as well as PDF documents.
It should be noted that a table detection model is built to complete the table detection. Specifically, the form image to be recognized is input into the table detection model, which performs table-line pixel labeling on it, completing the table detection and producing the semantic segmentation binarization map, in which the table grid lines are marked at pixel level.
Step 20, sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines.
A semantic segmentation model is built, and geometric analysis is performed on its segmentation result, namely the semantic segmentation binarization map, to realize image correction, table-line extraction and determination of the cell position coordinates; the geometric analysis comprises image correction, table-line extraction and table-line position determination. Specifically, the table lines in the semantic segmentation binarization map are located from their pixel marks, and the image is then corrected. Because captured form images are often skewed or trapezoidal due to perspective distortion, extracting the table lines and determining their positions after correcting the image improves the recognition effect. Table-line extraction and table-line position determination are then performed on the corrected pixel marks to determine the position coordinates of the table grid lines, which represent the positions of the table lines.
To achieve efficient and accurate semantic segmentation, an ErfNet-style real-time semantic segmentation model can be adopted. The backbone design borrows from MobileNet, replacing conventional convolutions with depthwise + pointwise convolutions. Grid lines are elongated objects, so larger receptive fields in the horizontal and vertical directions are beneficial; convolution kernels of shape 5x1 and 1x5 are therefore selected, which perform better in practice than the common 3x3. In addition, labels in a table-line segmentation scene are not mutually exclusive, so sigmoid is used instead of softmax to realize multi-label classification and obtain the semantic segmentation binarization map. The sigmoid function saturates smoothly toward positive and negative infinity and outputs values between 0 and 1, which makes it suitable for multi-label classification.
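For illustration, a minimal PyTorch sketch of the factorized-convolution idea described above is given below: depthwise 5x1 and 1x5 kernels enlarge the vertical and horizontal receptive fields for elongated grid lines, a pointwise convolution mixes channels, and a sigmoid head produces per-pixel multi-label scores. The module names, channel sizes and two-class output are assumptions of this sketch, not the application's actual network.

```python
# A minimal sketch, assuming illustrative channel sizes and module names.
import torch
import torch.nn as nn

class FactorizedLineBlock(nn.Module):
    """Depthwise 5x1 + 1x5 convolutions followed by a pointwise projection."""
    def __init__(self, channels: int):
        super().__init__()
        self.vert = nn.Conv2d(channels, channels, kernel_size=(5, 1),
                              padding=(2, 0), groups=channels)   # depthwise, vertical
        self.horz = nn.Conv2d(channels, channels, kernel_size=(1, 5),
                              padding=(0, 2), groups=channels)   # depthwise, horizontal
        self.point = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise channel mix
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.point(self.horz(self.vert(x))))

class LineSegHead(nn.Module):
    """Projects features to N line classes; sigmoid allows non-exclusive labels."""
    def __init__(self, channels: int, num_classes: int = 2):
        super().__init__()
        self.block = FactorizedLineBlock(channels)
        self.cls = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        # Per-pixel, per-class probabilities in (0, 1); threshold (e.g. 0.5)
        # to obtain the binarized segmentation map used downstream.
        return torch.sigmoid(self.cls(self.block(x)))

# Usage: feats = torch.randn(1, 64, 128, 128); probs = LineSegHead(64)(feats)
```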
In view of the imbalance between pixel classes, the loss function combines weighted cross entropy, Dice loss and Focal loss with weight ratios of 1.0, 3.0 and 5.0 respectively. To help the model converge faster or to a better optimum, the learning rate of the decoding head is set to 10 times that of the backbone network. In addition, online hard example mining (OHEM) is applied: only pixels whose confidence score is below 0.7 are used for training. To increase the generalization capability of the model, the training data are augmented with random flipping, rotation, photometric distortion and geometric distortion.
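A hedged sketch of this composite loss follows: weighted cross entropy, Dice loss and Focal loss combined at ratios 1.0 : 3.0 : 5.0, with the OHEM rule that only pixels whose true-label confidence falls below 0.7 contribute. The exact formulations are not disclosed by the application, so this binary per-pixel version is an assumption.

```python
# A minimal sketch, assuming sigmoid-activated predictions and binary targets.
import torch

def composite_loss(probs: torch.Tensor, target: torch.Tensor,
                   ohem_thresh: float = 0.7, gamma: float = 2.0,
                   eps: float = 1e-6) -> torch.Tensor:
    """probs, target: (N, C, H, W); probs already passed through sigmoid."""
    # Confidence assigned to the true label of each pixel.
    p_true = probs * target + (1 - probs) * (1 - target)
    # OHEM: only hard pixels (true-label confidence below the threshold) train.
    hard = (p_true < ohem_thresh).float()
    n_hard = hard.sum().clamp_min(1.0)

    bce = -torch.log(p_true.clamp_min(eps))      # cross entropy per pixel
    focal = (1 - p_true) ** gamma * bce          # focal modulation of hard pixels
    inter = (probs * target * hard).sum()
    dice = 1 - (2 * inter + eps) / ((probs * hard).sum() + (target * hard).sum() + eps)

    # Weight ratio 1.0 : 3.0 : 5.0 as described in the text.
    return (1.0 * (bce * hard).sum() / n_hard
            + 3.0 * dice
            + 5.0 * (focal * hard).sum() / n_hard)
```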
Step 30, determining the start and end rows of each cell based on the position coordinates of the table grid lines.
A TSRA table structure inference model is built to recognize the table structure. The position coordinates of the table grid lines are input into the TSRA model, which recognizes the table structure from the table-line positions, deducing the start and end rows and columns of each cell from the table lines and thereby completing table structure recognition. The start and end rows of the cells characterize the table structure.
Existing table structure inference algorithms are prone to errors on tables with merged cells, so an efficient algorithm is proposed that deduces the start and end rows and columns of the cells from the table lines and then parses the table structure.
The table is reconstructed from the cell position information. First, the start and end columns of the cells are deduced: the start/end x coordinates (xmin, xmax) of the upper-left and lower-right corners of all cells are extracted and sorted in ascending order, and a mapping dictionary edgesMap {'actual x value': 'theoretical x value'} is built. The purpose is that two vertically adjacent cells in a real sample may have slightly different xmin or xmax values that are theoretically the same value; a threshold is therefore set, and two x values are considered the same theoretical value only if their difference does not exceed the threshold. From the mapping dictionary a list of theoretical values is obtained, sorted in ascending order and processed into a data structure edgesMapIndex {theoretical x0: 0, theoretical x1: 0, theoretical x2: 1, theoretical x3: 2, ...}. Finally, all cells are traversed; for each cell, the start/end x coordinates of its upper-left and lower-right corners are taken, the theoretical values are looked up in edgesMap, and the start and end columns are then obtained from edgesMapIndex. The start and end row information of the cells can be deduced in the same way. This table structure inference algorithm handles the various scenarios of cells merged across rows and columns, runs fast, and generalizes well.
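As an illustration of the column-inference procedure just described, the following Python sketch builds edgesMap by clustering nearby x-edges under a tolerance threshold and then derives start and end columns from the sorted theoretical edges; the function and variable names and the pixel threshold of 8 are assumptions, and the row case is symmetric in y.

```python
# A minimal sketch, assuming axis-aligned cell boxes and a pixel tolerance of 8.
def infer_columns(cells, thresh=8):
    """cells: list of (xmin, ymin, xmax, ymax); returns (start_col, end_col) per cell."""
    xs = sorted({x for c in cells for x in (c[0], c[2])})
    edges_map, theoreticals = {}, []
    for x in xs:
        if theoreticals and x - theoreticals[-1] <= thresh:
            edges_map[x] = theoreticals[-1]      # close enough: same theoretical edge
        else:
            theoreticals.append(x)               # a new theoretical edge value
            edges_map[x] = x
    edges_map_index = {v: i for i, v in enumerate(theoreticals)}
    result = []
    for xmin, _, xmax, _ in cells:
        c0 = edges_map_index[edges_map[xmin]]
        c1 = edges_map_index[edges_map[xmax]] - 1   # last column the cell spans
        result.append((c0, c1))
    return result

# e.g. infer_columns([(0, 0, 100, 40), (98, 0, 200, 40), (0, 40, 200, 80)])
# -> [(0, 0), (1, 1), (0, 1)]   (the third cell merges two columns)
```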
Step 40, performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character.
A character position recognition model is built to perform single-character position segmentation on the input form image and determine the character position information: it performs character recognition, recognizes the character positions, and determines each character of the input image together with its position. The character position recognition model is an improved character recognition model that recognizes both the characters and their positions. Specifically, the form image to be recognized is input into the character position recognition model to determine the character position information.
Step 50, aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
An aggregation model of text content and cells is built to match text content with cells. After the table structure and the character position information have been recognized, the start and end rows of the cells corresponding to the table structure and the character position information are input into the aggregation model, which aggregates the text content with the cells accordingly, matching each piece of text to its cell and thereby generating the table file. Through the above steps, the conversion from form image to table file, also called table file reconstruction, is achieved. The generated table file may be an Excel spreadsheet, a Word file or any other format; the embodiment of the application does not limit it.
According to the form image recognition method provided by the application, the image to be recognized is semantically segmented to obtain a semantic segmentation binarization map; image correction, table-line extraction and cell position determination are completed on the basis of that map; table structure recognition and recognition of the text inside the table are then performed; and finally the text content is aggregated with the cells to obtain a table file. Because table structure recognition and text position recognition are performed separately after semantic segmentation, the accuracy of form detection and localization is improved; moreover, efficient conversion from form image to table file is achieved, which tangibly improves users' office efficiency.
In one embodiment, referring to FIG. 3, step 20 of performing table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines includes:
step 201, extracting connected regions from the semantic segmentation binarization map;
step 202, fitting frame lines to the connected regions to determine the table grid lines to be recognized;
step 203, determining the position coordinates of the table grid lines based on the minimum circumscribed rectangle corresponding to each table line to be recognized.
In this embodiment, geometric analysis is performed on the corrected semantic segmentation binarization map: connected regions are first extracted, frame lines are then fitted to them, and the minimum circumscribed rectangles are finally computed to obtain the position coordinates of all cells.
Because the table-line pixels in the semantic segmentation binarization map are not necessarily continuous, determining the position coordinates of the table lines of all cells by computing the minimum circumscribed rectangle of each connected region improves the accuracy of the cell table-line position calculation, and thus the accuracy of table detection and localization in the form image.
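A brief OpenCV sketch of this geometric analysis, under the assumption that the binarized line map is an 8-bit mask, might look as follows; the minimum-area filter value is illustrative.

```python
# A minimal sketch, assuming a uint8 {0, 255} line mask and an area filter of 50 px.
import cv2
import numpy as np

def line_boxes(binary_map: np.ndarray, min_area: int = 50):
    """binary_map: uint8 single-channel line mask; returns one 4-point box per line."""
    n, labels = cv2.connectedComponents(binary_map)
    boxes = []
    for i in range(1, n):                           # label 0 is the background
        ys, xs = np.where(labels == i)
        if len(xs) < min_area:
            continue                                # drop small noise components
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)                 # fit the minimum circumscribed rectangle
        boxes.append(cv2.boxPoints(rect))           # 4 corner coordinates of the line
    return boxes
```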
In one embodiment, step 40 of performing single-character position segmentation on the form image to be recognized to determine the character position information includes:
inputting the form image to be recognized into a convolutional recurrent neural network to obtain a probability matrix of sequence-vector length by number of character classes output by the network;
using the probability matrix to determine character segmentation positions and thereby obtain the character position information; if a character is a Chinese character, determining the index of a first blank sequence vector as the start index of the Chinese character, the start index being used to determine the start coordinate of the Chinese character; if a character is a character other than a Chinese character, determining the index of a second blank sequence vector as the start index of that character, that start index being used to determine its start coordinate; the start coordinates are used to determine the character segmentation positions.
Existing OCR models lack character-level position information, and two traditional segmentation approaches exist. The first is projection segmentation: the pixel distribution histogram of the binarized picture is analyzed to find the boundary points between adjacent characters for segmentation. The second is connected-domain segmentation: the image is first binarized and morphologically opened to remove small noise, the eight-connected regions in the image are then labeled, and the frame-line features of each connected region, including its upper-left corner coordinates and its width and height, are computed and used to segment the characters.
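For the projection-based approach mentioned above, a short numpy sketch is given below: column sums of the binarized text-line image form the histogram, and the zero-valleys between ink runs are taken as character boundaries. The input convention (1 on ink pixels) is an assumption.

```python
# A minimal sketch, assuming a binary (H, W) array with 1 on ink pixels.
import numpy as np

def projection_cuts(binary_line: np.ndarray):
    """Returns (start, end) x-spans, one per character region."""
    profile = binary_line.sum(axis=0)        # vertical projection histogram
    ink = profile > 0
    spans, start = [], None
    for x, on in enumerate(ink):
        if on and start is None:
            start = x                        # entering a character region
        elif not on and start is not None:
            spans.append((start, x))         # leaving it: record the boundary
            start = None
    if start is not None:
        spans.append((start, len(ink)))      # region runs to the right edge
    return spans
```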
The character segmentation steps using a CTC decoding strategy are as follows. The convolutional recurrent neural network outputs a probability matrix of sequence-vector length by number of character classes, where each sequence vector corresponds to a region of the original image at a certain stride. The CTC stage yields the indices of the non-blank sequence vectors; for Chinese characters, the index of the first blank sequence vector can be taken as the start index, while other characters use the index of the second blank sequence vector, from which their start coordinates are inferred.
The first blank sequence vector is associated with Chinese characters: it is the leading blank portion of the sequence vectors corresponding to a Chinese character and is used to adjust the localization of the Chinese character's position. The second blank sequence vector is associated with other characters: it is the leading blank portion of the sequence vectors corresponding to those characters and is used to adjust the localization of their positions. Determining the first and second blank sequence vectors may include determining the length and the type of the recognized sequence vectors, and deriving the blank vectors from that length and type: if the length of the sequence vector exceeds a set threshold (generally 1), the blank sequence vector is extended by 0.5, otherwise it is unchanged; if the sequence is a Chinese sequence, the blank sequence vector is extended by 1. For example, if the sequence-vector length of a Chinese character exceeds 1, its first blank sequence vector is the preceding 1.5 blank sequence vectors; if the sequence-vector length of another character exceeds 1, its second blank sequence vector may be the preceding 0.5 blank sequence vectors.
In this embodiment, different index strategies are adopted for different character types, making the character segmentation positions more accurate, improving character segmentation accuracy, and thereby improving the accuracy of character position recognition.
In addition, the CTC decoding part is implemented with numpy arrays, which further speeds up CTC decoding.
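The following numpy sketch illustrates how per-character start coordinates can be recovered from the CTC probability matrix in the spirit of the strategy above: a greedy decode finds non-blank steps, each step maps back to image pixels through the network stride, and the start index is shifted into the preceding blank run by 0.5 or 1 step. The blank id of 0, the stride value and the simplified offset rule are assumptions of this sketch, not the application's exact procedure.

```python
# A minimal sketch, assuming blank id 0, stride 4 and a simplified offset rule.
import numpy as np

def ctc_char_positions(logits: np.ndarray, stride: int = 4, blank: int = 0):
    """logits: (T, num_classes); returns [(class_id, start_x)] per character."""
    ids = logits.argmax(axis=1)                 # greedy CTC path
    chars, prev = [], blank
    for t, c in enumerate(ids):
        if c != blank and c != prev:            # collapse repeats, skip blanks
            b = t
            while b > 0 and ids[b - 1] == blank:
                b -= 1                          # count blanks just before this char
            n_blank = t - b
            shift = 1.0 if n_blank > 1 else 0.5 # borrow part of the blank run
            start = max(t - shift, 0.0) * stride  # map sequence index to pixels
            chars.append((int(c), start))
        prev = c
    return chars
```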
In one embodiment, referring to FIG. 4, step 50 of aggregating the cells with their corresponding text content based on the start and end rows and the character position information to generate the table file includes:
step 51, determining an aggregation model of text content and cells, the aggregation model being used to match text content with its corresponding cell;
step 52, inputting the start and end rows and the character position information into the aggregation model, and aggregating the character boxes with the cell boxes to generate the table file;
the algorithms applied by the aggregation model include a center rule and a maximum-IOU rule.
An aggregation model of text content and cells is built to match text content with cells. After the table structure and the character position information have been recognized, the start and end rows of the cells corresponding to the table structure and the character position information are input into the aggregation model, which aggregates the text content with the cells accordingly, thereby generating the table file and completing the conversion from form image to table file, also called table file reconstruction.
In this embodiment, aggregating the character boxes with the cell boxes according to the center rule and the maximum-IOU rule yields the text inside each cell.
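A hedged sketch of this aggregation step follows: a character box is assigned to the cell containing its center point (center rule), and if no cell contains the center, the cell with the maximum IOU is chosen. The box format and tie-breaking are assumptions of this sketch.

```python
# A minimal sketch, assuming (xmin, ymin, xmax, ymax) boxes.
def assign_char_to_cell(char_box, cell_boxes):
    """Returns the index of the cell matched to the character box."""
    cx = (char_box[0] + char_box[2]) / 2.0
    cy = (char_box[1] + char_box[3]) / 2.0
    for i, (x0, y0, x1, y1) in enumerate(cell_boxes):
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return i  # center rule: the cell contains the character's center

    def iou(a, b):  # fallback: maximum IOU between character box and cell box
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    return max(range(len(cell_boxes)), key=lambda i: iou(char_box, cell_boxes[i]))
```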
In one embodiment, referring to FIG. 5, the image correction of step 20 performed on the semantic segmentation binarization map includes:
step 211, determining a rotated-object detection model with a corner regression algorithm;
step 212, inputting the semantic segmentation binarization map into the rotated-object detection model, and determining the table region and table corner points of the semantic segmentation binarization map;
the table region and the table corner points are used to perform image correction on the semantic segmentation binarization map.
In this embodiment, a rotated-object detection model with a corner regression algorithm is provided; it detects the table region and the four corners of the table simultaneously, so that perspective transformation can realize image correction.
In one embodiment, the rotated-object detection model is an improved network based on the YOLOv5 network, whose structure adds an angle classification branch and a keypoint regression branch to the YOLOv5 structure.
Form detection, i.e. detecting the form's bounding box, must handle more than one form on the same picture, so a form detection model is required. In real scenes the picture may be skewed or distorted, so a non-rotated object detection algorithm (which frames the target with an axis-aligned rectangular box) does not apply well. In addition, conventional table algorithms lack a corner regression module and cannot localize the corner position information of the table well, so perspective transformation cannot be used to correct the image.
The rotated-object detection model with the corner regression algorithm helps localize the table accurately and provides more accurate initial results for subsequent higher-order tasks such as recognition and analysis. The model adopts a rotated-object detection algorithm improved from YOLOv5: the main implementation adds an angle classification branch and a keypoint regression branch to the original YOLOv5 structure to realize rotated-object detection and corner regression, and the L1 loss can be replaced with Wing loss to make the keypoint regression more accurate.
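The two ingredients named above can be sketched as follows: a Wing loss for the corner regression branch and a perspective correction built from the four detected table corners. The Wing hyperparameters (w = 10, epsilon = 2) follow the original Wing-loss paper's common defaults and, like the corner ordering, are assumptions here.

```python
# A minimal sketch, assuming default Wing parameters and TL, TR, BR, BL corner order.
import numpy as np
import cv2

def wing_loss(pred: np.ndarray, target: np.ndarray,
              w: float = 10.0, eps: float = 2.0) -> float:
    x = np.abs(pred - target)
    c = w - w * np.log(1 + w / eps)             # keeps the two pieces continuous
    return float(np.where(x < w, w * np.log(1 + x / eps), x - c).mean())

def rectify(image: np.ndarray, corners: np.ndarray, out_w: int, out_h: int):
    """corners: (4, 2) float32, ordered TL, TR, BR, BL; returns the warped table."""
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(np.float32(corners), dst)
    return cv2.warpPerspective(image, M, (out_w, out_h))
```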
The form image recognition apparatus provided by the application is described below; the apparatus described below and the form image recognition method described above may refer to each other correspondingly.
The application provides a form image recognition apparatus, comprising:
a semantic segmentation module, configured to perform table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map;
a table-line position determining module, configured to sequentially perform image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines;
a cell determining module, configured to determine the start and end rows of each cell based on the position coordinates of the table grid lines;
a character segmentation module, configured to perform single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character;
and an aggregation module, configured to aggregate the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
Further, the table-line position determining module is further configured to:
extract connected regions from the semantic segmentation binarization map;
fit frame lines to the connected regions to determine the table grid lines to be recognized;
and determine the position coordinates of the table grid lines based on the minimum circumscribed rectangle corresponding to each table line to be recognized.
Further, the character segmentation module is further configured to:
input the form image to be recognized into a convolutional recurrent neural network to obtain a probability matrix of sequence-vector length by number of character classes output by the network;
use the probability matrix to determine character segmentation positions and thereby obtain the character position information; if a character is a Chinese character, determine the index of a first blank sequence vector as the start index of the Chinese character, the start index being used to determine the start coordinate of the Chinese character; if a character is a character other than a Chinese character, determine the index of a second blank sequence vector as the start index of that character, that start index being used to determine its start coordinate; the start coordinates are used to determine the character segmentation positions.
Further, the aggregation module is further configured to:
input the start and end rows and the character position information into an aggregation model, and aggregate the character boxes with the cell boxes to generate the table file;
the aggregation model is used to match text content with its corresponding cell, and the algorithms applied by the aggregation model include a center rule and a maximum-IOU rule.
Further, the table-line position determining module is further configured to:
input the semantic segmentation binarization map into a rotated-object detection model with a corner regression algorithm, and determine the table region and table corner points of the semantic segmentation binarization map;
the table region and the table corner points are used to perform image correction on the semantic segmentation binarization map.
Further, the rotated-object detection model is an improved network based on the YOLOv5 network, whose structure adds an angle classification branch and a keypoint regression branch to the YOLOv5 structure.
FIG. 6 illustrates a physical schematic diagram of an electronic device. As shown in FIG. 6, the electronic device may include: a processor 610, a communications interface 620, a memory 630 and a communication bus 640, wherein the processor 610, the communications interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the form image recognition method, which comprises: performing table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map; sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines; determining the start and end rows of each cell based on the position coordinates of the table grid lines; performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character; and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the application also provides a computer program product comprising a computer program; the computer program may be stored on a non-transitory computer-readable storage medium and, when executed by a processor, can perform the form image recognition method provided by the methods above, the method comprising: performing table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map; sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines; determining the start and end rows of each cell based on the position coordinates of the table grid lines; performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character; and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
In yet another aspect, the application also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the form image recognition method provided by the methods above, the method comprising: performing table-line pixel labeling on the form image to be recognized to obtain a semantic segmentation binarization map; sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines; determining the start and end rows of each cell based on the position coordinates of the table grid lines; performing single-character position segmentation on the form image to be recognized to determine character position information, where the character position information represents the position of each character; and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the application.

Claims (10)

1. A form image recognition method, comprising:
performing table-line pixel labeling on a form image to be recognized to obtain a semantic segmentation binarization map;
sequentially performing image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine position coordinates of table grid lines;
determining start and end rows of cells based on the position coordinates of the table grid lines;
performing single-character position segmentation on the form image to be recognized to determine character position information, wherein the character position information represents the position of each character;
and aggregating the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
2. The form image recognition method of claim 1, wherein performing table-line extraction and table-line position determination on the semantic segmentation binarization map to determine the position coordinates of the table grid lines comprises:
extracting connected regions from the semantic segmentation binarization map;
fitting frame lines to the connected regions to determine table grid lines to be recognized;
and determining the position coordinates of the table grid lines based on the minimum circumscribed rectangle corresponding to each table line to be recognized.
3. The form image recognition method of claim 1, wherein performing single-character position segmentation on the form image to be recognized to determine the character position information comprises:
inputting the form image to be recognized into a convolutional recurrent neural network to obtain a probability matrix of sequence-vector length by number of character classes output by the network;
using the probability matrix to determine character segmentation positions and thereby obtain the character position information; if a character is a Chinese character, determining the index of a first blank sequence vector as the start index of the Chinese character, the start index being used to determine the start coordinate of the Chinese character; if a character is a character other than a Chinese character, determining the index of a second blank sequence vector as the start index of that character, that start index being used to determine its start coordinate; the start coordinates being used to determine the character segmentation positions.
4. The form image recognition method of claim 1, wherein aggregating the cells with their corresponding text content based on the start and end rows and the character position information to generate the table file comprises:
inputting the start and end rows and the character position information into an aggregation model, and aggregating character boxes with cell boxes to generate the table file;
wherein the aggregation model is used to match text content with its corresponding cell, and the algorithms applied by the aggregation model include a center rule and a maximum-IOU rule.
5. The form image recognition method of claim 1, wherein performing image correction on the semantic segmentation binarization map comprises:
inputting the semantic segmentation binarization map into a rotated-object detection model with a corner regression algorithm, and determining a table region and table corner points of the semantic segmentation binarization map;
wherein the table region and the table corner points are used to perform image correction on the semantic segmentation binarization map.
6. The form image recognition method of claim 5, wherein the rotated-object detection model is an improved network based on the YOLOv5 network, whose structure adds an angle classification branch and a keypoint regression branch to the YOLOv5 structure.
7. A form image recognition apparatus, comprising:
a semantic segmentation module, configured to perform table-line pixel labeling on a form image to be recognized to obtain a semantic segmentation binarization map;
a table-line position determining module, configured to sequentially perform image correction, table-line extraction and table-line position determination on the semantic segmentation binarization map to determine position coordinates of table grid lines;
a cell determining module, configured to determine start and end rows of cells based on the position coordinates of the table grid lines;
a character segmentation module, configured to perform single-character position segmentation on the form image to be recognized to determine character position information, wherein the character position information represents the position of each character;
and an aggregation module, configured to aggregate the cells with their corresponding text content, based on the start and end rows and the character position information, to generate a table file.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the form image recognition method of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the form image recognition method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the form image recognition method of any one of claims 1 to 6.
CN202211528361.4A 2022-11-30 2022-11-30 Form image recognition method, device, equipment and medium Pending CN116912865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211528361.4A 2022-11-30 2022-11-30 Form image recognition method, device, equipment and medium (CN116912865A)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211528361.4A 2022-11-30 2022-11-30 Form image recognition method, device, equipment and medium (CN116912865A)

Publications (1)

Publication Number Publication Date
CN116912865A 2023-10-20

Family

ID=88361487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211528361.4A Pending CN116912865A (en) 2022-11-30 2022-11-30 Form image recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116912865A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173725A (en) * 2023-11-03 2023-12-05 之江实验室 Table information processing method, apparatus, computer device and storage medium
CN117173725B (en) * 2023-11-03 2024-04-09 之江实验室 Table information processing method, apparatus, computer device and storage medium
CN118379753A (en) * 2024-06-25 2024-07-23 万村联网数字科技有限公司 Method and system for extracting bad asset contract key information by utilizing OCR technology

Similar Documents

Publication Publication Date Title
Tang et al. Scene text detection using superpixel-based stroke feature transform and deep learning based region classification
CN110647829A (en) Bill text recognition method and system
CN109740606B (en) Image identification method and device
CN116912865A (en) Form image recognition method, device, equipment and medium
CN111401353B (en) Method, device and equipment for identifying mathematical formula
CN113537227B (en) Structured text recognition method and system
US20090041361A1 (en) Character recognition apparatus, character recognition method, and computer product
CN111626146A (en) Merging cell table segmentation and identification method based on template matching
CN111091124B (en) Spine character recognition method
CN110598581B (en) Optical music score recognition method based on convolutional neural network
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
WO2024041032A1 (en) Method and device for generating editable document based on non-editable graphics-text image
CN108629286A (en) A kind of remote sensing airport target detection method based on the notable model of subjective perception
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
JP3228938B2 (en) Image classification method and apparatus using distribution map
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
Li et al. Instance aware document image segmentation using label pyramid networks and deep watershed transformation
Kataria et al. CNN-bidirectional LSTM based optical character recognition of Sanskrit manuscripts: A comprehensive systematic literature review
Wicht et al. Camera-based sudoku recognition with deep belief network
CN114581928A (en) Form identification method and system
CN112200789B (en) Image recognition method and device, electronic equipment and storage medium
Sharma et al. Primitive feature-based optical character recognition of the Devanagari script
CN118135584A (en) Automatic handwriting form recognition method and system based on deep learning
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination