CN113537227B - Structured text recognition method and system

Publication number: CN113537227B (granted from application CN202110720402.9A; published earlier as application CN113537227A)
Authority: CN (China)
Inventors: Zhang Yanguang (张彦光), Gao Fei (高飞)
Applicant and assignee: Hangzhou Dianzi University
Legal status: Active (granted)

Abstract

The invention discloses a structured text recognition method and system which not only recognize single-line text but can also perform multi-level text analysis and structured text assembly. The invention provides targeted solutions to the difficulties of text detection and recognition. For curved text, a scheme of segmentation along the midpoint trajectory is proposed: the text is split according to its local slope, realizing the segmentation of curved text. For the difficulty of segmenting formulas during text recognition, a formula merging and segmentation strategy is provided. For detecting tables in an image, a method that detects the horizontal and vertical lines to segment and recognize the table is provided.

Description

Structured text recognition method and system
Technical Field
The invention belongs to the field of natural language processing and particularly relates to a structured text recognition method and system. Starting from images of text data, the invention establishes a structured recognition pipeline from image to text, formulas and the like, involving a table detection algorithm, a picture detection algorithm, a text segmentation algorithm, a text merging algorithm and a text recognition algorithm.
Background
Text recognition extracts text, formulas and images from an image of written material that has been scanned or photographed by a machine, using recognition algorithms. Its range of application is very wide: it covers not only education and medicine but is also used extensively in the product development of large enterprises. A good text recognition model must be fast, have a low misrecognition rate, and be stable and easy to use.
In recent years, with the rapid development of information automation, text recognition algorithms have also advanced greatly. A conventional text recognition pipeline starts from the image itself, performing preprocessing such as binarization, image enhancement and skew correction, and then extracts the text information through layout analysis, image segmentation, text segmentation and character recognition. Deep learning pipelines, after layout analysis, perform text detection with a text detection network such as DB-Net and then apply a character recognition method. Different detection and recognition algorithms behave differently on different samples, so how to combine and design the algorithms is the key issue in text recognition technology.
In practice, text recognition is applied most widely in education, for example in the common photo-search apps: a photograph is recognized by OCR and then matched against a database, so that similar problems can be retrieved and analyzed. Text recognition can also automatically read personal information such as names and student numbers on examination papers, which greatly helps teachers with grading; it can even judge examinees' answers and score them automatically, saving a large amount of marking time.
An intelligent text recognition system places strict requirements on the text recognition algorithm, and the design of a structured text recognition algorithm mainly involves the following difficulties:
(1) For mathematical subjects, large numbers of stacked formulas interfere strongly with text detection, and some formulas occupy two lines of space; if they are not handled, a one-line formula may be recognized as two lines, corrupting the recognized text structure. Formulas occupying two lines of space therefore need to be merged so that the reading order of the text is not disturbed.
(2) For the text detection part, each detected text box must contain only a single line of text; if several lines are detected in one box, recognition errors follow. Processing photographed pictures additionally involves recognizing curved text and recognizing formulas separately from ordinary text; both problems require the text to be segmented, and determining the segmentation rules is one of the difficulties.
(3) Tables in the image need to be recognized separately from the text, so the table must be found first. If this is done with deep learning alone, an ordinary object detection network cannot make full use of the spatial structure of the table and does not help the later text extraction, so a suitable table detection network must be found; the text inside the table can then be recognized with image-based methods. The picture detection model follows the paper "CenterNet: Keypoint Triplets for Object Detection", hereinafter CenterNet. As an object detection network, CenterNet differs from conventional detectors in that it detects objects through their center points; it is trained with standard supervision and obtains its result through a single forward pass, so it needs none of the post-processing required by conventional object detection networks, which guarantees the speed of picture detection in the invention.
(4) The Chinese/English recognition model follows the paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition", hereinafter CRNN. The network comprises three parts: a feature extraction part (CNN), a sequence prediction part (RNN) and a transcription part (CTC). The picture is first rescaled so that its height becomes 32; CNN feature extraction then yields a 512 x 1 x w feature map, the feature vectors extracted from this map are fed into a bidirectional LSTM for training to obtain a posterior probability matrix, the result is encoded, and CTC decoding finally yields the text information.
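The CRNN pipeline described above can be summarized in a short sketch. This is a minimal illustration assuming PyTorch; the 32-pixel input height, the 512 x 1 x w feature map and the CNN-BiLSTM-CTC layout follow the description, while all other layer sizes and names are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # CNN backbone: reduces a 1 x 32 x W grayscale image to a 512 x 1 x w map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),   # 64 x 16 x W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2), # 128 x 8 x W/4
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                                            # 256 x 4 x W/4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),                                            # 512 x 1 x w
        )
        # Sequence model: bidirectional LSTM over the width dimension.
        self.rnn = nn.LSTM(512, 256, bidirectional=True, batch_first=True)
        # Per-timestep class scores, decoded with CTC (blank = class 0).
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                  # x: (N, 1, 32, W)
        f = self.cnn(x)                    # (N, 512, 1, w)
        f = f.squeeze(2).permute(0, 2, 1)  # (N, w, 512): sequence of feature vectors
        out, _ = self.rnn(f)               # (N, w, 512)
        return self.fc(out)                # posterior matrix for CTC decoding
```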
(5) The formula recognition module follows the paper "What You Get Is What You See: A Visual Markup Decompiler", hereinafter WYGIWYS. Unlike recognition of unstructured text, it must recognize not only the characters but also their relative positions, their sizes and the LaTeX mathematical symbols in the formula. The formula recognition network likewise extracts features with a CNN in the early stage, then encodes the feature map row by row with an RNN, and processes the encoding with a visual attention mechanism to obtain the output. The attention mechanism drives the decoding: the decoder is also an RNN, into which the context vectors produced during encoding are fed step by step; the result then enters a fully connected layer that outputs one-hot vectors, the final output is a LaTeX sequence of the structured text, and combining the outputs yields the required formula text information.
Disclosure of Invention
Aiming at text that contains structured parts such as formulas, figures and tables, the invention provides a structured text recognition method and system which can be applied to text recognition for general subjects and performs especially well on mathematical subjects.
A structured text recognition method comprises the following steps:
Step (1), table detection and recognition:
A deep learning model is adopted to detect tables: the horizontal and vertical lines in the document picture are predicted with the semantic segmentation network U-Net, and the table in the picture is extracted from these line segments. The text information in the table is then extracted and assembled according to the table segmentation rule, the result is checked with the table judgment rule, and finally the table-removed picture is obtained.
Step (2), inset picture detection:
In the early stage, besides removing the interference of tables, the position of each inset picture must be extracted so that the pictures do not affect the later text detection. The inset pictures are detected with the object detection network CenterNet, which yields the position of each picture, i.e. the picture coordinate information.
Step (3), text detection:
A method that searches for recognition regions line by line, taking each text row as a region, is adopted. An image from which the tables have been removed mainly contains three parts: text lines, figures and formulas. The table-removed picture is first converted to a grayscale image; with threshold-based inverse binarization, the pixel values of the text in the picture are set to 255 and the background to 0. The image is then dilated with a 7 x 7 kernel, the connected regions of the dilated image are computed under 8-connectivity, and the bounding rectangle attributes of the connected regions give the approximate text lines in the image, i.e. the approximate text line image. Underlines are removed from the approximate text line image, and the underline-free text line image is merged twice in the horizontal and vertical directions according to the text box attributes, giving the final text boxes.
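A minimal sketch of this detection step, assuming Python with OpenCV and NumPy; the Otsu threshold choice and the function names are illustrative assumptions, while the 255/0 inversion, the 7 x 7 kernel and the 8-connectivity come from the text.

```python
import cv2
import numpy as np

def approximate_text_lines(detabled_bgr):
    gray = cv2.cvtColor(detabled_bgr, cv2.COLOR_BGR2GRAY)
    # Inverse binarization: text -> 255, background -> 0.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate with a 7x7 kernel so broken strokes of one line become connected.
    dilated = cv2.dilate(binary, np.ones((7, 7), np.uint8))
    # 8-connected components; each stats row is [x, y, w, h, area].
    n, _, stats, _ = cv2.connectedComponentsWithStats(dilated, connectivity=8)
    # Skip label 0 (background); each remaining bounding rect is an
    # approximate text line.
    return [tuple(stats[i][:4]) for i in range(1, n)]
```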
Step (4), text segmentation and merging:
For each extracted text box, the trajectory of the midpoints inside the text region is first obtained, and the text box is split along this midpoint trajectory, so that curved text is segmented and the text boxes fit the text tightly. The segmented text is then merged in the vertical direction according to the characteristics of formula text with an upper/lower structure, which solves the formula detection problem.
Step (5), text recognition:
First, the text lines containing a two-line formula are split into upper and lower text boxes that enclose a middle horizontal line. A horizontal merging operation is applied to the upper and lower boxes to keep each formula consistent; the box corresponding to the middle horizontal line is found from the width and height attributes of the text box images and deleted, completing the conversion of one two-line formula into two single-line formulas, and the positions of these text boxes are marked and stored. For separating a single-line formula from a single line of text, another approach is used: the text is first recognized, then, using the Chinese/English recognition model, the single-line formula text row is traversed and segmented according to the recognition result. Each character of the recognition result is judged in turn to find the positions of Chinese characters, digits and characters that are neither Chinese nor English; from this position information, the non-Chinese regions between Chinese characters are judged, single variables shorter than 2 characters are excluded, and all formula positions are collected and stored. Finally, all formula texts and all Chinese/English texts are fed into the WYGIWYS model and the CRNN model respectively for recognition.
Step (6), post-processing:
All the table coordinate information, the picture coordinate information, the Chinese/English text information with its coordinates and the formula text information with its coordinates are combined to finally obtain the structured text information.
The segmentation rule and the table judgment rule of step (1):
First, every pair of a horizontal line and a vertical line is tested for intersection, giving an m x n matrix, where m is the number of horizontal lines and n is the number of vertical lines, and table structure analysis is performed on this matrix. In the matrix, 1 denotes an intersection and 0 denotes no intersection, so the intersection coordinates can be computed for every 1 in the matrix; at the same time, the cells of the table are labeled according to the matrix and the cell Box information is stored.
After the table is segmented it is validated. Two conditions must hold for a table: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the leftmost and rightmost line segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the topmost and bottommost segments must be approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
The dilation operation in step (3) is specifically as follows:
The dilation uses the dilate method in OpenCV; its purpose is to thicken the fonts so that a text line whose strokes are not connected becomes connected, which facilitates the subsequent Box extraction. The first dilation uses a 7 x 7 kernel and the second a 15 x 1 kernel.
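The two dilation passes might look as follows as a minimal OpenCV sketch; reading the "15 x 1" kernel as a horizontal bar (1 row x 15 columns) is an assumption about the patent's notation.

```python
import cv2
import numpy as np

def dilate_pass1(binary):
    # 7 x 7 square kernel: thickens strokes in every direction.
    return cv2.dilate(binary, np.ones((7, 7), np.uint8))

def dilate_pass2(binary):
    # Horizontal bar kernel: connects characters within one line.
    return cv2.dilate(binary, np.ones((1, 15), np.uint8))
```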
The rule for removing underlines from text lines in step (3) is specifically as follows:
First, the length and width of each approximate text line are obtained from the bounding rectangle attributes, and traversing all bounding rectangles with these values gives the approximate average text line height MedianHeight. Bounding rectangles whose height is smaller than 0.1 x MedianHeight are selected as target rectangles, the underline segments inside the target rectangles are detected with LSD line detection, and the pixels of the image where these segments lie are set to 0, giving a line-removed image; the inverse binarization, dilation and connected component steps are then applied again to obtain the text lines with the line segments removed.
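A sketch of this rule, assuming OpenCV's LSD detector (cv2.createLineSegmentDetector) and boxes in (x, y, w, h) form; taking the median as the "average text line height" is an assumption.

```python
import cv2
import numpy as np

def remove_underlines(binary, boxes):
    if not boxes:
        return binary
    median_height = float(np.median([h for (_, _, _, h) in boxes]))
    lsd = cv2.createLineSegmentDetector()
    for (x, y, w, h) in boxes:
        # Rectangles far flatter than a normal text line are underline
        # candidates (height below 0.1 x MedianHeight).
        if h < 0.1 * median_height:
            lines = lsd.detect(binary[y:y + h, x:x + w])[0]
            if lines is None:
                continue
            for x1, y1, x2, y2 in lines.reshape(-1, 4).astype(int):
                # Erase the detected segment (set its pixels to 0).
                cv2.line(binary, (x + x1, y + y1), (x + x2, y + y2), 0, 2)
    return binary  # re-run binarization/dilation/components on this image
```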
The combination in the step (3) is specifically as follows:
in the step (3), two horizontal merging and vertical merging are performed, the first horizontal merging and the second vertical merging are performed, in the text line of the same row, two sections are extracted when a section of text is extracted from the connected region due to the segmentation of punctuation marks, so that the text line is merged into one row according to the coordinate characteristics of Box, the blurred fonts can lead to the recognition of a single row of text into two rows, the vertical merging is performed, firstly, text frames are sorted from small to large according to the X-axis direction, the sorted circumscribed rectangular frames are merged in pairs, and if the maximum value of the X-axis of the former circumscribed rectangular frame and the minimum value of the latter circumscribed rectangular frame are smaller than 0.5X-radius height, the two rectangular frames are merged, and the like. And merging the two rectangular frames, and the like, so as to finally obtain a target rotating rectangular frame, wherein the text frames are sorted from small to large according to the Y-axis direction, the sorted external rectangular frames are merged in pairs, and if the maximum value of the Y-axis of the former external rectangular frame and the minimum value of the latter external rectangular frame are smaller than 0.3 x MidianHeight, the two rectangular frames are merged, and the like, and the repeated frames are removed for the text lines containing formulas for the second vertical merging. After the processing, the text boxes are missed, small characters in the formulas are detected, the boxes in the same row are required to be combined and judged in the horizontal direction once, firstly, the angle of the once target rotating rectangular boxes is calculated, if the distance between the two target rotating rectangular boxes is 0.3 x and the angle deviation is within 5 degrees, the two target rotating rectangular boxes are combined, finally, the two target rotating rectangular boxes are judged in the vertical direction, and if the IOU of the external rectangular boxes of the two texts is larger than 0.2 of the external rectangular area of the minimum external rectangular box, the final text box is obtained.
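The pairwise horizontal merge with the 0.5 x MedianHeight gap threshold might be sketched as follows; the greedy single pass over X-sorted (x, y, w, h) boxes is an illustrative reading of "merge pairwise ... and so on".

```python
def merge_horizontal(boxes, median_height):
    if not boxes:
        return []
    boxes = sorted(boxes, key=lambda b: b[0])      # ascending X
    merged = [boxes[0]]
    for x, y, w, h in boxes[1:]:
        px, py, pw, ph = merged[-1]
        if x - (px + pw) < 0.5 * median_height:    # gap below the threshold
            nx, ny = min(px, x), min(py, y)
            merged[-1] = (nx, ny,
                          max(px + pw, x + w) - nx,
                          max(py + ph, y + h) - ny)  # union rectangle
        else:
            merged.append((x, y, w, h))
    return merged
```

The vertical merge would reuse the same loop over Y-sorted boxes with the 0.3 x MedianHeight threshold.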
The trajectory of points in the text region in step (4):
The text boxes finally extracted in step (3) are inverse-binarized, dilated and decomposed into connected regions to find the contour corresponding to the text line. From the contour coordinates, the midpoint of the ordinates corresponding to each abscissa is computed, giving a set of trajectory coordinates of the center of the text region. The trajectory line is divided evenly along the abscissa into a set number of segments, the divided curve is fitted with least squares, and the error between the fitted curve and the true values is examined: if the error is larger than a set threshold, the point is set as a split point; otherwise the next point is analyzed.
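A sketch of the midpoint-trajectory analysis, assuming NumPy; the degree-1 least squares fit, the default segment count and the error threshold value are illustrative assumptions.

```python
import numpy as np

def split_points(xs, mid_ys, n_segments=4, err_thresh=2.0):
    splits = []
    for seg_x, seg_y in zip(np.array_split(np.asarray(xs, float), n_segments),
                            np.array_split(np.asarray(mid_ys, float), n_segments)):
        k, b = np.polyfit(seg_x, seg_y, 1)      # least squares line fit
        errs = np.abs(k * seg_x + b - seg_y)    # per-point residuals
        # A point whose residual exceeds the threshold becomes a split point.
        splits.extend(int(x) for x, e in zip(seg_x, errs) if e > err_thresh)
    return splits  # cut the text box at these abscissas
```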
The segmentation rule of step (4):
The text box is cut at the abscissas of the split points of the midpoint trajectory inside the text region.
The vertical merging rule of step (4):
The lower bound of the following Box and the upper bound of the preceding Box are examined; if the difference is smaller than one third of the standard text Box height, the two boxes are merged vertically, with the requirement that the rotation angle of the merged rectangle is smaller than that of the rectangles before merging.
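A minimal sketch of the one-third-height gap test, assuming axis-aligned boxes (x, y, w, h); the rotation-angle condition on the merged rectangle is omitted here because it needs the rotated-rectangle form.

```python
def should_merge_vertically(prev_box, next_box, standard_height):
    _, py, _, ph = prev_box    # preceding (upper) box
    _, ny, _, _ = next_box     # following (lower) box
    gap = ny - (py + ph)       # vertical gap between the two boxes
    return gap < standard_height / 3.0
```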
The segmentation rule for two-line text in step (5) is as follows:
The text box is first segmented, and its content is segmented further by dilation and connected region analysis; after segmentation, one horizontal merging operation is performed, and the interfering horizontal line in the middle is then removed, giving the two-line formula segmentation result.
The rule for the traversal segmentation of step (5), i.e. the text line traversal, is as follows:
First, the text information obtained by CTC decoding in the Chinese/English recognition model is analyzed, and the position at which each character first appears in the text is obtained and marked. Each character is judged, and only the Chinese characters have their sequence indices linked together, giving the start index and end index of each Chinese region. Dividing the width of the text image by the sequence length of the text information gives the image width corresponding to one text character; multiplying this width by the indices of a Chinese region gives the rough position CNBox of that region in the image. In the processing of the non-formula text line image, the IOU between each Box and the CNBox is computed; if the IOU is larger than 0, the Box and the CNBox are merged into a new CNBox. The new CNBox is then fed into the CRNN network for recognition, and one character position is trimmed from an end of the new CNBox at a time until no English characters appear, finally giving the Chinese region; the remaining positions are English or formula regions, from which the formula regions and their position information within the single-line text line are finally obtained.
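The index-to-pixel mapping that produces the rough CNBox can be sketched as follows; the Unicode range used as the "is Chinese" test and the full-height box are illustrative assumptions.

```python
def chinese_region_boxes(text, img_w, img_h):
    char_w = img_w / max(len(text), 1)    # image width per decoded character
    boxes, start = [], None
    for i, ch in enumerate(text + "\0"):  # sentinel flushes the final run
        if "\u4e00" <= ch <= "\u9fff":    # CJK unified ideograph
            if start is None:
                start = i
        elif start is not None:
            # Run of Chinese characters [start, i) -> rough pixel box.
            boxes.append((int(start * char_w), 0,
                          int((i - start) * char_w), img_h))
            start = None
    return boxes
```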
The invention has the following beneficial effects:
In recent years, text recognition has been applied ever more widely and targeted text recognition algorithms have emerged one after another. Aiming at the relative scarcity of subject-specific text recognition algorithms currently on the market, the invention provides a structured text recognition method and system which not only recognize single-line text but can also perform multi-level text analysis and structured text assembly.
At the same time, the invention provides targeted solutions to the difficulties of text detection and recognition. For curved text, a scheme of segmentation along the midpoint trajectory is proposed: segmentation is performed based on the local text slope, realizing the segmentation of curved text. For the difficulty of segmenting formulas during text recognition, a formula merging and segmentation strategy is provided. For detecting tables in images, a method that detects the horizontal and vertical lines to segment and recognize the table is provided.
Drawings
FIG. 1 is an overall schematic diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of table segmentation in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the merging mode in an embodiment of the present invention;
FIG. 4 is a flow chart of two-line text segmentation in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and the embodiments.
A structured text recognition method comprises the following steps:
Step (1), table detection and recognition:
In ordinary document pictures there are usually many tables and pictures in addition to the characters. If text detection were applied to such a picture directly, the text inside the pictures and tables would be misrecognized and the typesetting after recognition would be greatly affected, so the tables and figures are extracted separately before text recognition to avoid interference with the detection of the subsequent text lines. Tables have their own recognition procedure, while pictures need no recognition at all: only their exact positions must be found. A deep learning model is adopted to detect tables: the horizontal and vertical lines in the document picture are predicted with the semantic segmentation network U-Net, and the table in the picture is extracted from these line segments. The text information in the table is then extracted and assembled according to the table segmentation rule, the result is checked with the table judgment rule, and finally the table-removed picture is obtained.
Step (2), inset picture detection:
In the early stage, besides removing the interference of tables, the position of each inset picture must be extracted so that the pictures do not affect the later text detection. The inset pictures are detected with the object detection network CenterNet, which yields the position of each picture, i.e. the picture coordinate information. CenterNet differs from other conventional object detection networks in that it finds the object center point through keypoint estimation and obtains the other object properties by regression, which makes it simpler, faster and more accurate.
Step (3), text detection:
Text line detection is an important part of recognition, and the detection accuracy directly influences the later recognition effect. This part searches for recognition regions line by line, taking each text row as a region. An image from which the tables have been removed mainly contains three parts: text lines, figures and formulas. The table-removed picture is first converted to a grayscale image; with threshold-based inverse binarization, the pixel values of the text in the picture are set to 255 and the background to 0. The image is then dilated with a 7 x 7 kernel, the connected regions of the dilated image are computed under 8-connectivity, and the bounding rectangle attributes of the connected regions give the approximate text lines in the image, i.e. the approximate text line image. Underlines are removed from the approximate text line image, and the underline-free text line image is merged twice in the horizontal and vertical directions according to the text box attributes, giving the final text boxes.
Step (4), text segmentation and merging:
The processing of the previous step obviously cannot handle the curved text and the formula text present in the document: a single-line text box cannot fit curved text tightly, which affects the subsequent recognition, while formula text may occupy two lines of space, which requires a merging operation so that a single formula is not cut into an upper and a lower block. Aiming at these two problems, segmentation and merging of the text start from the result of step 3: the trajectory of the midpoints inside the text region is first obtained for each extracted text box, the text box is split along the midpoint trajectory, completing the segmentation of curved text and ensuring that the text boxes fit tightly. The segmented text is then merged in the vertical direction according to the characteristics of formula text with an upper/lower structure, solving the formula detection problem.
Step (5), text recognition:
Text recognition must be divided into Chinese/English text recognition and formula text recognition, so the boxes extracted in the previous step must be separated further before recognition. First, the text lines containing a two-line formula are split into upper and lower text boxes that enclose a middle horizontal line. A horizontal merging operation is applied to the upper and lower boxes to keep each formula consistent; the box corresponding to the middle horizontal line is found from the width and height attributes of the text box images and deleted, completing the conversion of one two-line formula into two single-line formulas, and the positions of these text boxes are marked and stored. For separating a single-line formula from a single line of text, another approach is used: the text is first recognized, then, using the Chinese/English recognition model, the single-line formula text row is traversed and segmented according to the recognition result. Each character of the recognition result is judged in turn to find the positions of Chinese characters, digits and characters that are neither Chinese nor English; from this position information, the non-Chinese regions between Chinese characters are judged, single variables shorter than 2 characters are excluded, and all formula positions are collected and stored. Finally, all formula texts and all Chinese/English texts are fed into the WYGIWYS model and the CRNN model respectively for recognition.
Step (6), post-processing:
All the table coordinate information, the picture coordinate information, the Chinese/English text information with its coordinates and the formula text information with its coordinates are combined to finally obtain the structured text information.
The segmentation rule and the table judgment rule of step (1):
The segmentation rule: first, every pair of a horizontal line and a vertical line is tested for intersection, giving an m x n matrix, where m is the number of horizontal lines and n is the number of vertical lines, and table structure analysis is performed on this matrix, as shown in fig. 2. In the matrix of fig. 2, 1 denotes an intersection and 0 denotes no intersection, so the intersection coordinates can be computed for every 1 in the matrix; at the same time, the cells of the table are labeled according to the matrix and the cell Box information is stored.
After the table is segmented it is validated. Two conditions must hold for a table: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the leftmost and rightmost line segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the topmost and bottommost segments must be approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
The expansion operation in the step (3) is specifically as follows:
the expansion operation adopts a dialite method in OpenCV, and aims to thicken fonts, so that a section of text line which is not communicated becomes communicated, and the subsequent Box extraction is facilitated. The first expansion uses a kernel 7*7 and the second expansion uses a kernel 15 x 1, because the first expansion is to extract interference factors in the image, such as edge lines, and the second expansion takes into account the general morphological characteristics of the text field, so the expansion is performed using a cross frame.
The rule for removing underlines from text lines in step (3) is specifically as follows:
First, the length and width of each approximate text line are obtained from the bounding rectangle attributes, and traversing all bounding rectangles with these values gives the approximate average text line height MedianHeight. Bounding rectangles whose height is smaller than 0.1 x MedianHeight are selected as target rectangles, the underline segments inside the target rectangles are detected with LSD line detection, and the pixels of the image where these segments lie are set to 0, giving a line-removed image; the inverse binarization, dilation and connected component steps are then applied again to obtain the text lines with the line segments removed.
The merging in step (3) is specifically as follows:
Two rounds of horizontal and vertical merging are carried out, each round consisting of a horizontal merge followed by a vertical merge. Within one text row, a single piece of text may be extracted from the connected regions as two segments because punctuation splits it, so the segments are merged into one row according to the coordinate characteristics of the Boxes; blurred fonts can also cause a single row of text to be recognized as two rows, which calls for vertical merging. For the horizontal merge, the text boxes are first sorted in ascending order along the X axis and the sorted bounding rectangles are merged pairwise: if the gap between the maximum X value of the preceding rectangle and the minimum X value of the following rectangle is smaller than 0.5 x MedianHeight, the two rectangles are merged, and so on. The boxes are then merged once in the vertical direction: the text boxes are sorted in ascending order along the Y axis and the sorted bounding rectangles are merged pairwise; if the gap between the maximum Y value of the preceding rectangle and the minimum Y value of the following rectangle is smaller than 0.3 x MedianHeight, the two rectangles are merged, and so on, finally giving the primary target rotated rectangles. For text lines containing formulas, the second vertical merge removes duplicated boxes. After this processing, some text boxes are still missed and small characters inside formulas are detected separately, so the boxes in the same row must be judged for merging once more in the horizontal direction: the angles of the primary target rotated rectangles are computed, and if the distance between two primary target rotated rectangles is smaller than 0.3 x MedianHeight and their angle deviation is within 5 degrees, the two rectangles are merged. Finally, a judgment is made in the vertical direction: if the IOU of the bounding rectangles of two texts is larger than 0.2 of the area of the smaller bounding rectangle, they are merged, giving the final text boxes.
The trajectory of points in the text region in step (4):
The text boxes finally extracted in step (3) are inverse-binarized, dilated and decomposed into connected regions to find the contour corresponding to the text line. From the contour coordinates, the midpoint of the ordinates corresponding to each abscissa is computed, giving a set of trajectory coordinates of the center of the text region. The trajectory line is divided evenly along the abscissa into a set number of segments, the divided curve is fitted with least squares, and the error between the fitted curve and the true values is examined: if the error is larger than a set threshold, the point is set as a split point; otherwise the next point is analyzed.
The set number of segments is 4 to 5.
The segmentation rule of step (4):
The text box is cut at the abscissas of the split points of the midpoint trajectory inside the text region.
The vertical merging rule of step (4):
The lower bound of the following Box and the upper bound of the preceding Box are examined; if the difference is smaller than one third of the standard text Box height, the two boxes are merged vertically, with the requirement that the rotation angle of the merged rectangle is smaller than that of the rectangles before merging. A specific merging example is shown in fig. 3.
The segmentation rule for two-line text in step (5) is as follows:
The text box is first segmented, and its content is segmented further by dilation and connected region analysis; after segmentation, one horizontal merging operation is performed, and the interfering horizontal line in the middle is then removed, giving the two-line formula segmentation result. The specific flow is shown in fig. 4.
The rule for the traversal segmentation of step (5), i.e. the text line traversal, is as follows:
First, the text information obtained by CTC decoding in the Chinese/English recognition model is analyzed, and the position at which each character first appears in the text is obtained and marked. Each character is judged, and only the Chinese characters have their sequence indices linked together, giving the start index and end index of each Chinese region. Dividing the width of the text image by the sequence length of the text information gives the image width corresponding to one text character; multiplying this width by the indices of a Chinese region gives the rough position CNBox of that region in the image. In the processing of the non-formula text line image, the IOU between each Box and the CNBox is computed; if the IOU is larger than 0, the Box and the CNBox are merged into a new CNBox. The new CNBox is then fed into the CRNN network for recognition, and one character position is trimmed from an end of the new CNBox at a time until no English characters appear, finally giving the Chinese region; the remaining positions are English or formula regions, from which the formula regions and their position information within the single-line text line are finally obtained.
A structured text recognition system comprises a table detection and recognition module, an inset picture detection module, a text detection module, a text segmentation and merging module, a text recognition module and a post-processing module.
The table detection and recognition module adopts a deep learning model to detect tables: it predicts the horizontal and vertical lines in the document picture with the semantic segmentation network U-Net and extracts the table in the picture from these line segments. The text information in the table is then extracted and assembled according to the table segmentation rule, checked with the table judgment rule, and finally the table-removed picture is obtained.
The inset picture detection module detects the inset pictures with the object detection network CenterNet, obtaining the position of each picture, i.e. the picture coordinate information.
The text detection module performs text detection by searching for recognition regions line by line, taking each text row as a region. The table-removed picture is first converted to a grayscale image; with threshold-based inverse binarization, the pixel values of the text in the picture are set to 255 and the background to 0. The image is then dilated with a 7 x 7 kernel, the connected regions of the dilated image are computed under 8-connectivity, and the bounding rectangle attributes of the connected regions give the approximate text lines in the image, i.e. the approximate text line image. Underlines are removed from the approximate text line image, and the underline-free text line image is merged twice in the horizontal and vertical directions according to the text box attributes, giving the final text boxes.
The text segmentation and merging module segments and merges the text boxes produced by the text detection module: the trajectory of the midpoints inside the text region is first obtained for each extracted text box, the text box is split along the midpoint trajectory to complete the segmentation of curved text, and the segmented text is then merged in the vertical direction according to the characteristics of formula text with an upper/lower structure.
The text recognition module uses the WYGIWYS model and the CRNN model to recognize the formula text and the Chinese/English text respectively.
The post-processing module combines all the table coordinate information, the picture coordinate information, the Chinese/English text information with its coordinates and the formula text information with its coordinates to finally obtain the structured text information.
Examples:
As shown in fig. 1, the present invention comprises the following steps:
(1) Table detection and recognition:
First, the structured information of the table is extracted and the coordinates of every intersection point of the table are found; the text inside the table is then extracted. The U-Net network used here is divided into an encoding part and a decoding part: the encoding part adopts the YOLOv3 model and the decoding part a reversed YOLOv3 model, the difference being that the last convolution layer outputs 2 channels. The YOLO network is an object detection network whose third generation adds Bounding Box clustering on top of the first two generations, which happens to suit the aspect ratios of target boxes in straight line detection, so detecting horizontal and vertical lines in the image through this network gives a good recognition effect. The line detection problem thus becomes a two-class problem: two features are output for each pixel, the confidence of a horizontal line and the confidence of a vertical line; the whole image is binarized according to these confidences with 0.5 as the threshold, and finally the connected regions are computed.
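Turning the per-pixel confidences into line masks with the 0.5 threshold named above might look like this minimal sketch, assuming the network output is a NumPy array of shape (2, H, W) holding horizontal- and vertical-line confidences.

```python
import numpy as np

def line_masks(confidence, thresh=0.5):
    horiz = (confidence[0] > thresh).astype(np.uint8) * 255  # horizontal lines
    vert = (confidence[1] > thresh).astype(np.uint8) * 255   # vertical lines
    return horiz, vert  # each mask then goes to connected-component analysis
```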
The U-Net model was trained on manually labeled data: 2000 samples in total, 1700 for training and 300 for testing, each sample consisting of an original image and the binary image of its straight lines. Training ran for 8 epochs, the loss finally converged to 0.0012, and the DICE score on the test set was 0.98.
(2) Inset picture detection:
Images of arbitrary scale are input and inset picture detection is performed with the CenterNet network, which outputs the coordinates of each picture's center point together with its width and height. The training data are manually labeled picture positions (x, y, width, height) and the original text images, 10000 samples in total containing 34210 pictures; training ran for 10 epochs, the loss finally converged to 0.013, and the accuracy on the test set was 0.98.
(3) Text line detection and segmentation:
The table positions and picture positions are removed from the image, the text lines are found by dilation and connected component analysis, the text is then segmented according to its curvature, and all text boxes in the image are output; the text boxes are merged according to the nature of formulas and all merged text boxes are output. A merged two-line text box is marked as formula text; the remaining single-line text is put into CRNN for testing, formula judgment is performed on the obtained result, and the coordinate information of the single-line formula text boxes is output.
(4) Formula recognition:
WYGIWYS is used for formula text recognition with the model and method of the paper; VGG16 is used for the feature extraction module, whose network structure extracts text features well. The training data set is IM2LATEX-100K with a training/testing ratio of 7:3; after 10 epochs of training the loss finally converged to 0.012, and the accuracy on the test set was 0.9. In this system, a formula picture is input into the network, and the LaTeX structure sequence of the formula text is output and stored in the formula list.
(5) Chinese and English text recognition:
Chinese and English in the text are recognized with the CRNN network structure; VGG16 is used for the feature extraction CNN and the other modules are consistent with the paper. Training and testing were carried out on printed text data, about 1,000,000 samples in total. All images in the data set have a height of 32 and a variable width, each image corresponds to 10 text characters, and the ratio of training to test data is 9:1. After 10 epochs the loss finally converged to 0.00012, and the recognition accuracy on the test set was 0.995. The trained CRNN network recognizes the Chinese/English text and outputs the text information together with the Box corresponding to each text, which are stored in the Chinese/English text list.
(6) Post-processing
The text recognition results are combined with the position information of the text boxes to obtain the text line recognition information, and the text lines are sorted from top to bottom by position; the table recognition information is obtained by combining the table position information with the text recognition results, the figures are extracted from the original image using the inset picture positions, and the structured text information is finally obtained.

Claims (10)

1. A structured text recognition method, characterized by the following steps:
step (1), table detection and recognition:
adopting a deep learning model to detect the table: predicting the horizontal and vertical lines in the document picture with the semantic segmentation network U-Net, and extracting the table in the picture from these line segments; then extracting and assembling the text information in the table according to the table segmentation rule, checking the result with the table judgment rule, and finally obtaining the table-removed picture;
step (2), inset picture detection:
besides eliminating the interference of the tables, extracting the position information of each inset picture so that the pictures do not affect the later text detection operation: detecting the inset pictures with the object detection network CenterNet to obtain the position of each picture, i.e. the picture coordinate information;
Step (3), text detection:
adopting a method that searches for recognition regions line by line, taking each text row as a region; an image from which the tables have been removed mainly contains three parts: text lines, figures and formulas; first converting the table-removed picture to a grayscale image, then, with threshold-based inverse binarization, setting the pixel values of the text in the picture to 255 and the background to 0; then dilating the image with a 7 x 7 kernel, computing the connected regions of the dilated image under 8-connectivity, and obtaining the bounding rectangle attributes of the connected regions, which give the approximate text lines in the image, i.e. the approximate text line image; removing underlines from the approximate text line image, and merging the underline-free text line image twice in the horizontal and vertical directions according to the text box attributes to obtain the final text boxes;
step (4), text segmentation and merging:
first obtaining, for each extracted text box, the trajectory of the midpoints inside the text region, and splitting the text box along this midpoint trajectory, so that curved text is segmented and the text boxes fit the text tightly; then merging the segmented text in the vertical direction according to the characteristics of formula text with an upper/lower structure, solving the formula detection problem;
Step (5), text recognition:
first splitting the text lines containing a two-line formula into upper and lower text boxes that enclose a middle horizontal line, applying a horizontal merging operation to the text boxes obtained in step (4) to keep each formula consistent, finding the box corresponding to the middle horizontal line from the width and height attributes of the text box images and deleting it, completing the conversion of one two-line formula into two single-line formulas, and then marking and storing the positions of these text boxes; for separating a single-line formula from a single line of text, using another approach: first recognizing the text, then, using the Chinese/English recognition model, traversing and segmenting the single-line formula text row according to the recognition result, judging each character of the recognition result in turn to find the positions of Chinese characters, digits and characters that are neither Chinese nor English, judging from this position information the non-Chinese regions between Chinese characters, excluding single variables shorter than 2 characters, collecting and storing all formula positions, and finally feeding all formula texts and Chinese/English texts into the WYGIWYS model and the CRNN model respectively for recognition;
step (6), post-processing:
combining all the table coordinate information, the picture coordinate information, the Chinese/English text information with its coordinates and the formula text information with its coordinates to finally obtain the structured text information.
2. The method of claim 1, wherein the segmentation rule and the table judgment rule of step (1) are as follows:
first testing every pair of a horizontal line and a vertical line for intersection to obtain an m x n matrix, where m is the number of horizontal lines and n is the number of vertical lines, and performing table structure analysis on this matrix, in which 1 denotes an intersection and 0 denotes no intersection, so that the intersection coordinates can be computed for every 1 in the matrix; at the same time, labeling the cells of the table according to the matrix and storing the cell Box information;
after the table is segmented it is validated, and two conditions must hold for a table: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the leftmost and rightmost line segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the topmost and bottommost segments must be approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
3. A structured text recognition method as recited in claim 2, wherein the dilation operation of step (3) is specifically as follows:
the dilation uses the dilate method in OpenCV; its purpose is to thicken the fonts so that a text line whose strokes are not connected becomes connected, which facilitates the subsequent Box extraction; the first dilation uses a 7 x 7 kernel and the second a 15 x 1 kernel.
4. A structured text recognition method according to claim 3, wherein the rule for removing underlines from text lines in step (3) is specifically as follows:
first obtaining the length and width of each approximate text line from the bounding rectangle attributes, and traversing all bounding rectangles with these values to obtain the approximate average text line height MedianHeight; selecting the bounding rectangles whose height is smaller than 0.1 x MedianHeight as target rectangles, detecting the underline segments inside the target rectangles with LSD line detection, setting the pixels of the image where these segments lie to 0 to obtain a line-removed image, and then applying the inverse binarization, dilation and connected component steps again to obtain the text lines with the line segments removed.
5. The method of claim 4, wherein the merging in step (3) is specifically as follows:
two rounds of horizontal and vertical merging are carried out, each round consisting of a horizontal merge followed by a vertical merge; within one text row, a single piece of text may be extracted from the connected regions as two segments because punctuation splits it, so the segments are merged into one row according to the coordinate characteristics of the Boxes, while blurred fonts can cause a single row of text to be recognized as two rows, which calls for vertical merging; for the horizontal merge, the text boxes are first sorted in ascending order along the X axis and the sorted bounding rectangles are merged pairwise: if the gap between the maximum X value of the preceding rectangle and the minimum X value of the following rectangle is smaller than 0.5 x MedianHeight, the two rectangles are merged, and so on; the boxes are then merged once in the vertical direction: the text boxes are sorted in ascending order along the Y axis and the sorted bounding rectangles are merged pairwise; if the gap between the maximum Y value of the preceding rectangle and the minimum Y value of the following rectangle is smaller than 0.3 x MedianHeight, the two rectangles are merged, and so on, finally giving the primary target rotated rectangles; for text lines containing formulas, the second vertical merge removes duplicated boxes; after this processing, some text boxes are still missed and small characters inside formulas are detected separately, so the boxes in the same row must be judged for merging once more in the horizontal direction: the angles of the primary target rotated rectangles are computed, and if the distance between two primary target rotated rectangles is smaller than 0.3 x MedianHeight and their angle deviation is within 5 degrees, the two rectangles are merged; finally, a judgment is made in the vertical direction: if the IOU of the bounding rectangles of two texts is larger than 0.2 of the area of the smaller bounding rectangle, they are merged, giving the final text boxes.
6. The method of claim 5, wherein the trajectory of points in the text region in step (4):
the text boxes finally extracted in step (3) are inverse-binarized, dilated and decomposed into connected regions to find the contour corresponding to the text line; from the contour coordinates, the midpoint of the ordinates corresponding to each abscissa is computed, giving a set of trajectory coordinates of the center of the text region; the trajectory line is divided evenly along the abscissa into a set number of segments, the divided curve is fitted with least squares, and the error between the fitted curve and the true values is examined: if the error is larger than a set threshold, the point is set as a split point; otherwise the next point is analyzed.
7. The method of claim 6, wherein the vertical-direction merging rule of step (4) is as follows:
detecting the lower bound of the latter Box and the upper bound of the former Box; if the difference between them is smaller than one third of the standard text-box height, the two Boxes are merged vertically, with the requirement that the rotation angle of the rectangle after merging is smaller than the rotation angle of the rectangles before merging.
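For illustration only: the gap test of this rule, reading it as a vertical-gap check between two axis-aligned boxes with Y growing downward; the rotation-angle requirement is left out of the sketch.

```python
def should_merge_vertically(upper_box, lower_box, std_height):
    """upper_box / lower_box: (x_min, y_min, x_max, y_max), Y grows downward.
    Merge when the vertical gap between the two boxes is below one third
    of the standard text-box height."""
    gap = lower_box[1] - upper_box[3]   # top of the latter minus bottom of the former
    return gap < std_height / 3.0
```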
8. The method of claim 7, wherein the segmentation rule of the double-row text in step (5) is as follows:
Firstly, the text box is segmented and its content is further segmented by dilating connected regions; after segmentation, one merging operation in the horizontal direction is carried out, and then the crossing-line interference in the middle is removed, yielding the segmentation result of the double-row formula.
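For illustration only: one way to separate the two rows once the connected regions have been extracted, assuming each region is an (x, y, w, h) box. Grouping by the vertical centre is an assumption for the sketch, not the wording of the claim.

```python
def split_double_row(components):
    """components: (x, y, w, h) connected-region boxes inside a double-row
    formula text box. Group them into an upper row and a lower row."""
    if not components:
        return [], []
    centres = [y + h / 2.0 for (_, y, _, h) in components]
    cut = (min(centres) + max(centres)) / 2.0      # midline between the two rows
    upper = [b for b, c in zip(components, centres) if c < cut]
    lower = [b for b, c in zip(components, centres) if c >= cut]
    return upper, lower
```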
9. The method of claim 8, wherein the traversal segmentation rule for text lines in step (5) is as follows:
firstly, analysing the text obtained by CTC decoding in the Chinese-English recognition model, obtaining the position at which each character first appears and marking it; judging each character and connecting the sequence numbers of the Chinese characters only, so as to obtain the start and end sequence numbers of the Chinese region; dividing the width of the text image by the sequence length of the text to obtain the image width corresponding to each character, and multiplying it by the sequence numbers of the Chinese region to obtain the rough position CNBox of the Chinese region in the image; in the processing of non-formula text-line images, computing the IOU between each Box and CNBox and, if the IOU is larger than 0, merging them into a new CNBox; then sending each new CNBox into the CRNN network for recognition, shrinking one end of the new CNBox by one character position at a time until no further character appears, finally obtaining the Chinese region; the remaining positions are English or formula regions, so that the English, the formulas and their position information within the single-row text line are finally obtained.
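For illustration only: a sketch of mapping runs of Chinese characters in the CTC-decoded string to rough x ranges (the CNBox positions), assuming each character occupies an equal share of the image width. The function name and the Unicode-range test are assumptions; the IOU merging and CRNN re-recognition steps are not shown.

```python
def chinese_regions(decoded_text, image_width):
    """Map runs of Chinese characters in the CTC-decoded string to rough
    (x_start, x_end) ranges in the text-line image, assuming each character
    occupies image_width / len(decoded_text) pixels."""
    char_width = image_width / max(len(decoded_text), 1)
    regions, start = [], None
    for i, ch in enumerate(decoded_text):
        is_cn = '\u4e00' <= ch <= '\u9fff'      # CJK Unified Ideographs block
        if is_cn and start is None:
            start = i                           # a Chinese run begins
        elif not is_cn and start is not None:
            regions.append((start * char_width, i * char_width))
            start = None
    if start is not None:                       # run extends to the end of the line
        regions.append((start * char_width, len(decoded_text) * char_width))
    return regions
```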
10. A structured text recognition system, characterized by comprising a form detection and recognition module, a figure detection module, a text segmentation and merging module, a text recognition module and a post-processing module;
the form detection and recognition module uses a deep-learning model to detect tables: the semantic segmentation network U-Net predicts the horizontal and vertical lines in the document image, and the tables in the image are extracted from these line segments; the text information inside them is then extracted and assembled according to the table segmentation rule and judged by the table judgment rule, finally producing a table-removed image;
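For illustration only: the U-Net itself is not shown here; this sketch only post-processes the line mask it would output, turning connected groups of predicted table lines into table bounding boxes. The closing-kernel size and the minimum-area filter are illustrative assumptions.

```python
import cv2

def tables_from_line_mask(line_mask, min_area=2000):
    """line_mask: uint8 mask (255 = predicted table line) from the segmentation
    network. Returns (x_min, y_min, x_max, y_max) boxes of table regions."""
    # close small gaps so the lines of one table form a single connected region
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    closed = cv2.morphologyEx(line_mask, cv2.MORPH_CLOSE, kernel)
    n, _, stats, _ = cv2.connectedComponentsWithStats(closed, connectivity=8)
    return [(x, y, x + w, y + h)
            for x, y, w, h, area in stats[1:] if area >= min_area]
```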
the figure detection module detects the figures in the document through the object detection network CenterNet, obtaining the position information of each figure, namely the picture coordinate information;
the text detection module performs text detection by searching for recognition regions line by line: firstly, the table-removed image is converted into a greyscale image; then, by threshold inverse binarization, the pixel values of the text parts of the image are set to 255 and the background to 0; the image is then dilated with a 7×7 kernel, the connected regions of the dilated image are obtained under the 8-connectivity requirement, and the circumscribed-rectangle attributes of the connected regions are computed, giving the approximate text lines in the image, namely the approximate text-line image; underlines are removed from the approximate text-line image, and the de-underlined text-line image undergoes two horizontal and vertical merging operations according to the text-box attributes to obtain the final text boxes;
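For illustration only: a sketch of this detection pipeline following the steps named in the claim (inverse binarization, 7×7 dilation, 8-connected regions, circumscribed rectangles). The use of Otsu thresholding is an assumption; the claim only specifies threshold inverse binarization.

```python
import cv2

def detect_text_lines(detabled_gray):
    """Approximate text-line detection on a greyscale, table-removed image."""
    # Otsu is an assumed choice of threshold; text becomes 255, background 0
    _, binary = cv2.threshold(detabled_gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (7, 7))
    dilated = cv2.dilate(binary, kernel)       # 7x7 dilation joins characters into lines
    n, _, stats, _ = cv2.connectedComponentsWithStats(dilated, connectivity=8)
    return [(x, y, w, h) for x, y, w, h, _ in stats[1:]]   # skip the background label
```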
the text segmentation and merging module segments and merges the text boxes detected by the text detection module: for each extracted text box it first obtains the trajectory of the points in the text region and segments the text box according to the midpoint trajectory of the text, completing the segmentation of curved text; the segmented texts are then merged in the vertical direction according to the characteristics of formula texts with an upper-lower structure;
the text recognition module uses a WYGIWYS model to recognize formula text and a CRNN model to recognize Chinese and English text;
the post-processing module combines all the table coordinate information, the picture coordinate information, the Chinese and English text information with its coordinates, and the formula text information with its coordinates, finally obtaining the structured text information.
CN202110720402.9A 2021-06-28 2021-06-28 Structured text recognition method and system Active CN113537227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110720402.9A 2021-06-28 2021-06-28 Structured text recognition method and system

Publications (2)

Publication Number Publication Date
CN113537227A (en) 2021-10-22
CN113537227B (en) 2024-02-02

Family

ID=78126005

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant