CN113537227A - Structured text recognition method and system

Structured text recognition method and system

Info

Publication number: CN113537227A
Authority: CN (China)
Prior art keywords: text, line, image, box, formula
Legal status: Granted (the listed status is an assumption, not a legal conclusion)
Application number: CN202110720402.9A
Other languages: Chinese (zh)
Other versions: CN113537227B (English)
Inventors: 张彦光 (Zhang Yanguang), 高飞 (Gao Fei)
Current Assignee: Hangzhou Dianzi University
Original Assignee: Hangzhou Dianzi University
Application filed by Hangzhou Dianzi University
Priority: CN202110720402.9A
Publication of CN113537227A; application granted; publication of CN113537227B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a structured text recognition method and system that not only recognize single lines of text but can also perform multi-level text analysis and structured text combination. The invention provides targeted solutions for the difficult points of text detection and recognition. For curved text, a scheme is proposed that segments the text according to the trajectory of its midpoints, splitting on the text slope, so that curved text can be segmented. For the difficulty of segmenting formulas during text recognition, a merging and segmentation strategy for formulas is proposed. For detecting tables in images, a method is proposed that segments and recognizes tables by detecting horizontal and vertical lines.

Description

Structured text recognition method and system
Technical Field
The invention belongs to the field of natural language processing and particularly relates to a structured text recognition method and system. Based on images of text data, the invention establishes a structured text recognition method that goes from images to characters, formulas, and the like, and involves a table detection algorithm, a picture detection method, a text detection algorithm, a text segmentation algorithm, a text merging algorithm, and a text recognition algorithm.
Background
Text recognition mainly refers to extracting characters, formulas, and images from scanned or photographed images of text material by means of a recognition algorithm. Text recognition has a wide range of applications, not only in education and medicine but also in the actual product development of large enterprises. An excellent text recognition model is characterized by high recognition speed, a low misrecognition rate, stable recognition, and ease of use.
In recent years, with the rapid development of information automation, text recognition algorithms have advanced greatly. Traditional text recognition algorithms generally start from image processing: the image is preprocessed by binarization, image enhancement, tilt correction, and the like, and the text information is then extracted through layout analysis, image segmentation, and character recognition. Deep learning approaches, after layout analysis, use text detection networks such as DB-Net for text detection, followed by character recognition methods. Different detection and recognition algorithms perform differently on different samples, so how to combine them and how to design them are the key points of text recognition technology.
In actual production, text recognition technology is most widely applied in the education field. For example, common photo-based question-search apps recognize photos with OCR technology and then query a database for matches, so that similar question types can be retrieved and analyzed. Text recognition can also intelligently identify personal information such as names and student numbers on test papers, assisting teachers in grading; it can even judge examinees' answers and score them automatically, saving a great deal of correction time.
Intelligent text recognition systems impose stricter requirements on the text recognition algorithm, and the design of a structured text recognition algorithm mainly involves the following difficulties:
(1) In mathematics, large numbers of stacked formulas interfere strongly with text detection, and some formulas occupy two lines of space. If left unprocessed, a single line of formula is likely to be recognized as two lines, so the text structure may be recognized incorrectly. Formulas occupying two lines therefore need to be merged so that the order of the text is not affected.
(2) For the text detection part, the detected text box must contain only single-line text information; if multiple lines are detected, recognition errors result. Photographed text additionally involves recognizing curved text and recognizing formulas separately from ordinary characters.
(3) Tables in an image need to be recognized separately from characters, so the table must be detected first. If this is done by deep learning, an ordinary object detection network cannot make full use of the spatial characteristics of the table and offers no help for the later extraction of characters, so a suitable table detection network must be found; the text inside the table can then be recognized by image methods. The reference paper for the picture detection model is "CenterNet: Keypoint Triplets for Object Detection", hereinafter CenterNet, a center-point-based object detection network. Unlike traditional object detection networks, CenterNet is trained with standard supervision and obtains its result with a single forward pass, so it needs none of the post-processing required by traditional detectors, which guarantees the speed of image detection in the invention.
(4) The Chinese-English recognition model refers to the paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition", hereinafter CRNN. The network structure comprises three parts: a feature extraction part (CNN), a sequence prediction part (RNN), and a transcription part (CTC). The picture is first scaled to a height of 32 pixels; the CNN then performs feature extraction to obtain a 512 × 1 × w feature map; feature vectors extracted from the feature map are fed into a bidirectional LSTM network for training, yielding a posterior probability matrix; the result is encoded and then decoded with CTC to obtain the text information.
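The shape flow just described can be made concrete with a short sketch. The following is a minimal PyTorch rendition, not the network used in the cited paper or the patent: the exact convolution/pooling stack is an illustrative assumption, chosen only so that a 32-pixel-high input collapses to the stated 512 × 1 × w feature map before the bidirectional LSTM and CTC stages.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # CNN backbone: collapses the 32-pixel height to 1 while keeping width as the sequence axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),       # 32 x W -> 16 x W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),     # -> 8 x W/4
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),  # -> 4 x W/4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)),  # -> 1 x W/4
        )
        self.rnn = nn.LSTM(512, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)   # per-timestep class scores for CTC

    def forward(self, x):                       # x: (batch, 1, 32, width)
        f = self.cnn(x)                         # (batch, 512, 1, w)
        f = f.squeeze(2).permute(2, 0, 1)       # (w, batch, 512): one feature vector per column
        seq, _ = self.rnn(f)                    # sequence features per timestep
        return self.fc(seq).log_softmax(2)      # posterior matrix, decoded with CTC

# Training would pair this with nn.CTCLoss; greedy CTC decoding collapses repeats and blanks.
```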
(5) The formula recognition model refers to the paper "What You Get Is What You See: A Visual Markup Decompiler", hereinafter WYGIWYS. Unlike the recognition of unstructured text, the formula recognition module must not only recognize the characters but also work out the relative positions between characters, the character sizes, and the LaTeX mathematical symbols in the formula. The formula recognition network likewise performs feature extraction with a CNN in the early stage, then encodes the feature map row by row with an RNN, and processes the encoding with a visual attention mechanism to obtain the output. The visual attention mechanism drives the decoding: the decoder is also an RNN, the context vectors produced during encoding are successively fed into the decoding RNN, a fully connected layer then outputs one-hot vectors, and finally a LaTeX sequence of the structured text is output and combined into the required formula text information.
Disclosure of Invention
The invention provides a structured text recognition method and system for texts containing structured parts such as formulas and figures; the method can be applied to text recognition in general disciplines and performs particularly well on mathematics.
A structured text recognition method comprises the following steps:
Step (1), table detection and recognition:
Tables are detected with a deep learning model: the semantic segmentation network U-Net predicts the horizontal and vertical lines in the document picture, and these line segments are used to extract the tables in the picture. The text information is then extracted and synthesized according to the table segmentation rule and checked against the table judgment rule, finally yielding a table-removed picture.
Step (2), figure detection:
Besides eliminating the interference of tables in the preceding operation, the position information of the figures must also be extracted so that later text detection is not affected by pictures. Figure detection is realized with the object detection network CenterNet, yielding the figure position information, i.e. the picture coordinate information.
Step (3), text detection:
A method that takes each text line as the recognition area is adopted. An image from which the tables have been removed mainly contains three parts: text lines, figures, and formulas. The table-removed picture is first converted to a gray-scale image, and threshold-based inverse binarization then sets the pixel value of the text in the picture to 255 and the background to 0. The image is then dilated with a 7 × 7 kernel, the connected regions of the dilated image are computed with 8-connectivity, and the circumscribed-rectangle attributes of the connected regions are obtained, giving the approximate text lines in the image, i.e. an approximate text-line image. Underlines are removed from the approximate text-line image, and two rounds of horizontal and vertical merging are applied to the underline-free text-line image according to the text-box attributes, giving the final text boxes.
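As a concrete illustration, the pipeline up to the approximate text-line image might look as follows in OpenCV. Otsu thresholding for the inverse binarization and the helper name are assumptions; the 7 × 7 kernel and 8-connectivity follow the description above.

```python
import cv2
import numpy as np

def approximate_text_lines(detabled_bgr: np.ndarray) -> list:
    gray = cv2.cvtColor(detabled_bgr, cv2.COLOR_BGR2GRAY)
    # Inverse binarization: text -> 255, background -> 0.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate with a 7 x 7 kernel so broken strokes of one line become connected.
    dilated = cv2.dilate(binary, np.ones((7, 7), np.uint8))
    # 8-connected components; each component's bounding box is a rough text line.
    n, _, stats, _ = cv2.connectedComponentsWithStats(dilated, connectivity=8)
    # stats rows are (x, y, w, h, area); row 0 is the background.
    return [tuple(stats[i, :4]) for i in range(1, n)]
```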
Step (4), text segmentation and merging:
The trajectory of the midpoints of the text region is first computed for each extracted text box, and the box is segmented according to this midpoint trajectory, completing the segmentation of curved text and guaranteeing that the boxes fit the text. The segmented texts are then merged in the vertical direction according to the characteristics of formula texts with an upper-lower structure, solving the formula detection problem.
Step (5), text recognition:
The text lines finally obtained in step (4) are first divided: a two-line formula containing a middle horizontal line is split into an upper and a lower text box; a horizontal merging operation on the text line ensures the continuity of the formula; the text box corresponding to the horizontal line is found through the width and height attributes of the text-box image and deleted, completing the conversion of one two-line formula into two single-line formulas; the text-box positions are then labeled and stored. For the segmentation of single-line formulas and single-line text another method is used: the text is first labeled and recognized with an English recognition model, and the single-line formula text line is traversed and segmented according to the recognition result. Each character of the recognition result is judged in traversal order to find the positions of Chinese characters, digits, and characters that are neither Chinese nor English; the non-Chinese regions between Chinese character positions are then judged from this position information, single variables of length less than 2 are excluded, and all formula positions are counted and stored. Finally, all formula texts and all Chinese/English texts are fed into the WYGIWYS model and the CRNN model, respectively, for recognition.
Step (6), post-processing:
All table coordinate information, picture coordinate information, Chinese/English text information with its coordinates, and formula text information with its coordinates are combined, finally yielding the structured text information.
The segmentation rule and the table judgment rule in step (1):
According to the segmentation rule, it is first judged whether each pair of horizontal and vertical lines intersects, giving an m × n matrix, where m is the number of horizontal lines and n is the number of vertical lines. Table structure analysis is performed on this matrix, in which 1 denotes intersection and 0 denotes no intersection, so for each 1 in the matrix the corresponding intersection coordinates can be calculated; at the same time the cells of the table are marked according to the matrix and the cell Box information is stored.
After the table is segmented it is validated. A region is confirmed as a table under two conditions: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the left and right boundary segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the top and bottom segments approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
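A sketch of these two rules follows. The line representation (axis-aligned segments with sorted endpoints: horizontal as (x1, y, x2, y), vertical as (x, y1, x, y2)) and the tolerance values are assumptions for illustration only.

```python
import numpy as np

def intersection_matrix(h_lines, v_lines, tol=3):
    """M[i, j] = 1 if horizontal line i crosses vertical line j."""
    m = np.zeros((len(h_lines), len(v_lines)), dtype=np.uint8)
    for i, (hx1, hy, hx2, _) in enumerate(h_lines):
        for j, (vx, vy1, _, vy2) in enumerate(v_lines):
            if hx1 - tol <= vx <= hx2 + tol and vy1 - tol <= hy <= vy2 + tol:
                m[i, j] = 1  # the intersection point is (vx, hy)
    return m

def looks_like_table(h_lines, v_lines, rel_tol=0.1):
    # Condition 1: more than three horizontal and more than three vertical lines.
    if len(h_lines) <= 3 or len(v_lines) <= 3:
        return False
    xs = [x for (x1, _, x2, _) in h_lines for x in (x1, x2)]
    ys = [y for (_, y1, _, y2) in v_lines for y in (y1, y2)]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    left, right = min(v[0] for v in v_lines), max(v[0] for v in v_lines)
    top, bottom = min(h[1] for h in h_lines), max(h[1] for h in h_lines)
    # Condition 2: the boundary spacing matches the overall extent of the segments.
    return (abs((right - left) - width) < rel_tol * width
            and abs((bottom - top) - height) < rel_tol * height)
```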
The dilation operation in step (3) is specifically as follows:
Dilation uses the dilate method in OpenCV; its purpose is to thicken the strokes so that disconnected pieces of a text line become connected, facilitating subsequent Box extraction. The first dilation uses a 7 × 7 kernel and the second a 15 × 1 kernel.
The rule for removing underlines from the text lines in step (3) is specifically as follows:
The length and width of each approximate text line are first obtained from the circumscribed-rectangle attributes, and from them an approximate average text-line height MedianHeight. All circumscribed rectangles are traversed, and those with height less than 0.1 × MedianHeight are screened out as target rectangles. The edge line segments inside each target rectangle are obtained with LSD (Line Segment Detector) line detection, and the pixels of these segments in the image are set to 0, giving an image without the line segments; inverse binarization, dilation, and connected-domain extraction are then applied again to obtain the text lines without underlines.
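The following sketch shows one way to implement this rule, assuming OpenCV's cv2.createLineSegmentDetector is available (it is absent from some OpenCV builds for licensing reasons, in which case cv2.HoughLinesP is a common substitute). The 0.1 × MedianHeight screen follows the text; everything else is illustrative.

```python
import cv2
import numpy as np

def remove_underlines(binary: np.ndarray, boxes: list) -> np.ndarray:
    """binary: inverse-binarized page; boxes: (x, y, w, h) circumscribed rectangles."""
    heights = sorted(h for (_, _, _, h) in boxes)
    median_height = heights[len(heights) // 2]
    lsd = cv2.createLineSegmentDetector()
    cleaned = binary.copy()
    for (x, y, w, h) in boxes:
        if h >= 0.1 * median_height:
            continue  # only very flat rectangles are underline candidates
        roi = cleaned[y:y + h, x:x + w]  # a view: edits write through to `cleaned`
        lines, _, _, _ = lsd.detect(roi)
        if lines is None:
            continue
        for (x1, y1, x2, y2) in lines.reshape(-1, 4):
            # Set the segment's pixels to 0, erasing the underline.
            cv2.line(roi, (int(x1), int(y1)), (int(x2), int(y2)), 0, thickness=max(1, h))
    return cleaned
```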
The merging in step (3) is specifically as follows:
Two rounds of horizontal merging and vertical merging are performed in total. In the first horizontal-and-vertical round: within a single text line, a piece of text split by punctuation is extracted as two pieces during connected-region extraction, so the pieces are merged into one line according to the coordinate characteristics of their Boxes; a single line may also be recognized as two lines because of blurred strokes, so vertical merging is applied as well. The text boxes are then merged once from the vertical direction: they are sorted from small to large along the Y axis and the sorted circumscribed rectangles are merged pairwise; if the gap between the maximum Y of the previous rectangle and the minimum Y of the next one is less than 0.3 × MedianHeight, the two rectangles are merged, and so on, giving the first-round target rotated rectangles. The purpose of the second vertical merge is to remove duplicate boxes on text lines containing formulas. The boxes produced so far can still miss small characters inside formulas, so one more horizontal merge judgment is needed for boxes in the same column: the angle of the first-round target rotated rectangles is computed, and if the distance between two target rotated rectangles is within 0.3 × MedianHeight and the angle deviation is within 5 degrees, they are merged. Finally a vertical judgment is made: if the IOU of the circumscribed rectangles of two texts' target rotated rectangles is greater than 0.2 of the area of the smaller circumscribed rectangle, they are merged, giving the final text boxes.
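A sketch of the pairwise vertical merge with the 0.3 × MedianHeight gap test follows, assuming axis-aligned boxes (x, y, w, h); the rotated-rectangle and IOU stages would build on the same pattern.

```python
def merge_vertically(boxes, median_height):
    boxes = sorted(boxes, key=lambda b: b[1])  # sort by top edge along the Y axis
    merged = []
    for box in boxes:
        if merged:
            px, py, pw, ph = merged[-1]
            x, y, w, h = box
            # Gap between the previous box's bottom and this box's top.
            if y - (py + ph) < 0.3 * median_height:
                nx, ny = min(px, x), min(py, y)
                nx2, ny2 = max(px + pw, x + w), max(py + ph, y + h)
                merged[-1] = (nx, ny, nx2 - nx, ny2 - ny)
                continue
        merged.append(box)
    return merged
```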
The midpoint trajectory of the text region in step (4):
The text boxes finally extracted in step (3) undergo inverse binarization, dilation, and connected-region extraction to find the contour of the text line. From the contour coordinates, the midpoint of the vertical extent at every horizontal coordinate is computed, giving the trajectory coordinates of the text-region center. The trajectory is divided evenly along the horizontal axis into a set number of segments, each segment is fitted with least squares, and the error between the fitted curve and the true value is judged: if the error exceeds a set threshold, the point is taken as a segmentation point; otherwise the next point is analyzed.
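The segmented least-squares fit can be sketched as below, assuming a straight-line fit per segment (np.polyfit of degree 1); the error threshold is an illustrative assumption, while the 4 to 5 segments follow the text.

```python
import numpy as np

def split_points(xs: np.ndarray, mid_ys: np.ndarray, n_segments: int = 4, err_thresh: float = 2.0):
    """xs, mid_ys: the midpoint trajectory; returns x coordinates where the box should be cut."""
    cuts = []
    for seg_x, seg_y in zip(np.array_split(xs, n_segments), np.array_split(mid_ys, n_segments)):
        k, b = np.polyfit(seg_x, seg_y, 1)      # least-squares line for this segment
        errors = np.abs(k * seg_x + b - seg_y)  # residual at every trajectory point
        for x, err in zip(seg_x, errors):
            if err > err_thresh:
                cuts.append(int(x))  # curvature too strong: cut the box here
                break                # then move on to the next segment
    return cuts
```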
The segmentation rule in step (4):
Segmentation is performed according to the abscissa of the midpoint trajectory of the text region.
The vertical merging rule in step (4):
The lower bound of the next Box and the upper bound of the previous Box are compared; if the difference is less than one third of the standard text-box height, the two boxes are merged vertically, with the requirement that the rotation angle of the merged rectangle be smaller than that of the rectangles before merging.
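For rotated boxes, this acceptance test might be sketched with cv2.minAreaRect as follows; taking the "standard height" to be MedianHeight is an assumption, and the angle comparison glosses over OpenCV's version-dependent angle conventions.

```python
import cv2
import numpy as np

def should_merge(prev_pts: np.ndarray, next_pts: np.ndarray, median_height: float) -> bool:
    """prev_pts, next_pts: float32 point sets of shape (N, 2) for the two text regions."""
    prev_rect, next_rect = cv2.minAreaRect(prev_pts), cv2.minAreaRect(next_pts)
    prev_bottom = max(p[1] for p in cv2.boxPoints(prev_rect))
    next_top = min(p[1] for p in cv2.boxPoints(next_rect))
    if next_top - prev_bottom >= median_height / 3:
        return False  # vertical gap too large
    merged = cv2.minAreaRect(np.vstack([prev_pts, next_pts]))
    # Accept only if merging does not increase the rotation angle.
    return abs(merged[2]) <= max(abs(prev_rect[2]), abs(next_rect[2]))
```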
The segmentation rule for two-line text in step (5) is as follows:
The text box is first segmented, and the content inside it is further segmented by dilating and extracting connected regions; after segmentation, one horizontal merge is applied to the content, and the interference of the middle horizontal line is then removed, giving the two-line formula segmentation result.
The traversal segmentation in step (5), i.e. the segmentation rule for traversing a text line, is as follows:
The CTC output of the Chinese-English recognition model is first decoded to obtain the text information, which is parsed to find and label the position where each character first appears. Each character is judged, and only the indices of Chinese characters are linked together, giving the start and end indices of a Chinese region. Dividing the width of the text image by the sequence length of the text gives the image width corresponding to each text character; multiplying this width by the indices of the Chinese region gives the approximate position CNBox of the Chinese region in the image. An IOU is then computed between each character-position Box from the non-formula text-line image processing and the CNBox; if the IOU is greater than 0 the boxes are merged into a new CNBox, which is sent into the CRNN network for recognition. If English characters appear in the recognized text, the new CNBox is shrunk by one character position at the end containing the English character, repeating until no English characters appear, which finally yields a Chinese region; the remaining positions are English or formula regions, finally giving the formula and English position information within the single-line formula text line.
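The mapping from the decoded CTC string onto the image width can be sketched as follows. The Unicode range test for Chinese characters and all names are illustrative; the per-character width and the shrink-by-one-character loop follow the text.

```python
import re

def chinese_region(decoded: str, image_width: int):
    """Approximate CNBox: the horizontal span of the Chinese character indices."""
    char_width = image_width / max(len(decoded), 1)  # image width per text character
    cn_idx = [i for i, ch in enumerate(decoded) if '\u4e00' <= ch <= '\u9fff']
    if not cn_idx:
        return None
    start, end = min(cn_idx), max(cn_idx) + 1
    return (int(start * char_width), int(end * char_width))

def shrink_until_no_english(cnbox, recognize, char_width):
    """Re-recognize the crop; shrink one character at the English end until clean.

    `recognize(x1, x2)` is a stand-in for running the CRNN on that image crop."""
    x1, x2 = cnbox
    while x1 < x2:
        text = recognize(x1, x2)
        m = re.search(r'[A-Za-z]', text)
        if m is None:
            return (x1, x2)  # a pure Chinese region remains
        # Shrink at whichever end the English character is nearer to.
        if m.start() < len(text) / 2:
            x1 += int(char_width)
        else:
            x2 -= int(char_width)
    return None
```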
The invention has the following beneficial effects:
In recent years the application of text recognition has become more and more extensive, and targeted text recognition algorithms more and more diverse. Since subject-specific text recognition algorithms are still relatively scarce on the market, the invention provides a structured text recognition method and system that not only recognize single lines of text but can also perform multi-level text analysis and structured text combination.
The invention also provides targeted solutions for the difficult points of text detection and recognition. For curved text, a scheme is proposed that segments the text according to the trajectory of its midpoints, splitting on the text slope, so that curved text can be segmented. For the difficulty of segmenting formulas during text recognition, a merging and segmentation strategy for formulas is proposed. For detecting tables in images, a method is proposed that segments and recognizes tables by detecting horizontal and vertical lines.
Drawings
FIG. 1 is a schematic view of an embodiment of the present invention;
FIG. 2 is a segmentation schematic of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a merge mode according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a two-line text segmentation process according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
A structured text recognition method comprises the following steps:
Step (1), table detection and recognition:
Ordinary document pictures contain, besides characters, many tables and pictures. If text detection is run directly on the document picture, the text inside pictures and tables will be misrecognized and the recovered layout heavily affected, so charts are taken out separately before text recognition to avoid interfering with the detection of the subsequent text lines. Tables have a dedicated recognition procedure, while pictures need no recognition at all; only their exact positions must be found. Table detection is realized with a deep learning model: the semantic segmentation network U-Net predicts the horizontal and vertical lines in the document picture, and these line segments are used to extract the tables. The text information is then extracted and synthesized according to the table segmentation rule and checked against the table judgment rule, finally yielding a table-removed picture.
Step (2), figure detection:
Besides eliminating the interference of tables in the preceding operation, the position information of the figures must also be extracted so that later text detection is not affected by pictures. Figure detection is realized with the object detection network CenterNet, yielding the figure position information, i.e. the picture coordinate information. Unlike other traditional object detection networks, CenterNet finds the target center point through keypoint estimation and obtains the other attributes of the target by regression, making it simpler, faster, and more accurate.
Step (3), text detection:
Text-line detection is an important part of recognition, and its accuracy directly affects the later recognition. This part adopts a method that takes each text line as the recognition area. An image from which the tables have been removed mainly contains three parts: text lines, figures, and formulas. The table-removed picture is first converted to a gray-scale image, and threshold-based inverse binarization then sets the pixel value of the text in the picture to 255 and the background to 0. The image is then dilated with a 7 × 7 kernel, the connected regions of the dilated image are computed with 8-connectivity, and the circumscribed-rectangle attributes of the connected regions are obtained, giving the approximate text lines in the image, i.e. an approximate text-line image. Underlines are removed from the approximate text-line image, and horizontal and vertical merging is applied to the underline-free text-line image according to the text-box attributes, giving the final text boxes.
Step (4), text segmentation and merging:
For curved text and formula text, the processing of the previous step clearly cannot recognize them accurately: with only a single per-line text box, curved text cannot be framed tightly, which affects the subsequent recognition. To address these two problems, text segmentation and merging begin from the result of step (3). The trajectory of the midpoints of the text region is first computed for each extracted text box, and the box is segmented according to this midpoint trajectory, completing the segmentation of curved text and guaranteeing that the boxes fit the text. The segmented texts are then merged in the vertical direction according to the characteristics of formula texts with an upper-lower structure, solving the formula detection problem.
Step (5), text recognition:
Since recognition must be divided into a Chinese/English text recognition method and a formula text recognition method, the boxes extracted in the previous step need to be further divided before recognition. The text lines finally obtained in step (4) are first divided: a two-line formula containing a middle horizontal line is split into an upper and a lower text box; a horizontal merging operation on the text line ensures the continuity of the formula; the text box corresponding to the horizontal line is found through the width and height attributes of the text-box image and deleted, completing the conversion of one two-line formula into two single-line formulas; the text-box positions are then labeled and stored. For the segmentation of single-line formulas and single-line text another method is used: the text is first labeled and recognized with an English recognition model, and the single-line formula text line is traversed and segmented according to the recognition result. Each character of the recognition result is judged in traversal order to find the positions of Chinese characters, digits, and characters that are neither Chinese nor English; the non-Chinese regions between Chinese character positions are then judged from this position information, single variables of length less than 2 are excluded, and all formula positions are counted and stored. Finally, all formula texts and all Chinese/English texts are fed into the WYGIWYS model and the CRNN model, respectively, for recognition.
Step (6), post-processing:
All table coordinate information, picture coordinate information, Chinese/English text information with its coordinates, and formula text information with its coordinates are combined, finally yielding the structured text information.
The segmentation rule and the table judgment rule in step (1):
According to the segmentation rule, it is first judged whether each pair of horizontal and vertical lines intersects, giving an m × n matrix, where m is the number of horizontal lines and n is the number of vertical lines; table structure analysis is then performed, as shown in fig. 2. In the matrix of fig. 2, 1 denotes intersection and 0 denotes no intersection, so for each 1 in the matrix the corresponding intersection coordinates can be calculated; at the same time the cells of the table are marked according to the matrix and the cell Box information is stored.
After the table is segmented it is validated. A region is confirmed as a table under two conditions: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the left and right boundary segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the top and bottom segments approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
The dilation operation in step (3) is specifically as follows:
Dilation uses the dilate method in OpenCV; its purpose is to thicken the strokes so that disconnected pieces of a text line become connected, facilitating subsequent Box extraction. The first dilation uses a 7 × 7 kernel and the second a 15 × 1 kernel: the first dilation serves to expose interference factors such as edge lines in the image, while the second takes the general morphology of a text field into account, hence the horizontal kernel.
The rule for removing underlines from the text lines in step (3) is specifically as follows:
The length and width of each approximate text line are first obtained from the circumscribed-rectangle attributes, and from them an approximate average text-line height MedianHeight. All circumscribed rectangles are traversed, and those with height less than 0.1 × MedianHeight are screened out as target rectangles. The edge line segments inside each target rectangle are obtained with LSD (Line Segment Detector) line detection, and the pixels of these segments in the image are set to 0, giving an image without the line segments; inverse binarization, dilation, and connected-domain extraction are then applied again to obtain the text lines without underlines.
The merging in step (3) is specifically as follows:
Two rounds of horizontal merging and vertical merging are performed in total. In the first horizontal-and-vertical round: within a single text line, a piece of text split by punctuation is extracted as two pieces during connected-region extraction, so the pieces are merged into one line according to the coordinate characteristics of their Boxes; a single line may also be recognized as two lines because of blurred strokes, so vertical merging is applied as well. The text boxes are then merged once from the vertical direction: they are sorted from small to large along the Y axis and the sorted circumscribed rectangles are merged pairwise; if the gap between the maximum Y of the previous rectangle and the minimum Y of the next one is less than 0.3 × MedianHeight, the two rectangles are merged, and so on, giving the first-round target rotated rectangles. The purpose of the second vertical merge is to remove duplicate boxes on text lines containing formulas. The boxes produced so far can still miss small characters inside formulas, so one more horizontal merge judgment is needed for boxes in the same column: the angle of the first-round target rotated rectangles is computed, and if the distance between two target rotated rectangles is within 0.3 × MedianHeight and the angle deviation is within 5 degrees, they are merged. Finally a vertical judgment is made: if the IOU of the circumscribed rectangles of two texts' target rotated rectangles is greater than 0.2 of the area of the smaller circumscribed rectangle, they are merged, giving the final text boxes.
The midpoint trajectory of the text region in step (4):
The text boxes finally extracted in step (3) undergo inverse binarization, dilation, and connected-region extraction to find the contour of the text line. From the contour coordinates, the midpoint of the vertical extent at every horizontal coordinate is computed, giving the trajectory coordinates of the text-region center. The trajectory is divided evenly along the horizontal axis into a set number of segments, each segment is fitted with least squares, and the error between the fitted curve and the true value is judged: if the error exceeds a set threshold, the point is taken as a segmentation point; otherwise the next point is analyzed.
The set number of segments is 4 to 5.
The segmentation rule in step (4):
Segmentation is performed according to the abscissa of the midpoint trajectory of the text region.
The vertical merging rule in step (4):
The lower bound of the next Box and the upper bound of the previous Box are compared; if the difference is less than one third of the standard text-box height, the two boxes are merged vertically, with the requirement that the rotation angle of the merged rectangle be smaller than that of the rectangles before merging. The specific merge pattern is shown in fig. 3.
The segmentation rule for two-line text in step (5) is as follows:
The text box is first segmented, and the content inside it is further segmented by dilating and extracting connected regions; after segmentation, one horizontal merge is applied to the content, and the interference of the middle horizontal line is then removed, giving the two-line formula segmentation result. The specific flow is shown in fig. 4.
The traversal segmentation in step (5), i.e. the segmentation rule for traversing a text line, is as follows:
The CTC output of the Chinese-English recognition model is first decoded to obtain the text information, which is parsed to find and label the position where each character first appears. Each character is judged, and only the indices of Chinese characters are linked together, giving the start and end indices of a Chinese region. Dividing the width of the text image by the sequence length of the text gives the image width corresponding to each text character; multiplying this width by the indices of the Chinese region gives the approximate position CNBox of the Chinese region in the image. An IOU is then computed between each character-position Box from the non-formula text-line image processing and the CNBox; if the IOU is greater than 0 the boxes are merged into a new CNBox, which is sent into the CRNN network for recognition. If English characters appear in the recognized text, the new CNBox is shrunk by one character position at the end containing the English character, repeating until no English characters appear, which finally yields a Chinese region; the remaining positions are English or formula regions, finally giving the formula and English position information within the single-line formula text line.
A structured text recognition system comprises a table detection and recognition module, a figure detection module, a text detection module, a text segmentation and merging module, a text recognition module, and a post-processing module.
The table detection and recognition module detects tables with a deep learning model: the semantic segmentation network U-Net predicts the horizontal and vertical lines in the document picture, and these line segments are used to extract the tables. The text information is then extracted and synthesized according to the table segmentation rule and checked against the table judgment rule, finally yielding a table-removed picture.
The figure detection module realizes figure detection through the object detection network CenterNet, obtaining the figure position information, i.e. the picture coordinate information.
The text detection module performs text detection with a method that takes each text line as the recognition area. The table-removed picture is first converted to a gray-scale image, and threshold-based inverse binarization then sets the pixel value of the text in the picture to 255 and the background to 0. The image is then dilated with a 7 × 7 kernel, the connected regions of the dilated image are computed with 8-connectivity, and the circumscribed-rectangle attributes of the connected regions are obtained, giving the approximate text lines in the image, i.e. an approximate text-line image. Underlines are removed from the approximate text-line image, and horizontal and vertical merging is applied to the underline-free text-line image according to the text-box attributes, giving the final text boxes.
The text segmentation and merging module segments and merges the text boxes detected by the text detection module: the trajectory of the midpoints of the text region is first computed for each extracted box, and the box is segmented according to this midpoint trajectory, completing the segmentation of curved text; the segmented texts are then merged in the vertical direction according to the characteristics of formula texts with an upper-lower structure.
The text recognition module recognizes formula text with the WYGIWYS model and Chinese/English characters with the CRNN model.
The post-processing module combines all table coordinate information, picture coordinate information, Chinese/English text information with its coordinates, and formula text information with its coordinates, finally yielding the structured text information.
Example:
as shown in fig. 1, the present invention has the following steps:
(1) Table detection and recognition:
Structural information is first extracted from the table and the coordinates of each intersection point of the table are found; the text inside the table is then extracted. The U-Net used here is divided into an encoding part and a decoding part: the encoding part adopts a YOLOv3 model and the decoding part a reversed YOLOv3 model, the difference being that the last convolution layer outputs 2 channels. The YOLO network is an object detection network whose third generation clusters bounding boxes on the basis of the first two generations, which happens to suit the aspect ratio of target boxes in straight-line detection, so detecting horizontal and vertical lines in the image through the YOLOv3 network gives good recognition results. The horizontal/vertical line detection problem is turned into a binary classification problem: each pixel has two feature outputs, the confidence of the horizontal line and the confidence of the vertical line; the whole image is binarized on these confidences with 0.5 as the threshold, and finally the connected regions are computed to obtain the corresponding horizontal and vertical lines.
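Turning the two-channel confidence output into line segments can be sketched as follows, assuming the network output has shape (2, H, W) with channel 0 holding horizontal-line confidence and channel 1 vertical-line confidence; the 0.5 threshold follows the text.

```python
import cv2
import numpy as np

def lines_from_confidence(conf: np.ndarray):
    """conf: (2, H, W) per-pixel confidences; returns (horizontal, vertical) segment boxes."""
    results = []
    for channel in conf:  # one pass per line orientation
        mask = (channel > 0.5).astype(np.uint8) * 255
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        # Each connected region's bounding box approximates one line segment.
        results.append([tuple(stats[i, :4]) for i in range(1, n)])
    horizontal, vertical = results
    return horizontal, vertical
```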
This method uses a U-Net network model with manually labeled training data, 2000 samples in total: 1700 for training and 300 for testing. The data format is an original image paired with the binarized image of the straight lines it contains. Training ran for 8 epochs; the loss finally converged to 0.0012, and the DICE on the test set is 0.98.
(2) Figure detection:
An image of arbitrary scale is input and figure detection is performed with the CenterNet network, outputting the center-point coordinates and the width and height of each figure in the image. The training data are manually labeled figure positions (x, y, width, height) together with the original text images, 10000 samples containing 34210 figures in total. Training ran for 10 epochs; the loss finally converged to 0.013, and the accuracy on the test set is 0.98.
(3) Text line detection and segmentation:
The table and picture positions are removed from the image, text lines are found by dilation and connected-domain extraction, the text is then segmented according to its curvature, and all text boxes in the image are output; the boxes are merged according to the properties of formulas and all merged text boxes are output. Merged two-line text boxes are set as formula text; the other single-line texts are put into the CRNN for testing, formula judgment is performed on the obtained results, and the coordinate information of the single-line formula text boxes is output.
(4) Formula recognition:
WYGIWYS is used: formula text recognition follows the model and method of the paper, with VGG16 as the feature extraction module, a structure that extracts text features well. The training data set is IM2LATEX-100K, with a 7:3 ratio of training to test data. After 10 epochs of training, the loss finally converged to 0.012, and the accuracy on the test data set is 0.9. Here, formula pictures are input into the network, and the LaTeX structure sequence of the formula text is output and stored in a formula list.
(5) Chinese and English text recognition:
Chinese and English recognition uses the CRNN network structure, with VGG16 as the feature-extraction CNN and the other modules consistent with the paper; training and testing are carried out on printed-text data. The collected data total 1,000,000 samples; all images in the data sets have a height of 32 pixels and variable width, and each image corresponds to 10 text characters. The ratio of training to test data here is 9:1. After 10 epochs of training, the loss finally converged to 0.00012, and the recognition accuracy on the test set is 0.995. The trained CRNN network is used here to recognize Chinese and English texts, outputting the text information and the Box corresponding to each text, which are stored in a Chinese/English text list.
(6) Post-processing:
The table position information from step (1), the figure position information from step (2), the text-box information from step (3), and the text recognition results from step (5) are assembled: combining the text recognition results with the text-box positions gives the text-line recognition information, and the text lines are sorted top to bottom by position; combining the table position information with the text recognition results gives the table recognition information; the figure position information is used to crop the figures out of the original image. All of this information is combined to finally obtain the structured text information.

Claims (10)

1. A structured text recognition method is characterized by comprising the following steps:
step (1), table detection and recognition:
detecting tables with a deep learning model: predicting the horizontal and vertical lines in the document picture through the semantic segmentation network U-Net, and extracting the tables in the picture by using these line segments; extracting and synthesizing the text information according to the table segmentation rule, and checking against the table judgment rule, finally obtaining a table-removed picture;
step (2), figure detection:
besides eliminating the interference of tables in the preceding operation, extracting the position information of the figures so that later text detection is not affected by pictures; realizing figure detection through the object detection network CenterNet, obtaining the figure position information, namely the picture coordinate information;
step (3), text detection:
adopting a method that takes each text line as the recognition area; an image from which the tables have been removed mainly contains three parts: text lines, figures, and formulas; firstly converting the table-removed picture into a gray-scale image, and then using threshold-based inverse binarization to set the pixel value of the text in the picture to 255 and the background to 0; then dilating the image with a 7 × 7 kernel, computing the connected regions of the dilated image with 8-connectivity, and obtaining the circumscribed-rectangle attributes of the connected regions, which gives the approximate text lines in the image, namely an approximate text-line image; removing underlines from the approximate text-line image, and performing two rounds of horizontal and vertical merging on the underline-free text-line image according to the text-box attributes, obtaining the final text boxes;
step (4), text segmentation and merging:
firstly computing the trajectory of the midpoints of the text region for each extracted text box, and segmenting the box according to this midpoint trajectory, completing the segmentation of curved text and guaranteeing the fit of the text boxes; then merging the segmented texts in the vertical direction according to the characteristics of formula texts with an upper-lower structure, solving the formula detection problem;
step (5), text recognition:
firstly dividing the text lines finally obtained in step (4): splitting a two-line formula containing a middle horizontal line into an upper and a lower text box; performing a horizontal merging operation on the text line to ensure the continuity of the formula; finding the text box corresponding to the horizontal line through the width and height attributes of the text-box image and deleting it, completing the conversion of one two-line formula into two single-line formulas; then labeling and storing the text-box positions; for the segmentation of single-line formulas and single-line text, using another method: firstly labeling the text, then recognizing it with an English recognition model, and traversing and segmenting the single-line formula text line according to the recognition result; judging each character of the recognition result in traversal order to find the positions of Chinese characters, digits, and characters that are neither Chinese nor English; then judging the non-Chinese regions between Chinese character positions according to this position information, excluding single variables of length less than 2, and counting and storing all formula positions; finally feeding all formula texts and all Chinese/English texts into the WYGIWYS model and the CRNN model respectively for recognition;
step (6), post-processing:
combining all table coordinate information, picture coordinate information, Chinese/English text information and its coordinates, and formula text information and its coordinates, finally obtaining the structured text information.
2. The structured text recognition method according to claim 1, wherein the segmentation rule and the table judgment rule in step (1) are as follows:
according to the segmentation rule, firstly judging whether each pair of horizontal and vertical lines intersects, obtaining an m × n matrix, where m is the number of horizontal lines and n is the number of vertical lines, and performing table structure analysis on this matrix, in which 1 denotes intersection and 0 denotes no intersection, so that for each 1 in the matrix the corresponding intersection coordinates can be calculated; at the same time, marking the cells of the table according to the matrix and storing the cell Box information;
after the table is segmented, validating it; a region is confirmed as a table under two conditions: first, there must be more than three horizontal lines and more than three vertical lines; second, the distance between the left and right boundary segments must be approximately equal to the difference between the maximum and minimum X coordinates of the horizontal segments, and the distance between the top and bottom segments approximately equal to the difference between the maximum and minimum Y coordinates in the vertical direction.
3. The structured text recognition method according to claim 2, wherein the dilation operation in step (3) is specifically as follows:
the dilation adopts the dilate method in OpenCV; its purpose is to thicken the strokes so that disconnected pieces of a text line become connected, facilitating subsequent Box extraction; the first dilation uses a 7 × 7 kernel and the second a 15 × 1 kernel.
4. The structured text recognition method according to claim 3, wherein the rule for removing underlines from the text lines in step (3) is as follows:
firstly obtaining the length and width of each approximate text line from the circumscribed-rectangle attributes, and from them an approximate average text-line height MedianHeight; traversing all circumscribed rectangles and screening those with height less than 0.1 × MedianHeight as target circumscribed rectangles; obtaining the edge line segments inside each target rectangle through LSD (Line Segment Detector) line detection and setting the pixels of these segments in the image to 0, obtaining an image without the line segments; then applying inverse binarization, dilation, and connected-domain extraction again to obtain the text lines without underlines.
5. The structured text recognition method according to claim 4, wherein the merging in step (3) is as follows:
in step (3), two rounds of horizontal merging and vertical merging are performed in total; in the first horizontal-and-vertical round, within a single text line, a piece of text split by punctuation is extracted as two pieces during connected-region extraction, so the pieces are merged into one line according to the coordinate characteristics of their Boxes; a single line may also be recognized as two lines because of blurred strokes, so vertical merging is applied as well; the text boxes are firstly sorted from small to large along the X axis and the sorted circumscribed rectangles are merged pairwise: if the gap between the maximum X of the previous circumscribed rectangle and the minimum X of the next one is less than 0.5 × MedianHeight, the two rectangles are merged, and so on; the text boxes are then merged once from the vertical direction: they are sorted from small to large along the Y axis and the sorted circumscribed rectangles are merged pairwise, and if the gap between the maximum Y of the previous circumscribed rectangle and the minimum Y of the next one is less than 0.3 × MedianHeight, the two rectangles are merged, and so on, obtaining the first-round target rotated rectangles; the purpose of the second vertical merge is to remove duplicate boxes on text lines containing formulas; the boxes produced so far can miss small characters inside formulas, so one more horizontal merge judgment is needed for boxes in the same column: firstly computing the angle of the first-round target rotated rectangles, and merging two target rotated rectangles if their distance is within 0.3 × MedianHeight and the angle deviation is within 5 degrees; finally making a vertical judgment, and merging if the IOU of the circumscribed rectangles of the two texts' target rotated rectangles is greater than 0.2 of the area of the smaller circumscribed rectangle, obtaining the final text boxes.
6. The structured text recognition method according to claim 5, wherein the midpoint trajectory of the text region in step (4):
applying inverse binarization, dilation, and connected-region extraction to the text boxes finally extracted in step (3) to find the contour of the text line; computing from the contour coordinates the midpoint of the vertical extent at every horizontal coordinate, obtaining the trajectory coordinates of the text-region center; dividing the trajectory evenly along the horizontal axis into a set number of segments, fitting each segment with least squares, and judging the error between the fitted curve and the true value: if the error exceeds a set threshold, taking the point as a segmentation point; otherwise analyzing the next point.
7. The structured text recognition method according to claim 6, wherein the vertical merging rule in step (4):
comparing the lower bound of the next Box with the upper bound of the previous Box; if the difference is less than one third of the standard text-box height, merging the two boxes vertically, with the requirement that the rotation angle of the merged rectangle be smaller than that of the rectangles before merging.
8. The method of claim 7, wherein the segmentation rule of the two-line text in step (5) is as follows:
firstly the text box is segmented, and the content inside the text box is further segmented by dilating it and extracting connected regions; after segmentation, one horizontal merge is performed on the content, and the interference of the middle horizontal line is then removed, yielding the double-line formula segmentation result.
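A minimal sketch of this double-line splitting, assuming a grayscale crop of the text box; the Otsu threshold, 3×3 kernel and the long-thin-component test used to drop the middle horizontal line are illustrative choices, and the subsequent horizontal merge pass could reuse the merge_boxes sketch after claim 5.

```python
import numpy as np
import cv2

# Sketch of double-line formula splitting; all thresholds are illustrative.
def split_two_line_formula(box_img):
    """box_img: grayscale crop of a double-line formula text box. Returns the
    component bounding boxes of the upper and lower lines, with the middle
    horizontal line removed."""
    _, binary = cv2.threshold(box_img, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    dilated = cv2.dilate(binary, np.ones((3, 3), np.uint8))
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(dilated, connectivity=8)
    boxes = []
    for x, y, w, h, _ in stats[1:]:                  # stats[0] is the background
        if w > 8 * h and w > 0.6 * box_img.shape[1]:
            continue                                 # long thin run: the middle line
        boxes.append((x, y, x + w, y + h))
    mid_y = box_img.shape[0] / 2
    upper = [b for b in boxes if (b[1] + b[3]) / 2 < mid_y]
    lower = [b for b in boxes if (b[1] + b[3]) / 2 >= mid_y]
    return upper, lower
```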
9. The method according to claim 8, wherein the traversal segmentation in step (5) is as follows:
firstly, the CTC output of the Chinese-English recognition model is decoded to obtain the text information; the text information is analyzed to obtain and label the position at which each character first appears; each character is judged, and only the sequence numbers of Chinese characters are connected, yielding the start and end sequence numbers of the Chinese region. The width of the text image is divided by the sequence length of the text information to obtain the image width corresponding to each character, and multiplying this width by the sequence numbers of the Chinese region gives the approximate position information CNBox of the Chinese region in the image. Then, for each character position Box obtained in the non-formula text line image processing, the IOU between the Box and the CNBox is computed; if the IOU is greater than 0, the Box and the CNBox are merged into a new CNBox. The new CNBox is then fed into the CRNN network for recognition; if English characters exist in the CNBox, the CNBox is shrunk by one character position at the end containing the English character, until no English characters appear, finally yielding the Chinese region. The remaining positions are English or formula regions, so the position information of the formula and the English text within the single-line formula text line is finally obtained.
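A minimal sketch of the first step, locating the approximate Chinese region from the decoded sequence; the Unicode-range test and the uniform width-per-slot mapping follow the claim, while the names and box format are illustrative.

```python
# Sketch of locating the Chinese region (CNBox) from a decoded CTC sequence;
# names and the (x_start, x_end) box format are assumptions for illustration.
def chinese_region_box(decoded, image_width):
    """decoded: the CTC-decoded character sequence of one text line.
    Returns (x_start, x_end) of the approximate Chinese region, or None
    when the line contains no Chinese characters."""
    def is_chinese(ch):
        return '\u4e00' <= ch <= '\u9fff'        # CJK Unified Ideographs
    idxs = [i for i, ch in enumerate(decoded) if is_chinese(ch)]
    if not idxs:
        return None
    per_char = image_width / len(decoded)        # image width per sequence slot
    return (idxs[0] * per_char, (idxs[-1] + 1) * per_char)

# Example: in "y=速度x" rendered in a 120-px-wide line image, the Chinese
# region spans sequence slots 2-3:
print(chinese_region_box("y=速度x", 120))         # -> (48.0, 96.0)
```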
10. A structured text recognition system, characterized by comprising a table detection and recognition module, a picture detection module, a text detection module, a text segmentation and combination module, a text recognition module and a post-processing module;
the table detection and recognition module adopts a deep learning model to detect tables: the horizontal and vertical lines in the document picture are predicted by the semantic segmentation network U-Net, and the tables in the picture are extracted using these line segments; text information is extracted and synthesized according to the table segmentation rule and judged according to the table judgment rule, finally yielding a table-removed picture;
the picture detection module detects pictures through the target detection network CenterNet, obtaining the position information of the pictures, namely the picture coordinate information;
the text detection module performs text detection by taking each text line as the recognition region: firstly, the table-removed picture is converted into a grayscale image, and threshold inverse binarization sets the pixel values of the text portions in the picture to 255 and the background to 0; the image is then dilated with a 7×7 kernel, the connected regions of the dilated image are extracted under 8-connectivity, and the circumscribed-rectangle attributes of the connected regions are computed, giving the approximate text lines in the image, i.e. the approximate text line images; underlines are removed from the approximate text line images, and two rounds of horizontal and vertical merging are performed on the de-underlined text line images according to the text box attributes, yielding the final text boxes;
the text segmentation and combination module segments and combines the text boxes detected by the text detection module: firstly, the trajectory of the points in the text region is obtained for each extracted text box, and the text box is segmented according to the trajectory of the text midpoints, completing the segmentation of curved text; the segmented texts are then merged in the vertical direction, combining them according to the characteristics of formula texts with an upper-lower structure;
the text recognition module adopts the WYGIWYS model and the CRNN model to recognize formula texts and Chinese and English characters respectively;
the post-processing module combines all the table coordinate information, the picture coordinate information, the Chinese and English text information and coordinate information, and the formula text information and coordinate information, finally obtaining the structured text information (the module composition is sketched below).
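To make the division of labour in claim 10 concrete, here is a minimal sketch of how the modules might be composed; every class and method name is hypothetical, invented for illustration, and only the data flow follows the claim.

```python
# Hypothetical composition of the modules in claim 10; all identifiers are
# illustrative, only the data flow follows the claim text.
class StructuredTextPipeline:
    def __init__(self, table_module, picture_module, detect_module,
                 segment_module, recognize_module, postprocess_module):
        self.table_module = table_module            # U-Net line prediction + table rules
        self.picture_module = picture_module        # CenterNet picture detection
        self.detect_module = detect_module          # line-wise text detection
        self.segment_module = segment_module        # trajectory split + vertical merge
        self.recognize_module = recognize_module    # WYGIWYS (formula) + CRNN (text)
        self.postprocess_module = postprocess_module

    def run(self, document_image):
        tables, table_free = self.table_module.extract(document_image)
        pictures = self.picture_module.detect(table_free)
        raw_boxes = self.detect_module.detect(table_free)
        text_boxes = self.segment_module.process(raw_boxes, table_free)
        texts = self.recognize_module.recognize(text_boxes, table_free)
        return self.postprocess_module.combine(tables, pictures, texts)
```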
CN202110720402.9A 2021-06-28 2021-06-28 Structured text recognition method and system Active CN113537227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110720402.9A CN113537227B (en) 2021-06-28 2021-06-28 Structured text recognition method and system

Publications (2)

Publication Number Publication Date
CN113537227A true CN113537227A (en) 2021-10-22
CN113537227B CN113537227B (en) 2024-02-02

Family

ID=78126005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110720402.9A Active CN113537227B (en) 2021-06-28 2021-06-28 Structured text recognition method and system

Country Status (1)

Country Link
CN (1) CN113537227B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020192391A1 (en) * 2019-03-22 2020-10-01 Tencent Technology (Shenzhen) Co., Ltd. OCR-based image conversion method and apparatus, device and readable storage medium
CN112836650A (en) * 2021-02-05 2021-05-25 Guangzhou Power Supply Bureau of Guangdong Power Grid Co., Ltd. Semantic analysis method and system for quality inspection report scanning image table

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN115909369A (en) * 2023-02-15 2023-04-04 南京信息工程大学 Method and system for extracting binary slice image of Chinese character font

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant