CN113221649B - Method for solving wired table identification and analysis - Google Patents

Publication number
CN113221649B
Authority
CN
China
Prior art keywords
picture
pictures
screening
outer contour
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110377638.7A
Other languages
Chinese (zh)
Other versions
CN113221649A (en)
Inventor
郭仲穗
张锦
杨帆
张贝贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110377638.7A
Publication of CN113221649A
Application granted
Publication of CN113221649B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a method for identifying and analyzing wired tables, implemented in the following steps. Step 1: convert all PDF files into a picture set and then screen it in two parts. First, search for outer contour frames to screen out the pictures in the set that may contain tables or flow charts; second, screen out the pictures in the set that contain characters and frame the characters with rectangular boxes. Step 2: locate and output all screened objects with purpose-designed functions. The invention solves the prior-art problems that tables in PDF files are difficult to screen and locate and that valid tables are difficult to output.

Description

Method for solving wired table identification and analysis
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method for solving wired table identification and analysis.
Background
With the development of internet technology, artificial intelligence in particular has come to play an important part in people's lives. Image recognition is an important field of artificial intelligence; it underlies practical technologies such as stereoscopic vision, motion analysis and data fusion, and has important application value in navigation, map and terrain registration, natural resource analysis, weather forecasting, environmental monitoring, physiological lesion research and many other fields. Image target recognition is already mature for features such as pedestrians, license plates and human faces. Text research is another important field of artificial intelligence; it builds on existing text and covers word sense conversion, word frequency statistics and the like. Image-to-text conversion has also developed considerably.
Prior research on processing tabular information in PDF text has concentrated on pure text or on highly uniform, simple image analysis, for example the mature API services offered by large companies such as Baidu and Microsoft, and these tools still operate on pictures. Meanwhile, tables of many different shapes and colors are widely used in communication, scientific research and data analysis, and most of these tables and their contents are stored in PDF files. Some users need this content, but the existing APIs, which work on picture formats, cannot be called directly on a PDF. Existing PDF browsing software has dedicated code for locating tables quickly, but it cannot find a table that has been pasted into the PDF as a picture. Asking users to search several hundred pages of PDF for the target table themselves adds considerable difficulty and does not meet practical needs. This project therefore locates the tables in a PDF, then screens and outputs the valid tables. All experimental data in the project come from PDF files published on the official China bidding network.
Disclosure of Invention
The invention aims to provide a method for identifying and analyzing wired tables, which solves the prior-art problems that tables in PDF files are difficult to screen and locate and that valid tables are difficult to output.
The technical scheme adopted by the invention is a method for identifying and analyzing wired tables, implemented in the following steps:
Step 1, convert all PDF files into a picture set and then screen it. Screening has two parts: first, search for outer contour frames to screen out the pictures in the set that may contain tables or flow charts; second, screen out the pictures in the set that contain characters and frame the characters with rectangular boxes;
Step 2, locate and output all screened objects with purpose-designed functions.
The invention is further characterized as follows.
Step 1 is specifically as follows:
Step 1.1, picture conversion of the files to be detected: the input files to be detected are denoted A, and all files A are converted into a picture set B;
Step 1.2, binarize picture set B using graying and adaptive thresholding (adaptiveThreshold()), converting the pictures in picture set B obtained in step 1.1 into a new binarized picture set C;
Step 1.3, apply to the pictures in the binarized picture set C obtained in step 1.2 the screening formula
Figure BDA0003011858200000031
where n ≥ 1, S is the result set after all pictures are screened, Si is the screening result for whether the picture with serial number i in picture set C contains an outer contour frame, and i is the picture serial number in picture set C. The screening result is obtained through
Figure BDA0003011858200000032
Using morphological dilation and erosion, each picture in the binarized picture set C obtained in step 1.2 is converted into one picture containing only horizontal lines and one containing only vertical lines, and the two are superimposed; contour finding from the morphological method is then used to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information;
Step 1.4, apply to the binarized picture set C obtained in step 1.2 the screening formula
Figure BDA0003011858200000033
where n ≥ 1, R is the result set after all pictures are screened, and Ri is the screening result for whether the picture with serial number i contains text boxes. The screening result is obtained through
Figure BDA0003011858200000034
Erosion and dilation are applied to the pictures in the binarized picture set C: two opening operations of the morphological method expand the characters in each picture into regular rectangular blocks, the contours formed by these character blocks are found, and the characters are framed with boxes whose coordinate information is returned.
Step 1.3 is specifically as follows:
Step 1.3.1, the applied screening formula is
Figure BDA0003011858200000035
where n ≥ 0, S is the result set after all pictures are screened, and Si is the screening result for whether the picture with serial number i in picture set C contains an outer contour frame;
Step 1.3.2, through
Figure BDA0003011858200000036
erosion is applied to the pictures in picture set C: each picture in the binarized picture set C obtained in step 1.2 is converted into one picture containing only horizontal lines and one containing only vertical lines, the two are superimposed, and contour finding is performed on the superimposed picture to obtain the target picture set S containing outer contour frames after the first screening. The general expression is:
Figure BDA0003011858200000041
Step 1.3.3, contour finding from the morphological method is used to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information.
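The line-extraction idea in step 1.3 can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the patent's code: it uses plain Python lists instead of an image library, treats foreground pixels as 1, and the kernel length `k` and all function names are hypothetical. Erosion with a 1×k window followed by dilation with the same window keeps only horizontal runs at least k pixels long; transposing the picture gives the vertical case, and a pixel-wise OR superimposes the two. In practice, OpenCV's `cv2.erode` and `cv2.dilate` with wide or tall rectangular kernels are a common way to do the same thing.

```python
def erode_h(img, k):
    """Keep a pixel only if the k-wide horizontal window starting at it is all 1."""
    w = len(img[0])
    return [[1 if x + k <= w and all(row[x:x + k]) else 0 for x in range(w)]
            for row in img]


def dilate_h(img, k):
    """Grow every surviving pixel back into a k-wide horizontal run."""
    w = len(img[0])
    out = [[0] * w for _ in img]
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if value:
                for xx in range(x, min(w, x + k)):
                    out[y][xx] = 1
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def horizontal_lines(img, k):
    # Erosion then dilation: only runs of at least k pixels survive.
    return dilate_h(erode_h(img, k), k)


def vertical_lines(img, k):
    # Same operation on the transposed picture, then transposed back.
    return transpose(horizontal_lines(transpose(img), k))


def overlay(a, b):
    """Superimpose two binary pictures with a pixel-wise OR."""
    return [[p | q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# A 7x5 frame with a short two-pixel "text" stroke inside it.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
grid = overlay(horizontal_lines(page, 4), vertical_lines(page, 4))
for row in grid:
    print(row)
```

With `k = 4` the frame lines survive and the short stroke inside the frame is removed, which is the property the screening relies on: text does not produce long straight runs.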
Step 1.4 is specifically as follows:
Step 1.4.1, the applied screening formula is:
Figure BDA0003011858200000042
where n ≥ 1, R is the target picture set containing text boxes after all pictures are screened, and Ri is the screening result for the picture with serial number i;
Step 1.4.2, through
Figure BDA0003011858200000043
erosion and dilation are applied to the pictures in the binarized picture set C: two opening operations of the morphological method expand the characters in each picture into regular rectangular blocks, and the contours formed by these character blocks are found, giving the target picture set R containing text boxes after all pictures are screened. The general expression is:
Figure BDA0003011858200000044
Step 1.4.3, the character blocks in the picture produced by the two opening operations are framed and their coordinate information is returned.
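Step 1.4 can be illustrated with a minimal sketch, assuming binary pictures stored as Python lists with text pixels equal to 1. It is a simplification: only a dilation is used to merge neighboring characters into blocks, and a breadth-first connected-component search stands in for contour finding (with OpenCV one would typically use `cv2.findContours` and `cv2.boundingRect`). The function names and the kernel half-sizes `kw` and `kh` are hypothetical.

```python
from collections import deque


def dilate(img, kw, kh):
    """Spread every foreground pixel kw columns and kh rows in each direction."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - kh), min(h, y + kh + 1)):
                    for xx in range(max(0, x - kw), min(w, x + kw + 1)):
                        out[yy][xx] = 1
    return out


def bounding_boxes(img):
    """Return (x, y, width, height) for each 4-connected foreground component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            xs, ys = [x0], [y0]
            while queue:
                x, y = queue.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
                        xs.append(nx)
                        ys.append(ny)
            boxes.append((min(xs), min(ys),
                          max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes


# Two "words", each made of two separated character pixels.
text = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
raw_boxes = bounding_boxes(text)
block_boxes = bounding_boxes(dilate(text, 1, 0))
print(len(raw_boxes), block_boxes)
```

Before dilation every character is its own component; after a one-pixel horizontal dilation the characters of each word merge into one rectangular block, and each block yields one text box with coordinates. The same `bounding_boxes` helper could equally be applied to the superimposed line picture of step 1.3 to obtain outer contour frames.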
Table positioning in step 2 is specifically as follows:
Step 2.A.1, a logical comparison positioning method is adopted. First, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, and the superimposed picture is checked for outer contour frames. If one exists, the PDF page may contain a table formed by horizontal and vertical lines. The text boxes contained in every outer contour frame are then traversed, and the center point coordinates (x and y values) of each text box rectangle are extracted and returned; the x and y values are restricted so that noise such as headers and footers does not affect the subsequent steps. The center point coordinates of the text boxes inside each outer contour frame are traversed to judge whether the frame contains vertically aligned and parallel text boxes; if so, these text boxes are counted, and when the count meets a set value the outer contour frame is cropped and saved;
Step 2.A.2, the applied positioning formula is:
Figure BDA0003011858200000051
where n ≥ 1, Fi is the table positioning result for the picture with serial number i (1 for success, 0 for failure), and F is the set of table positioning results for all pictures;
Step 2.A.3, the picture is judged through
Figure BDA0003011858200000052
Using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2, all text boxes inside each outer contour frame are traversed to obtain the center point coordinates (x and y values) of each text box rectangle; whether the outer contour frame contains vertically aligned and parallel text boxes is judged and their number is counted, and a table is considered to exist when the number meets the set threshold;
When the binarized picture set C has been fully traversed, the picture set of all required tables is obtained:
Figure BDA0003011858200000053
n ≥ 1.
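One way to read the table rule in steps 2.A.1 to 2.A.3 is sketched below. The source does not define "vertically aligned and parallel" precisely, so this sketch makes an assumption: a text box counts when its center shares a column (similar x) with at least one other box and a row (similar y) with at least one other box. The tolerance `tol`, the threshold `min_aligned` and the function names are hypothetical.

```python
from collections import Counter


def centers(text_boxes):
    """Center point of each (x, y, width, height) text box."""
    return [(x + w / 2, y + h / 2) for x, y, w, h in text_boxes]


def looks_like_table(text_boxes, tol=4, min_aligned=4):
    """Count boxes that share both a column and a row with another box."""
    pts = centers(text_boxes)
    # Bucket the coordinates so that nearly equal centers count as aligned.
    cols = Counter(int(cx // tol) for cx, _ in pts)
    rows = Counter(int(cy // tol) for _, cy in pts)
    aligned = sum(
        1 for cx, cy in pts
        if cols[int(cx // tol)] > 1 and rows[int(cy // tol)] > 1
    )
    return aligned >= min_aligned


# Four cells laid out as a 2x2 grid inside one outer contour frame.
grid_cells = [(10, 10, 20, 8), (50, 10, 20, 8), (10, 30, 20, 8), (50, 30, 20, 8)]
# Three lines of a paragraph: stacked, but with different widths and centers.
paragraph = [(10, 10, 100, 8), (10, 30, 80, 8), (10, 50, 90, 8)]

print(looks_like_table(grid_cells), looks_like_table(paragraph))
```

The grid cells align in both directions and pass the check, while the paragraph lines do not share rows or column centers and are rejected. Simple bucketing can split two nearly equal coordinates that fall on either side of a bucket boundary, so a real implementation would likely compare centers pairwise within a tolerance.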
Flow chart positioning in step 2 is specifically as follows:
Step 2.B.1, a logical comparison positioning method is adopted. First, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, and the superimposed picture is checked for outer contour frames. If outer contour frames exist and their number is larger than a set threshold, the PDF page may contain a flow chart. The text boxes contained in every outer contour frame are then traversed, and the center point coordinates (x and y values) of each text box rectangle are extracted and returned; the x and y values are restricted so that noise such as headers and footers does not affect the subsequent steps. The list storing the x coordinates is then processed: first, a for loop stores each coordinate value and its number of occurrences as a key-value pair in a dictionary; next, the set() function removes duplicate values from the dictionary so that only one copy of each distinct value is kept; finally, the len() function returns the size of the result produced by set(), and when this size meets the set value, the binarized picture C obtained in step 1.2 is cropped and saved;
Step 2.B.2, the applied positioning formula is:
Figure BDA0003011858200000061
where n ≥ 1, Ti is the flow chart positioning result for the picture with serial number i (1 for success, 0 for failure), and T is the set of flow chart positioning results for all pictures;
Step 2.B.3, the picture is judged through
Figure BDA0003011858200000062
Using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2, all text boxes inside the outer contour frames are traversed, the x coordinates and their numbers of occurrences are counted, and finally the number of distinct coordinates is counted; when this number is larger than a set threshold, the picture is considered to contain a flow chart;
When the binarized picture set C has been fully traversed, the picture set of all required flow charts is obtained:
Figure BDA0003011858200000063
n ≥ 1.
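The dictionary, set() and len() sequence described in step 2.B.1 can be written out directly. The sketch below follows that description under two assumptions: the deduplication is applied to the x coordinates themselves (the dictionary keys), as step 2.B.3 suggests, and both thresholds use a strict "larger than" comparison. The thresholds `min_frames` and `min_distinct_x` and the function names are hypothetical.

```python
def count_distinct_x(x_coords):
    """Return (number of distinct x values, occurrence count per x value)."""
    counts = {}
    for x in x_coords:                 # for loop: coordinate -> occurrences
        counts[x] = counts.get(x, 0) + 1
    distinct = set(counts)             # set() keeps one copy of each coordinate
    return len(distinct), counts       # len() gives the number of distinct values


def looks_like_flow_chart(outer_frames, text_centers,
                          min_frames=3, min_distinct_x=2):
    """Apply the two checks: enough outer frames, enough distinct x centers."""
    if len(outer_frames) <= min_frames:
        return False
    distinct, _ = count_distinct_x([cx for cx, _ in text_centers])
    return distinct > min_distinct_x


# Four boxes of a small flow chart and the centers of the text inside them.
frames = [(20, 10, 40, 20), (100, 10, 40, 20), (180, 10, 40, 20), (100, 60, 40, 20)]
text_centers = [(40, 20), (120, 20), (200, 20), (120, 70)]

print(count_distinct_x([40, 40, 120, 200, 120]))
print(looks_like_flow_chart(frames, text_centers))
```

Because a dictionary already holds each key once, `len(counts)` would give the same number as `len(set(counts))`; the extra set() call is kept only to mirror the wording of the description.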
The advantage of the method is that all PDF files are converted into a picture set, which is then screened and positioned. Screening has two parts: first, search for outer contour frames to screen out the pictures that may contain tables or flow charts; second, screen out the pictures that contain characters and frame the characters with rectangular boxes. These two screening passes remove most of the noise, such as the large number of non-table pages and pure-text pages. Finally, purpose-designed functions locate and output all screened tables. With two screening passes and two positioning methods, the method achieves considerable accuracy and practicality.
Drawings
FIG. 1 is the horizontal erosion-dilation picture of a table;
FIG. 2 is the vertical erosion-dilation picture of a table;
FIG. 3 is the superimposed horizontal and vertical erosion-dilation picture of a table;
FIG. 4 is the outer contour frame picture of a table;
FIG. 5 is the text box picture of a table page;
FIG. 6 is the superimposed picture of the outer contour frame and the text boxes of a table;
FIG. 7 is the horizontal erosion-dilation picture of a flow chart;
FIG. 8 is the vertical erosion-dilation picture of a flow chart;
FIG. 9 is the superimposed horizontal and vertical erosion-dilation picture of a flow chart;
FIG. 10 is the outer contour frame picture of a flow chart;
FIG. 11 is the text box picture of a flow chart page;
FIG. 12 is the superimposed picture of the outer contour frames and the text boxes of a flow chart.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a method for identifying and analyzing wired tables, implemented in the following steps:
Step 1, convert all PDF files into a picture set and then screen it. Screening has two parts: first, search for outer contour frames to screen out the pictures in the set that may contain tables or flow charts, as shown in FIG. 4 and FIG. 10; second, screen out the pictures in the set that contain characters and frame the characters with rectangular boxes, as shown in FIG. 5 and FIG. 11;
Step 1 is specifically as follows:
Step 1.1, picture conversion of the files to be detected: the input files to be detected are denoted A, and all files A are converted into a picture set B;
Step 1.2, binarize picture set B using graying and adaptive thresholding (adaptiveThreshold()), converting the pictures in picture set B obtained in step 1.1 into a new binarized picture set C;
Step 1.2 is specifically as follows:
The picture set B is first converted to grayscale, and a threshold is then found. The binarization threshold is not fixed across the picture; at each pixel position it is determined by the distribution of the pixels in the surrounding neighborhood. The threshold is usually higher in brighter image regions and adaptively lower in darker regions, so local regions with different brightness, contrast and texture each get their own local binarization threshold.
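The local-threshold behavior described above can be illustrated with a small mean-based sketch. The source names adaptiveThreshold(), which is presumably OpenCV's `cv2.adaptiveThreshold`; the version below is a dependency-free stand-in written for illustration only. It assumes a grayscale picture stored as lists of 0 to 255 values, marks dark "ink" pixels as 1, and the block size and the constant `c` are hypothetical choices.

```python
def adaptive_threshold(gray, block=3, c=25):
    """Mark a pixel as ink (1) if it is darker than its local mean minus c."""
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighborhood around (x, y), clipped at the picture border.
            window = [gray[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            local_threshold = sum(window) / len(window) - c
            out[y][x] = 1 if gray[y][x] < local_threshold else 0
    return out


# Left half: bright background (200) with a stroke of 150.
# Right half: darker background (140) with a stroke of 60.
page = [
    [200, 200, 200, 140, 140, 140],
    [200, 150, 200, 140,  60, 140],
    [200, 200, 200, 140, 140, 140],
]
binary = adaptive_threshold(page)
for row in binary:
    print(row)
```

In this example the left stroke (150) is brighter than the right background (140), so no single global threshold could mark both strokes without also marking the right background. The per-pixel threshold handles both halves, because the threshold follows the local brightness.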
Step 1.3, apply to the pictures in the binarized picture set C obtained in step 1.2 the screening formula
Figure BDA0003011858200000081
where n ≥ 1, S is the result set after all pictures are screened, Si is the screening result for whether the picture with serial number i in picture set C contains an outer contour frame, and i is the picture serial number in picture set C. The screening result is obtained through
Figure BDA0003011858200000082
Using morphological dilation and erosion, each picture in the binarized picture set C obtained in step 1.2 is converted into one picture containing only horizontal lines and one containing only vertical lines, as in FIG. 1, FIG. 2, FIG. 7 and FIG. 8, and the two are superimposed, as in FIG. 3 and FIG. 9; contour finding from the morphological method is then used to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information, as in FIG. 4 and FIG. 10;
Step 1.3 is specifically as follows:
Step 1.3.1, the applied screening formula is
Figure BDA0003011858200000083
where n ≥ 0, S is the target picture set containing outer contour frames after all pictures are screened, and Si is the screening result for the picture with serial number i;
Step 1.3.2, through
Figure BDA0003011858200000084
erosion is applied to the pictures in picture set C: each picture in the binarized picture set C obtained in step 1.2 is converted into one picture containing only horizontal lines and one containing only vertical lines, as in FIG. 1, FIG. 2, FIG. 7 and FIG. 8, the two are superimposed, as in FIG. 3 and FIG. 9, and contour finding is performed on the superimposed picture to obtain the target picture set S containing outer contour frames after the first screening. The general expression is:
Figure BDA0003011858200000091
Step 1.3.3, contour finding from the morphological method is used to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information, as shown in FIG. 4 and FIG. 10;
Step 1.4, apply to the binarized picture set C obtained in step 1.2 the screening formula
Figure BDA0003011858200000092
where n ≥ 1, R is the result set after all pictures are screened, and Ri is the screening result for whether the picture with serial number i contains text boxes. The screening result is obtained through
Figure BDA0003011858200000093
Erosion and dilation are applied to the pictures in the binarized picture set C: two opening operations of the morphological method expand the characters in each picture into regular rectangular blocks, the contours formed by these character blocks are found, and the characters are framed with boxes whose coordinate information is returned, as in FIG. 5 and FIG. 11.
Step 1.4 is specifically as follows:
Step 1.4.1, the applied screening formula is:
Figure BDA0003011858200000094
where n ≥ 1, R is the target picture set containing text boxes after all pictures are screened, and Ri is the screening result for the picture with serial number i;
Step 1.4.2, through
Figure BDA0003011858200000095
erosion and dilation are applied to the pictures in the binarized picture set C: two opening operations of the morphological method expand the characters in each picture into regular rectangular blocks, and the contours formed by these character blocks are found, giving the target picture set R containing text boxes after all pictures are screened. The general expression is:
Figure BDA0003011858200000096
Step 1.4.3, the character blocks in the picture produced by the two opening operations are framed and their coordinate information is returned, as in FIG. 5 and FIG. 11;
Step 2, locate and output all screened objects with purpose-designed functions.
Table positioning in step 2 is specifically as follows:
Step 2.A.1, a logical comparison positioning method is adopted. First, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, as shown in FIG. 4, FIG. 5 and FIG. 6 in turn, and the superimposed picture is checked for outer contour frames. If one exists, the PDF page may contain a table formed by horizontal and vertical lines. The text boxes contained in every outer contour frame are then traversed, and the center point coordinates (x and y values) of each text box rectangle are extracted and returned; the x and y values are restricted so that noise such as headers and footers does not affect the subsequent steps. The center point coordinates of the text boxes inside each outer contour frame are traversed to judge whether the frame contains vertically aligned and parallel text boxes; if so, these text boxes are counted, and when the count meets a set value the outer contour frame is cropped and saved;
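The traversal in step 2.A.1 needs two filters before any counting: text boxes whose centers lie in the header or footer region are discarded, and only text boxes whose centers fall inside a given outer contour frame are kept. A minimal sketch follows, assuming boxes are (x, y, width, height) tuples in page pixel coordinates; the margin ratio and the function names are hypothetical, since the source only says that the x and y values are limited.

```python
def center(box):
    """Center point of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def text_boxes_in_frame(frame, text_boxes, page_width, page_height,
                        margin_ratio=0.08):
    """Keep text boxes centered inside the frame and away from the page edges."""
    fx, fy, fw, fh = frame
    left, right = page_width * margin_ratio, page_width * (1 - margin_ratio)
    top, bottom = page_height * margin_ratio, page_height * (1 - margin_ratio)
    kept = []
    for box in text_boxes:
        cx, cy = center(box)
        # Limit x and y: drop headers, footers and anything in the side margins.
        if not (left <= cx <= right and top <= cy <= bottom):
            continue
        # Keep only boxes whose center lies inside the outer contour frame.
        if fx <= cx <= fx + fw and fy <= cy <= fy + fh:
            kept.append(box)
    return kept


frame = (100, 200, 400, 300)          # outer contour frame on a 600x1000 page
boxes = [
    (250, 20, 100, 20),               # page header
    (150, 250, 60, 20),               # cell text inside the frame
    (150, 700, 60, 20),               # body text below the frame
]
inside = text_boxes_in_frame(frame, boxes, page_width=600, page_height=1000)
print(inside)
```

Only the cell text survives both filters; the header is removed by the margin limit and the body text by the frame test. The surviving boxes are what the alignment count in the rest of step 2.A.1 would operate on.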
Step 2.A.2, the applied positioning formula is:
Figure BDA0003011858200000101
where n ≥ 1, Fi is the table positioning result for the picture with serial number i (1 for success, 0 for failure), and F is the set of table positioning results for all pictures;
Step 2.A.3, the picture is judged through
Figure BDA0003011858200000102
Using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2, all text boxes inside each outer contour frame are traversed to obtain the center point coordinates (x and y values) of each text box rectangle; whether the outer contour frame contains vertically aligned and parallel text boxes is judged and their number is counted, and a table is considered to exist when the number meets the set threshold.
When the binarized picture set C has been fully traversed, the picture set of all required tables is obtained:
Figure BDA0003011858200000103
n ≥ 1.
Flow chart positioning in step 2 is specifically as follows:
Step 2.B.1, a logical comparison positioning method is adopted. First, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, as shown in FIG. 10, FIG. 11 and FIG. 12 in turn, and the superimposed picture is checked for outer contour frames. If outer contour frames exist and their number is larger than a set threshold, the PDF page may contain a flow chart. The text boxes contained in every outer contour frame are then traversed, and the center point coordinates (x and y values) of each text box rectangle are extracted and returned; the x and y values are restricted so that noise such as headers and footers does not affect the subsequent steps. The list storing the x coordinates is then processed: first, a for loop stores each coordinate value and its number of occurrences as a key-value pair in a dictionary; next, the set() function removes duplicate values from the dictionary so that only one copy of each distinct value is kept; finally, the len() function returns the size of the result produced by set(), and when this size meets the set value, the binarized picture C obtained in step 1.2 is cropped and saved;
Step 2.B.2, the applied positioning formula is:
Figure BDA0003011858200000111
where n ≥ 1, Ti is the flow chart positioning result for the picture with serial number i (1 for success, 0 for failure), and T is the set of flow chart positioning results for all pictures;
Step 2.B.3, the picture is judged through
Figure BDA0003011858200000112
Using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2, all text boxes inside the outer contour frames are traversed, the x coordinates and their numbers of occurrences are counted, and finally the number of distinct coordinates is counted; when this number is larger than a set threshold, the picture is considered to contain a flow chart.
When the binarized picture set C has been fully traversed, the picture set of all required flow charts is obtained:
Figure BDA0003011858200000113
n ≥ 1.

Claims (2)

1. A method for identifying and analyzing wired tables, characterized by comprising the following steps:
Step 1, converting all PDF files into a picture set and then screening it, the screening having two parts: first, searching for outer contour frames to screen out the pictures in the set that may contain tables or flow charts; second, screening out the pictures in the set that contain characters and framing the characters with rectangular boxes;
Step 1 is specifically as follows:
Step 1.1, picture conversion of the files to be detected: the input files to be detected are denoted A, and all files A are converted into a picture set B;
Step 1.2, binarizing picture set B using graying and adaptive thresholding (adaptiveThreshold()), converting the pictures in picture set B obtained in step 1.1 into a new binarized picture set C;
Step 1.3, applying to the pictures in the binarized picture set C obtained in step 1.2 the screening formula
Figure FDA0004037061480000011
where S is the result set after all pictures are screened, Si is the screening result for whether the picture with serial number i in picture set C contains an outer contour frame, and i is the picture serial number in picture set C; the screening result is obtained through
Figure FDA0004037061480000012
Using morphological dilation and erosion, each picture in the binarized picture set C obtained in step 1.2 is converted into one picture containing only horizontal lines and one containing only vertical lines, and the two are superimposed; contour finding from the morphological method is then used to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information;
the step 1.3 is specifically as follows:
step 1.3.1, the applied screening formula is
Figure FDA0004037061480000021
S represents a result set after all pictures are screened, and Si represents a screening result of an outer contour frame contained in the pictures in the picture set C with the serial number i;
step 1.3.2, by
Figure FDA0004037061480000022
Adopting a corrosion method for the pictures in the picture set C, respectively converting the pictures in the binarization picture set C obtained in the step 1.2 into pictures with full horizontal lines and full vertical lines, then obtaining the pictures with the full horizontal lines and the full vertical lines after superposition, and carrying out contour discovery on the superposed pictures to obtain a target picture set S containing the outer contour frame after the first screening, wherein the general expression is as follows:
Figure FDA0004037061480000023
step 1.3.3, using contour finding, a morphological method, to frame the outer contour of the superimposed horizontal and vertical lines and return its coordinate information;
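Steps 1.3.1 to 1.3.3 can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the picture is assumed to be a small 0/1 grid, the line extraction is written as a morphological opening with a line-shaped structuring element, and contour finding is replaced by the bounding boxes of 4-connected components. The names `open_with_rect`, `superimpose`, `bounding_boxes`, `outer_contour_frames` and the parameter `min_len` are hypothetical.

```python
from collections import deque


def open_with_rect(img, kh, kw):
    """Morphological opening (erosion, then dilation) with a kh x kw rectangle.

    A pixel survives only if it lies in some kh x kw window made entirely of
    foreground pixels, so short strokes such as characters are removed.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            if all(img[r + dr][c + dc] for dr in range(kh) for dc in range(kw)):
                for dr in range(kh):
                    for dc in range(kw):
                        out[r + dr][c + dc] = 1
    return out


def superimpose(a, b):
    """Pixel-wise OR of two binary pictures of the same size."""
    return [[p | q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def bounding_boxes(img):
    """Bounding box (x, y, width, height) of each 4-connected component.

    Assumption: this stands in for contour finding plus a bounding rectangle.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((c0, r0, c1 - c0 + 1, r1 - r0 + 1))
    return boxes


def outer_contour_frames(img, min_len):
    """Keep horizontal and vertical runs of at least min_len pixels, then box them."""
    horizontal = open_with_rect(img, 1, min_len)  # picture of only horizontal lines
    vertical = open_with_rect(img, min_len, 1)    # picture of only vertical lines
    return bounding_boxes(superimpose(horizontal, vertical))
```

Calling `outer_contour_frames(page, min_len=4)` on a grid that holds a ruled rectangle and a few short character strokes returns only the rectangle's box, because the strokes are shorter than `min_len` in both directions.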
step 1.4, applying to the binarized picture set C obtained in step 1.2 the screening formula
Figure FDA0004037061480000024
Figure FDA0004037061480000025
to screen the pictures, where R denotes the result set after all pictures are screened, and Ri denotes the screening result for the character frames contained in the picture with serial number i; the screening result is obtained by
Figure FDA0004037061480000026
that is, erosion, dilation and two opening operations of the morphological method are used to expand the characters in the pictures of the binarized picture set C into regular rectangular blocks, the contours formed by the character blocks are found, and the characters are framed and their coordinate information is returned;
step 1.4 is specifically as follows:
step 1.4.1, the applied screening formula is:
Figure FDA0004037061480000027
where R denotes the target picture set containing the character frames after all pictures are screened, and Ri denotes the screening result for the picture with serial number i;
step 1.4.2, according to
Figure FDA0004037061480000031
using erosion, dilation and two opening operations of the morphological method to expand the characters in the pictures of the binarized picture set C into regular rectangular blocks, and finding the contours formed by the character blocks, so as to obtain the target picture set R containing the character frames after all pictures are screened, the expression being:
Figure FDA0004037061480000032
step 1.4.3, framing the character blocks in the picture after the two opening operations and returning their coordinate information;
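The idea behind steps 1.4.1 to 1.4.3, merging neighbouring character strokes into rectangular blocks and boxing each block, can be sketched as follows. This is a simplified illustration under stated assumptions: a single dilation replaces the erosion, dilation and opening sequence of the claim, the picture is a 0/1 grid, and boxes come from 4-connected components rather than from contour finding. The names `dilate` and `character_frames` and the kernel sizes are hypothetical.

```python
def dilate(img, kh, kw):
    """Binary dilation with a kh x kw rectangle centred on each pixel (odd sizes)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c]:
                for y in range(max(0, r - kh // 2), min(h, r + kh // 2 + 1)):
                    for x in range(max(0, c - kw // 2), min(w, c + kw // 2 + 1)):
                        out[y][x] = 1
    return out


def character_frames(img, kh=1, kw=3):
    """Return (x, y, width, height) for each merged character block.

    Note: the boxes describe the dilated blocks, so they are slightly larger
    than the original strokes.
    """
    blocks = dilate(img, kh, kw)
    h, w = len(blocks), len(blocks[0])
    seen = set()
    frames = []
    for r in range(h):
        for c in range(w):
            if not blocks[r][c] or (r, c) in seen:
                continue
            seen.add((r, c))
            stack = [(r, c)]
            ys, xs = [], []
            while stack:
                y, x = stack.pop()
                ys.append(y)
                xs.append(x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and blocks[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            frames.append((min(xs), min(ys),
                           max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return frames
```

A wider kernel merges strokes that lie further apart, so `kw` decides whether neighbouring words end up in one character frame or in separate ones.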
step 2, locating and outputting each of the screened objects by means of the designed functions;
the table in step 2 is specifically located as follows:
step 2.A.1, adopting a logical-comparison positioning method: first, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, and it is determined whether the superimposed picture contains an outer contour frame; if it does, a table formed by horizontal and vertical lines may exist in the PDF page; the character frames contained in all outer contour frames are then traversed, and the center-point coordinates x and y of each character-frame rectangle are extracted and returned, the x and y values being restricted so as to eliminate the influence of noise such as headers and footers on the subsequent steps; the center-point coordinates of each character frame inside an outer contour frame are traversed to determine whether the outer contour frame contains character frames arranged vertically and in parallel; if so, these character frames are counted, and the outer contour frame is cropped and saved when the count meets a set value;
step 2.A.2, the applied positioning formula is as follows:
Figure FDA0004037061480000033
where Fi denotes the positioning result for the table in the picture with serial number i (1 for success, 0 for failure), and F denotes the set of table positioning results for all pictures;
step 2.A.3, judging each picture according to
Figure FDA0004037061480000041
using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2: all character frames inside the outer contour frame are traversed to obtain the center-point coordinates x and y of each character-frame rectangle, it is determined whether the outer contour frame contains character frames arranged vertically and in parallel, these character frames are counted, and a table is considered to exist when the count meets the set threshold;
after the binarized picture set C has been traversed, the picture set of all the required tables is obtained:
Figure FDA0004037061480000042
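The logical comparison of steps 2.A.1 to 2.A.3 can be sketched in a few lines of Python. This is one possible reading, not the claimed formula: "vertical and parallel" is interpreted here as a character frame sharing a column (similar x) with one other frame and a row (similar y) with another, and the header and footer restriction is modelled as a simple y range. The function name `locate_table` and the parameters `min_cells`, `tol`, `y_min` and `y_max` are hypothetical.

```python
def locate_table(frame, char_frames, min_cells=4, tol=2.0,
                 y_min=0.0, y_max=float("inf")):
    """Return 1 if the outer contour frame looks like a table, else 0.

    frame and char_frames are (x, y, width, height) boxes.
    """
    fx, fy, fw, fh = frame
    centers = []
    for x, y, w, h in char_frames:
        cx, cy = x + w / 2, y + h / 2
        in_frame = fx <= cx <= fx + fw and fy <= cy <= fy + fh
        in_body = y_min <= cy <= y_max  # drops header and footer text
        if in_frame and in_body:
            centers.append((cx, cy))

    aligned = 0
    for i, (cx, cy) in enumerate(centers):
        others = centers[:i] + centers[i + 1:]
        same_column = any(abs(cx - ox) <= tol for ox, _ in others)
        same_row = any(abs(cy - oy) <= tol for _, oy in others)
        if same_column and same_row:
            aligned += 1

    # 1 for success, 0 for failure, matching the positioning result described above.
    return 1 if aligned >= min_cells else 0
```

With a 2 x 2 grid of character frames inside the outer contour frame the function returns 1; a single column of stacked frames returns 0 because no frame has a neighbour in the same row.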
2. The method of claim 1, wherein the positioning of the flowchart in step 2 is as follows:
step 2.B.1, adopting a logical-comparison positioning method: first, the picture obtained in step 1.3 and the picture obtained in step 1.4 are superimposed into one picture, and it is determined whether the superimposed picture contains outer contour frames; if outer contour frames exist and their number is larger than a set threshold, a flow chart may exist in the PDF page; the character frames contained in all outer contour frames are then traversed, and the center-point coordinates x and y of each character-frame rectangle are extracted and returned, the x and y values being restricted so as to eliminate the influence of noise such as headers and footers on the subsequent steps; the list storing the x coordinates is then processed: first, a for loop stores each coordinate value and its number of occurrences in a dictionary as key-value pairs; then the set() function is applied to the values in the dictionary to remove identical, redundant values so that each distinct value is kept only once; finally, the len() function returns the size of the result produced by set(), and the binarized picture obtained in step 1.2 is cropped and saved when this size meets the set value;
step 2.B.2, the applied positioning formula is as follows:
Figure FDA0004037061480000043
where Ti denotes the positioning result for the flow chart in the picture with serial number i (1 for success, 0 for failure), and T denotes the set of flow chart positioning results for all pictures;
step 2.B.3, judging each picture according to
Figure FDA0004037061480000051
using the picture set S obtained in step 1.3.2 and the picture set R obtained in step 1.4.2: all character frames inside the outer contour frames are traversed, the x coordinates of the character frames and their numbers of occurrences are counted, the number of distinct coordinates is then counted, and a flow chart is considered to exist when this number is greater than the set threshold;
after the binarized picture set C has been traversed, the picture set of all the required flow charts is obtained:
Figure FDA0004037061480000052
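The dictionary, set() and len() counting described in steps 2.B.1 and 2.B.3 can be sketched as below. It assumes, following step 2.B.3, that the quantity compared with the threshold is the number of distinct x coordinates; the function names `count_distinct_x` and `locate_flowchart` and the parameters `min_frames` and `min_distinct` are hypothetical.

```python
def count_distinct_x(xs):
    """Count the distinct x coordinates of character-frame centers."""
    counts = {}
    for x in xs:                      # for loop: coordinate -> occurrences
        counts[x] = counts.get(x, 0) + 1
    distinct = set(counts)            # each distinct coordinate kept once
    return len(distinct)


def locate_flowchart(n_outer_frames, xs, min_frames, min_distinct):
    """Return 1 if the page looks like a flow chart, else 0."""
    if n_outer_frames <= min_frames:  # too few outer contour frames
        return 0
    return 1 if count_distinct_x(xs) > min_distinct else 0
```

Since dictionary keys are already unique, the set() call mirrors the wording of the step rather than changing the result. In practice the x values would probably be rounded or bucketed first so that nearly equal coordinates count as one.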
CN202110377638.7A 2021-04-08 2021-04-08 Method for solving wired table identification and analysis Active CN113221649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110377638.7A CN113221649B (en) 2021-04-08 2021-04-08 Method for solving wired table identification and analysis

Publications (2)

Publication Number Publication Date
CN113221649A CN113221649A (en) 2021-08-06
CN113221649B true CN113221649B (en) 2023-04-18

Family

ID=77086650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110377638.7A Active CN113221649B (en) 2021-04-08 2021-04-08 Method for solving wired table identification and analysis

Country Status (1)

Country Link
CN (1) CN113221649B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant