Method for identifying and locating tables based on table analysis technology in images
Technical field
The present invention relates to the technical field of table analysis in image processing, and more particularly to a method for identifying and locating tables based on table analysis technology in images.
Background technology
Paper documents are a common way of presenting information and offer high stability and security. With the development of information technology, however, their shortcomings in information management and analysis have become increasingly apparent. Using image processing techniques to digitize the information in paper documents has become an inevitable trend.
At present, the main document digitization method used at home and abroad is to scan a paper document into an image containing various kinds of information and then extract that information with digital image techniques. When image information is extracted, extracting the table data is a critical step: if a table is misidentified or its interior is located inaccurately, the structural information of the table is lost and the OCR recognition results are also wrong.
The conventional table identification method finds the straight lines in the image, performs skew correction according to those lines, and treats a region of the corrected image as a table if its horizontal and vertical lines match table features. On the one hand, the correction in this method is prone to inaccuracy; on the other hand, some images that merely resemble tables are falsely detected, so the false detection rate is high. Conventional table localization locates table cells from straight-line information, and breaks in the lines can make the localization inaccurate.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of the above deficiencies of the prior art, a method for identifying and locating tables based on table analysis technology in images. The method decides whether an outer enclosing frame is a table by finding suspected table regions and then checking whether their interior contains the required numbers of horizontal and vertical lines. This eliminates non-table images and gives high table identification accuracy. The method also finds every enclosing frame inside the table: each enclosing frame forms a contour, the contours are sorted by position, and the table is finally located. This lays the foundation for later restoring the table data, and the table localization is highly accurate.
To achieve the above technical purpose, the technical solution adopted by the present invention is as follows.
A method for identifying and locating tables based on table analysis technology in images comprises the following steps:
(1) Scan table-like samples into images and extract all enclosing frames in each image;
(2) Set the minimum length threshold, maximum length threshold, minimum width threshold, and maximum width threshold of an enclosing frame, and set the maximum area ratio threshold and the minimum area ratio threshold;
(3) From all enclosing frames in the image, choose the enclosing frame with the largest area among those whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds;
(4) Perform skew correction on the image using the enclosing frame obtained in step (3);
(5) Extract all outer enclosing frames from the skew-corrected image, keep every outer enclosing frame whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds, and mark each outer enclosing frame kept as a suspected table region;
(6) Search for enclosing frames inside each suspected table region obtained in step (5), and extract all enclosing frames for which the ratio of the frame's area to the area of its own bounding rectangle lies between the minimum and maximum area ratio thresholds;
(7) Set the minimum and maximum count thresholds for the number of horizontal line segments contained in a suspected table region, and the minimum and maximum count thresholds for the number of vertical line segments contained in a suspected table region. Use Hough-transform line detection to count the horizontal and vertical line segments contained in all enclosing frames of the suspected table region obtained in step (6). Extract each suspected table region whose horizontal line segment count lies between the corresponding minimum and maximum count thresholds and whose vertical line segment count lies between the corresponding minimum and maximum count thresholds, and mark it as a table;
(8) Locate each table obtained in step (7) by successively finding every enclosing frame in the table.
As a further improvement of the technical solution of the present invention, choosing from all enclosing frames in the image the enclosing frame with the largest area among those whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds includes:
Choose the enclosing frame with the largest area from all enclosing frames in the image. Compare its length with the minimum and maximum length thresholds and its width with the minimum and maximum width thresholds. If its length is less than the minimum length threshold or its width is less than the minimum width threshold, label the image as a non-table image and reject it; otherwise label it as a table image to be detected;
If, in the table image to be detected, the length of the largest-area enclosing frame is greater than the maximum length threshold or its width is greater than the maximum width threshold, choose the enclosing frame with the second-largest area. If the length of that frame is greater than the maximum length threshold or its width is greater than the maximum width threshold, choose the enclosing frame with the third-largest area, and continue until an enclosing frame is chosen whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds. The frame chosen is therefore the largest-area frame among all enclosing frames whose length and width lie within the thresholds.
As a further improvement of the technical solution of the present invention, performing skew correction on the image using the enclosing frame obtained in step (3) includes:
Detect all line segments in the enclosing frame obtained in step (3) using Hough-transform line detection, compute the angle between each line segment and the horizontal direction, and choose the smallest angle. Take this smallest angle as the rotation angle of the table image to be detected and rotate the image, which completes the skew correction of the table image to be detected.
As a further improvement of the technical solution of the present invention, locating the table obtained in step (7) by successively finding every enclosing frame in the table includes:
Starting from the top-left vertex of the table obtained in step (7), successively find the enclosing frames whose height is close to that of the table's top-left vertex and sort them by their order of position;
After the first row of enclosing frames has been sorted, start from the highest vertex along the bottom of the first-row enclosing frames, successively find the enclosing frames whose height is close to that of this vertex, and sort them in turn;
After the second row of enclosing frames has been sorted, find and sort the third row of enclosing frames in the same way as the second row, and continue until the enclosing frames at the very bottom of the table have been found, giving a table whose enclosing frames are all sorted;
Locate the specific table position from the sort direction of the enclosing frames in the sorted table and their coordinates along the sort direction, which completes the table localization.
The present invention chooses, from all enclosing frames in the image, the enclosing frame with the largest area among those whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds. If no such enclosing frame can be chosen from the image, the image is identified and labeled as a non-table image and rejected. From all outer enclosing frames of the image, the invention extracts those whose length lies between the minimum and maximum length thresholds and whose width lies between the minimum and maximum width thresholds, and marks them as suspected table regions; outer enclosing frames that are not extracted are judged to be non-tables. Inside each extracted suspected table region, it then extracts all enclosing frames for which the ratio of the frame's area to the area of its own bounding rectangle lies between the minimum and maximum area ratio thresholds, which removes the interference of text and noise. Finally it extracts the suspected table regions whose horizontal line segment count lies between the corresponding minimum and maximum count thresholds and whose vertical line segment count lies between the corresponding minimum and maximum count thresholds, and marks them as tables; suspected table regions that are not extracted are judged to be non-tables. Through this identification method the invention excludes non-tables stage by stage and finally extracts the outer-enclosing-frame regions that are tables, so the table identification accuracy is high. The invention also finds every enclosing frame in the table: each enclosing frame forms a contour, the enclosing frames (that is, the contours) are sorted by position, and the table is finally located. This lays the foundation for later restoring the table data, and the table localization accuracy is high.
Brief description of the drawings
Fig. 1 is the workflow diagram of the present invention.
Embodiment
The embodiment of the present invention is further described below with reference to Fig. 1:
Referring to Fig. 1, the method for identifying and locating tables based on table analysis technology in images comprises the following steps:
(1) Scan various table-like samples into images with a device such as a scanner, and extract all enclosing frames in each image;
(2) Set the minimum length threshold L1, maximum length threshold L2, minimum width threshold W1, and maximum width threshold W2 of an enclosing frame, and set the maximum area ratio threshold S1 and the minimum area ratio threshold S2;
(3) From all enclosing frames in the image, choose the enclosing frame with the largest area among those whose length lies between the minimum length threshold L1 and the maximum length threshold L2 and whose width lies between the minimum width threshold W1 and the maximum width threshold W2. If no enclosing frame meeting these requirements can be chosen from the image, label the image as a non-table image and reject it;
(4) Perform skew correction on the image using the enclosing frame obtained in step (3);
(5) Extract all outer enclosing frames from the skew-corrected image using the contour-finding function findContours in OpenCV. Compare the length of each outer enclosing frame with the minimum length threshold L1 and the maximum length threshold L2, and compare its width with the minimum width threshold W1 and the maximum width threshold W2. Extract every outer enclosing frame whose length lies between L1 and L2 and whose width lies between W1 and W2, and mark each extracted outer enclosing frame as a suspected table region. Let the number of suspected table regions be N. Outer enclosing frames that do not meet the extraction conditions are labeled as non-tables and rejected;
(6) Search for enclosing frames inside one of the suspected table regions. For each enclosing frame inside the suspected table region, compute the ratio of the frame's area to the area of its own bounding rectangle, and compare this ratio with the maximum area ratio threshold S1 and the minimum area ratio threshold S2. Extract from the suspected table region all enclosing frames whose ratio lies between S2 and S1, which removes the interference of text and noise in the image;
(7) Set the minimum count threshold H1 and maximum count threshold H2 for the number of horizontal line segments contained in a suspected table region, and the minimum count threshold H3 and maximum count threshold H4 for the number of vertical line segments contained in a suspected table region. Use Hough-transform line detection to count the horizontal and vertical line segments contained in all enclosing frames of the suspected table region obtained in step (6). Compare the horizontal line segment count with H1 and H2, and compare the vertical line segment count with H3 and H4. If the horizontal line segment count lies between H1 and H2 and the vertical line segment count lies between H3 and H4, mark this suspected table region as a table and perform step (8); otherwise label it as a non-table, reject it, and return to step (6) to process the next suspected table region;
(8) Locate the table obtained in step (7) by successively finding every enclosing frame in the table. After locating it, return to step (6), until steps (6), (7), and (8) have been carried out on all N suspected table regions in the image and every table in the image has been located.
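The acceptance tests in steps (5) to (7) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `Thresholds` class and all function names are hypothetical, the threshold comparisons are assumed to be inclusive, the bounding rectangle is assumed to be axis-aligned, and the frame area is computed here with the shoelace formula. The enclosing frames and the line segment counts are assumed to have been obtained beforehand, for example with findContours and a Hough-transform line detector.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical container for the thresholds named in steps (2) and (7)."""
    L1: float  # minimum length
    L2: float  # maximum length
    W1: float  # minimum width
    W2: float  # maximum width
    S1: float  # maximum area ratio
    S2: float  # minimum area ratio
    H1: int    # minimum horizontal line segment count
    H2: int    # maximum horizontal line segment count
    H3: int    # minimum vertical line segment count
    H4: int    # maximum vertical line segment count


def size_ok(length, width, t):
    """Step (5): length and width must both lie within their thresholds."""
    return t.L1 <= length <= t.L2 and t.W1 <= width <= t.W2


def polygon_area(points):
    """Area of a closed polygon given as a list of (x, y) vertices (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def area_ratio(points):
    """Step (6): frame area divided by the area of its axis-aligned bounding rectangle."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    rect_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return polygon_area(points) / rect_area if rect_area else 0.0


def ratio_ok(points, t):
    """Step (6): keep frames whose area ratio lies between S2 and S1."""
    return t.S2 <= area_ratio(points) <= t.S1


def is_table(n_horizontal, n_vertical, t):
    """Step (7): both line segment counts must lie within their thresholds."""
    return t.H1 <= n_horizontal <= t.H2 and t.H3 <= n_vertical <= t.H4
```

With illustrative values such as S2 = 0.8 and S1 = 1.0, a rectangular cell contour has a ratio close to 1 and is kept, while an irregular blob such as a character stroke fills much less of its bounding rectangle and is discarded. This is the sense in which step (6) removes text and noise.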
Further, choosing from all enclosing frames in the image the enclosing frame with the largest area among those whose length lies between the minimum length threshold L1 and the maximum length threshold L2 and whose width lies between the minimum width threshold W1 and the maximum width threshold W2 includes:
Compute the area of every enclosing frame in the image and choose the enclosing frame with the largest area. Compare its length with the minimum length threshold L1 and the maximum length threshold L2, and compare its width with the minimum width threshold W1 and the maximum width threshold W2. If its length is less than L1 or its width is less than W1, label the image as a non-table image and reject it; otherwise label it as a table image to be detected;
If, in the table image to be detected, the length of the largest-area enclosing frame is greater than the maximum length threshold L2 or its width is greater than the maximum width threshold W2, choose the enclosing frame with the second-largest area and compare its length with L1 and L2 and its width with W1 and W2. If the length of the second-largest frame is greater than L2 or its width is greater than W2, choose the enclosing frame with the third-largest area, and continue comparing in this way until an enclosing frame is chosen whose length lies between L1 and L2 and whose width lies between W1 and W2. The frame chosen is therefore the largest-area frame among all enclosing frames whose length lies between L1 and L2 and whose width lies between W1 and W2. If no enclosing frame meeting these requirements can be chosen from the table image to be detected, label the table image to be detected as a non-table image and reject it.
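The selection procedure above can be sketched as follows. This is a minimal sketch, not the implementation of the invention: the function name and the `(length, width, area)` tuple representation are hypothetical, and the comparisons are assumed to be inclusive. The text only describes skipping frames that are too large; skipping a later frame that is too small is an added assumption.

```python
def select_reference_frame(frames, L1, L2, W1, W2):
    """Return the largest-area frame within the size thresholds, or None.

    `frames` is a list of (length, width, area) tuples (hypothetical
    representation). None means the image is treated as a non-table image.
    """
    if not frames:
        return None
    ordered = sorted(frames, key=lambda f: f[2], reverse=True)

    # If the largest-area frame is already too small, reject the whole image.
    length, width, _ = ordered[0]
    if length < L1 or width < W1:
        return None

    # Otherwise walk down by area and return the first frame within range.
    for frame in ordered:
        length, width, _ = frame
        if L1 <= length <= L2 and W1 <= width <= W2:
            return frame
    return None
```

Sorting by area once and returning the first frame that passes both range checks gives the same result as the repeated "choose the next-largest frame" wording, because the first frame in range is by construction the largest-area frame in range.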
Further, performing skew correction on the image using the enclosing frame obtained in step (3) includes:
Detect all line segments in the enclosing frame obtained in step (3) using Hough-transform line detection. Take the top-left vertex of the enclosing frame as the origin, the horizontal rightward direction of the enclosing frame as the positive X axis, and the vertical downward direction of the enclosing frame as the positive Y axis, and compute the angle (0 to 180 degrees) between each line segment and the positive X axis. If an angle is greater than 90 degrees, replace it with 180 degrees minus that angle. Choose the smallest resulting angle and use it as the rotation angle of the table image to be detected. If the line segment that gives this angle makes an angle of more than 90 degrees with the positive X axis, rotate counterclockwise; otherwise rotate clockwise. This completes the skew correction of the table image to be detected.
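The angle computation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the `(x1, y1, x2, y2)` segment representation are hypothetical, and the line segments are assumed to have been detected already. The returned direction simply follows the rule in the text; how it maps onto the sign convention of a particular rotation routine is left open.

```python
import math


def skew_rotation(segments):
    """Return (angle_in_degrees, direction) for skew correction, or None.

    `segments` is a list of (x1, y1, x2, y2) tuples in image coordinates
    (X to the right, Y downward), as in the text above.
    """
    best = None
    for x1, y1, x2, y2 in segments:
        # Angle with the positive X axis, reduced to the range [0, 180).
        raw = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        # Angles above 90 degrees are replaced by 180 minus the angle.
        folded = 180.0 - raw if raw > 90.0 else raw
        if best is None or folded < best[0]:
            direction = "counterclockwise" if raw > 90.0 else "clockwise"
            best = (folded, direction)
    return best
```

Vertical table lines fold to angles near 90 degrees, so taking the minimum selects a near-horizontal segment and the rotation angle stays small.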
Further, locating the table obtained in step (7) by successively finding every enclosing frame in the table includes:
Starting from the top-left vertex of the table obtained in step (7), scan from left to right to successively find the enclosing frames whose height is close to that of the table's top-left vertex, and sort these enclosing frames from left to right. After the first row of enclosing frames has been sorted, start from the highest vertex along the bottom of the first-row enclosing frames, successively find the enclosing frames whose height is close to that of this vertex, and sort them by their order of position. After the second row of enclosing frames has been sorted, find the third row of enclosing frames in the same way as the second row and sort them by their order of position, and continue until the enclosing frames at the very bottom of the table have been found, at which point every enclosing frame in the table has been sorted;
Locate the specific table position from the sort direction of the enclosing frames in the sorted table and their coordinates along the sort direction, which completes the table localization.
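The row-by-row ordering described above can be sketched as follows. This is a minimal sketch, not the implementation of the invention: the function name, the `(x, y, w, h)` representation of an enclosing frame, and the pixel tolerance `tol` used for "close in height" are hypothetical. The top of the table is taken here as the smallest top coordinate among the frames, and a fallback that restarts from the topmost remaining frame is added so the loop always terminates.

```python
def sort_cells_into_rows(cells, tol=5):
    """Group cell frames into rows (top to bottom), each sorted left to right.

    `cells` is a list of (x, y, w, h) tuples with (x, y) the top-left corner
    in image coordinates; `tol` is a hypothetical tolerance in pixels.
    """
    remaining = list(cells)
    rows = []
    # Start from the top edge of the table (smallest y among the frames).
    row_y = min(c[1] for c in remaining) if remaining else None
    while remaining:
        row = [c for c in remaining if abs(c[1] - row_y) <= tol]
        if not row:
            # No frame near the expected height: restart from the topmost one left.
            row_y = min(c[1] for c in remaining)
            continue
        row.sort(key=lambda c: c[0])  # left to right
        rows.append(row)
        remaining = [c for c in remaining if c not in row]
        # The next row starts at the highest vertex of this row's bottom edges.
        row_y = min(c[1] + c[3] for c in row)
    return rows
```

Starting the next row from the highest bottom edge, rather than the lowest, means a tall cell that spans several rows does not cause the shorter rows beside it to be skipped.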
The scope of protection of the present invention includes but is not limited to the above embodiment. The scope of protection is defined by the claims, and any replacement, variation, or improvement of this technology that is readily conceivable to those skilled in the art falls within the scope of protection of the present invention.