CN105389557A - Electronic official document classification method based on multi-region features
- Publication number
- CN105389557A (application CN201510761336.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- document
- region
- region feature
- area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/243—Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Abstract
The present invention provides an electronic official document classification method based on multi-region features, comprising: an image preprocessing step in which image graying, adaptive filtering, gray stretching, optimal threshold calculation, and image binarization are performed in succession; a region feature extraction step in which image block pixel distribution statistical features, smoothed image histogram features, and image texture features are extracted; a standard document multi-region feature extraction and storage step in which standard document image preprocessing, key region selection, region feature extraction, and generation of a document type feature matrix are performed in succession; and a document type identification step in which the document type feature matrix and the corresponding feature regions are read from a database, the corresponding feature region images of the tested document image are obtained, a feature vector is calculated for each feature region of the tested document image, and a document type similarity is calculated from the correlation coefficient matrix of the two feature matrices. The method can accurately classify or identify government official documents, is simple to operate, and is convenient to implement.
Description
Technical field
The present invention relates to an electronic official document classification method based on multi-region features, and in particular to type identification of government official document images.
Background art
Official documents (government or administrative documents) are practical writings with legal authority and a standardized format that are used in administering society and governing the state. As the principal carrier for expressing the will of the state, issuing laws and regulations, standardizing administrative enforcement, and conveying important information, official documents are to some extent a continuation of and supplement to national laws and regulations. Their types generally include: resolution, decision, order, communiqué, announcement, public notice, opinion, notification, circular, report, request for instructions, official reply, proposal, letter, and minutes.
With the development of e-government, networked, information-based, and electronic government work is becoming increasingly common. To improve government efficiency, automatic classification or identification of electronic official documents has become a problem in urgent need of a solution.
At present, the classification of electronic documents, both in China and abroad, is mainly limited to classification by electronic file type; there is not yet a classification or recognition system or method for electronic official documents based on image content features.
Because official documents are formal documents, they follow a relatively fixed format and layout. For example, the format elements of an administrative document can be divided into three parts: header, body, and end matter. The elements above the red separator line are collectively called the header; the elements from below the red separator line (exclusive) down to the subject terms (exclusive) are collectively called the body; the elements from the subject terms downward are collectively called the end matter. Within the header, the issuing-authority mark consists of the full name or standardized abbreviation of the issuing authority followed by the word "Document", or followed by the document genre in parentheses, and is printed centered at the top of the first page in large red type. For jointly issued documents, the name of the lead authority may be used, or the names of the co-signing authorities may be used together. The document number consists of the issuing authority's code, the year, and the issue serial number. These elements of an official document can therefore serve as key points for identifying its type.
Summary of the invention
The present invention overcomes the shortcomings of the prior art and provides an electronic official document classification method based on multi-region features. The method can identify the type of electronic official documents that follow an existing format template.
To solve the above technical problem, the present invention is realized by the following technical solution:
An electronic official document classification method based on multi-region features, comprising the following steps:
1) Image preprocessing
(1) Image graying: the acquired electronic official document image is generally a color image, so to simplify processing the color information is first converted into gray space;
(2) Image adaptive filtering: noise in the captured electronic official document image is filtered out by adaptive median filtering;
(3) Image gray stretching: real images often show large differences in illumination, so their gray levels are inconsistent, which would cause large errors in subsequent processing. Gray stretching adjusts the distribution of gray-level pixels and helps correct the gray-level deviation caused by insufficient illumination;
(4) Image optimal threshold calculation: when real images are binarized, the results for images taken under different illumination often differ considerably. The present invention computes an adaptive threshold for the image by an iterative algorithm, which reduces the influence of illumination on the binarization result, keeps binarized official documents consistent, and thereby ensures the accuracy of official document identification;
(5) Image binarization: the image is converted into an image containing only two colors, black and white;
(6) Image skew correction: straight lines with an angle in the range of 0° to 5° are detected by the Hough transform, and the image skew is corrected accordingly.
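Steps (1), (2), and (6) above can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions rather than details from the patent: images are nested lists, graying uses the common BT.601 luma weights, a plain 3x3 median filter stands in for the adaptive median filter, and the skew estimate is a simplified Hough-style vote over candidate angles from 0° to 5° in 0.5° steps. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def to_gray(rgb):
    """Convert rows of (R, G, B) tuples to 8-bit gray (BT.601 weights, an assumption)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb]


def median_filter3(gray):
    """Plain 3x3 median filter (not the adaptive variant); borders are copied unchanged."""
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [gray[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


def estimate_skew(binary, max_deg=5.0, step=0.5):
    """Return the candidate angle (degrees) whose strongest line collects the most votes.

    For each candidate angle, every foreground pixel votes for the offset of the
    line through it at that angle; the angle with the highest single peak wins.
    """
    points = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    best_angle, best_votes = 0.0, -1
    for i in range(int(round(max_deg / step)) + 1):
        deg = i * step
        a = math.radians(deg)
        votes = Counter(math.floor(y * math.cos(a) - x * math.sin(a) + 0.5)
                        for x, y in points)
        peak = max(votes.values()) if votes else 0
        if peak > best_votes:
            best_angle, best_votes = deg, peak
    return best_angle
```

Once an angle is estimated, the image would be rotated by the opposite angle to deskew it; the rotation itself is omitted here.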
2) Region feature extraction
(1) image block pixel distribution statistical feature;
(2) smoothed image histogram feature;
(3) image texture feature;
3) Standard document multi-region feature extraction and storage
(1) standard document image preprocessing;
(2) standard document image key region selection;
(3) feature extraction for each region of the standard document image, yielding a feature vector for each region;
(4) generation of the document type feature matrix;
4) Document type identification
(1) the document type feature matrix and the corresponding feature regions are read from the database;
(2) the corresponding feature region images of the tested document image are obtained;
(3) the feature vector of each feature region of the tested document image is calculated;
(4) the feature matrix of the tested document is calculated;
(5) the correlation coefficient matrix of the two feature matrices is computed and the document type similarity is calculated from it; this value serves as the basis for judging whether the two images are consistent.
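As an illustration of step 4)(5), the sketch below scores two feature matrices by correlating them region by region. The patent does not state how the correlation coefficients are reduced to a single similarity, so the choices here (Pearson correlation per region, negative values clipped to 0, then a plain mean) and the names `pearson` and `document_similarity` are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Constant vectors have no defined correlation; treat equal vectors as a match.
        return 1.0 if list(a) == list(b) else 0.0
    return cov / math.sqrt(var_a * var_b)


def document_similarity(template_matrix, tested_matrix):
    """Mean per-region correlation, with negative correlations clipped to 0."""
    scores = [max(0.0, pearson(t, s)) for t, s in zip(template_matrix, tested_matrix)]
    return sum(scores) / len(scores)
```

With this reduction, identical feature matrices score 1 and unrelated ones score near 0.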
Further, the image binarization is performed as follows: the image is first corrected by gray stretching and gray smoothing, and the optimal threshold method is then used to binarize it.
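A minimal sketch of this binarization chain follows. It assumes a linear min-max stretch and the classic iterative mean-splitting threshold (the threshold is repeatedly set to the midpoint of the mean gray level below it and the mean above it); the patent only says that an iterative algorithm is used, so the exact rule, the `eps` tolerance, and the function names are assumptions.

```python
def stretch_gray(gray):
    """Linearly map the observed [min, max] gray range onto [0, 255]."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        return [row[:] for row in gray]
    scale = 255.0 / (hi - lo)
    return [[int(round((v - lo) * scale)) for v in row] for row in gray]


def iterative_threshold(gray, eps=0.5, max_iter=100):
    """Iterate T = (mean of pixels <= T + mean of pixels > T) / 2 until T settles."""
    pixels = [v for row in gray for v in row]
    t = sum(pixels) / len(pixels)
    for _ in range(max_iter):
        low = [v for v in pixels if v <= t]
        high = [v for v in pixels if v > t]
        if not low or not high:
            break
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        converged = abs(new_t - t) < eps
        t = new_t
        if converged:
            break
    return t


def binarize(gray, t):
    """Mark dark (ink) pixels as 1 and light (paper) pixels as 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

Because the threshold is derived from the stretched image itself, a darker or brighter capture of the same page tends to yield a similar binary result.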
Further, the image block pixel distribution statistical feature is obtained as follows: first, each region image is further divided into blocks; then, the number of pixels in each block is counted and its proportion within the region image is calculated; finally, a distribution statistics histogram is generated.
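The block statistic can be sketched as below. The sketch assumes a binarized region (1 for ink, 0 for paper), a fixed 2x2 grid, and that "proportion" means each block's share of the region's foreground pixels; the patent does not fix the grid size or the normalization, and `block_distribution` is a hypothetical name.

```python
def block_distribution(region, rows=2, cols=2):
    """Share of the region's foreground pixels falling in each block, row-major order."""
    h, w = len(region), len(region[0])
    counts = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            counts.append(sum(region[y][x]
                              for y in range(y0, y1) for x in range(x0, x1)))
    total = sum(counts)
    return [n / total if total else 0.0 for n in counts]
```

For example, a 4x4 region with three ink pixels in the top-left block and one in the bottom-right block yields `[0.75, 0.0, 0.0, 0.25]`.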
Further, the smoothed image histogram feature is obtained as follows: first, Gaussian smoothing is applied to each region image; then, the gray-level distribution histogram of each region image is calculated.
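A small sketch of this feature, under stated assumptions: the 3x3 binomial kernel `[1, 2, 1] / 4` (applied separably) approximates the Gaussian, borders are handled by clamping, and the histogram uses 16 equal-width bins normalized to sum to 1. None of these parameters come from the patent.

```python
def gaussian_blur3(gray):
    """Smooth with the separable binomial kernel [1, 2, 1] / 4, clamping at borders."""
    h, w = len(gray), len(gray[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    horiz = [[(row[clamp(x - 1, w)] + 2 * row[x] + row[clamp(x + 1, w)]) / 4.0
              for x in range(w)] for row in gray]
    return [[(horiz[clamp(y - 1, h)][x] + 2 * horiz[y][x] + horiz[clamp(y + 1, h)][x]) / 4.0
             for x in range(w)] for y in range(h)]


def gray_histogram(gray, bins=16):
    """Normalized histogram of gray levels 0..255 grouped into equal-width bins."""
    hist = [0] * bins
    n = 0
    for row in gray:
        for v in row:
            hist[min(bins - 1, int(v) * bins // 256)] += 1
            n += 1
    return [c / n for c in hist]
```

Smoothing before counting makes the histogram less sensitive to isolated noisy pixels, and normalizing it makes regions of different sizes comparable.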
Further, the image texture feature is obtained as follows: first, Gaussian smoothing is applied to each region image; then, the SURF feature points and feature vectors of each region image are calculated.
Further, the standard document multi-region feature is obtained as follows: each key region of the document image is set as a feature extraction region for document classification, and the statistical features of each region image are extracted.
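Assembling per-region features into a document feature matrix might look like the sketch below. Regions are assumed to be `(x, y, width, height)` boxes on a binarized image, and the per-region statistic is a placeholder (ink density in each cell of a 2x2 grid) standing in for the three feature types named above; all names are hypothetical.

```python
def crop(image, box):
    """Cut the (x, y, width, height) box out of a row-major image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def region_feature(region, rows=2, cols=2):
    """Placeholder statistic: fraction of ink pixels in each grid cell."""
    h, w = len(region), len(region[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(region[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / area if area else 0.0)
    return feats


def feature_matrix(image, boxes):
    """One feature vector (row) per key region, in the order the boxes are given."""
    return [region_feature(crop(image, box)) for box in boxes]
```

Because the same boxes are applied to the template and to the tested image, row i of both matrices describes the same part of the page.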
Compared with the prior art, the beneficial effects of the present invention are as follows:
The electronic official document classification method based on multi-region features of the present invention can accurately classify or identify government official documents, is simple to operate, and is convenient to implement. It also has wide applicability: it works with many popular image file formats, supports color and grayscale images alike, and can identify any official document type that has already been stored. The method adapts to a variety of illumination conditions, including different levels of brightness and exposure. It can automatically analyze the tonal range of the background color and effectively eliminate the influence of the background image on document classification. It is robust to rotation and noise and resists environmental noise well. It offers good accuracy and speed with a low error rate.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the present invention and, together with the embodiments, serve to explain the present invention; they do not limit the present invention. In the drawings:
Fig. 1 is a flowchart of the image preprocessing of the present invention.
Fig. 2 is a flowchart of the region image feature extraction of the present invention.
Fig. 3 is a flowchart of the document feature extraction and feature storage of the present invention.
Fig. 4 is a flowchart of the electronic official document type identification of the present invention.
Fig. 5 to Fig. 7 show document identification results.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Figs. 1 to 3 are flowcharts of the electronic official document classification method based on multi-region features of the present invention.
The inputs of the method are the electronic official document image to be identified and the standard electronic official document template images; the output is the similarity result of the document identification. See Fig. 4.
1. Implementation process
1) Standard official document image entry
(1) Read the electronic official document image. The image may be a JPG, BMP, or other common image file format.
(2) Image preprocessing. The original image undergoes graying, gray stretching, noise filtering, binarization, skew correction, etc.
(3) Image region setting. Feature regions are set according to the type and characteristics of the electronic official document.
(4) Extract the features of each region and calculate the feature matrix of the electronic official document.
(5) Store the feature regions and the feature matrix in the database.
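The storage step can be sketched with Python's built-in `sqlite3` module. The patent does not name a database or schema, so the `templates` table, its columns, and the JSON encoding of regions and feature matrices are all assumptions made for illustration.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the template store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        "doc_type TEXT PRIMARY KEY, regions TEXT NOT NULL, features TEXT NOT NULL)"
    )
    return conn


def save_template(conn, doc_type, regions, features):
    """Store one standard document's feature regions and feature matrix as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO templates VALUES (?, ?, ?)",
        (doc_type, json.dumps(regions), json.dumps(features)),
    )
    conn.commit()


def load_templates(conn):
    """Return {doc_type: (regions, features)} for every stored standard document."""
    rows = conn.execute("SELECT doc_type, regions, features FROM templates ORDER BY doc_type")
    return {doc_type: (json.loads(r), json.loads(f)) for doc_type, r, f in rows}
```

Keeping the regions next to the features matters because identification later has to crop the tested image with exactly the regions used for each template.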
2) Tested official document type identification
(1) Read the feature regions and feature matrix of each standard official document image from the database.
(2) Calculate the feature matrix for the corresponding feature regions of the tested official document image.
(3) Compare the document feature matrices for similarity.
(4) Obtain the document type number of the official document.
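The identification loop above can be sketched as follows. The feature extractor and the similarity function are passed in as parameters so that the sketch stays self-contained; the acceptance threshold of 0.8 is an assumption, since the patent reports similarity values but no decision threshold, and `identify` is a hypothetical name.

```python
def identify(image, templates, extract, similarity, threshold=0.8):
    """Return (best matching document type or None, its similarity score).

    templates maps doc_type -> (regions, template_feature_matrix).
    extract(image, regions) returns the tested image's feature matrix.
    similarity(a, b) returns a score in [0, 1] for two feature matrices.
    """
    best_type, best_score = None, 0.0
    for doc_type, (regions, template_features) in templates.items():
        # The tested image is always measured in the template's own regions.
        tested_features = extract(image, regions)
        score = similarity(template_features, tested_features)
        if score > best_score:
            best_type, best_score = doc_type, score
    if best_score < threshold:
        return None, best_score
    return best_type, best_score
```

Returning `None` below the threshold lets the caller treat a document as an unknown type instead of forcing it into the nearest stored template.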
2. Embodiments
[Embodiment 1] As shown in Fig. 5, when the tested image is identical to the standard image, the document type is identified accurately: the similarity result is 1, i.e. the tested official document image is the same as the standard official document image.
[Embodiment 2] As shown in Fig. 6, when the tested image differs from the standard image, the document type is identified accurately: the similarity result is 0.17, i.e. the tested official document image is not the same as the standard official document image.
[Embodiment 3] As shown in Fig. 7, when the tested image differs from the standard image, the document type is again identified accurately: the similarity result is 0.2, i.e. the tested official document image is not the same as the standard official document image.
Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the embodiments, a person skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
1. An electronic official document classification method based on multi-region features, characterized by comprising the following steps:
1) Image preprocessing
(1) image graying;
(2) image adaptive filtering;
(3) image gray stretching;
(4) image optimal threshold calculation;
(5) image binarization;
(6) image skew correction;
2) Region feature extraction
(1) image block pixel distribution statistical feature;
(2) smoothed image histogram feature;
(3) image texture feature;
3) Standard document multi-region feature extraction and storage
(1) standard document image preprocessing;
(2) standard document image key region selection;
(3) feature extraction for each region of the standard document image, yielding a feature vector for each region;
(4) generation of the document type feature matrix;
4) Document type identification
(1) the document type feature matrix and the corresponding feature regions are read from the database;
(2) the corresponding feature region images of the tested document image are obtained;
(3) the feature vector of each feature region of the tested document image is calculated;
(4) the feature matrix of the tested document is calculated;
(5) the correlation coefficient matrix of the two feature matrices is computed and the document type similarity is calculated from it.
2. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the image binarization is performed as follows: the image is first corrected by gray stretching and gray smoothing, and the optimal threshold method is then used to binarize it.
3. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the image block pixel distribution statistical feature is obtained as follows: first, each region image is further divided into blocks; then, the number of pixels in each block is counted and its proportion within the region image is calculated; finally, a distribution statistics histogram is generated.
4. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the smoothed image histogram feature is obtained as follows: first, Gaussian smoothing is applied to each region image; then, the gray-level distribution histogram of each region image is calculated.
5. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the image texture feature is obtained as follows: first, Gaussian smoothing is applied to each region image; then, the SURF feature points and feature vectors of each region image are calculated.
6. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the standard document multi-region feature is obtained as follows: each key region of the document image is set as a feature extraction region for document classification, and the statistical features of each region image are extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510761336.4A CN105389557A (en) | 2015-11-10 | 2015-11-10 | Electronic official document classification method based on multi-region features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105389557A true CN105389557A (en) | 2016-03-09 |
Family
ID=55421829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510761336.4A Pending CN105389557A (en) | 2015-11-10 | 2015-11-10 | Electronic official document classification method based on multi-region features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105389557A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1973299A (en) * | 2004-08-19 | 2007-05-30 | 三菱电机株式会社 | Image search method and image search device |
CN101447017A (en) * | 2008-11-27 | 2009-06-03 | 浙江工业大学 | Method and system for quickly identifying and counting votes on the basis of layout analysis |
CN102663403A (en) * | 2012-04-26 | 2012-09-12 | 北京工业大学 | System and method used for extracting lane information in express way intelligent vehicle-navigation and based on vision |
CN103208001A (en) * | 2013-02-06 | 2013-07-17 | 华南师范大学 | Remote sensing image processing method combined with shape self-adaption neighborhood and texture feature extraction |
CN103226616A (en) * | 2013-05-16 | 2013-07-31 | 南京龙渊微电子科技有限公司 | Image content retrieval system and image content sparse learning method thereof |
CN104217221A (en) * | 2014-08-27 | 2014-12-17 | 重庆大学 | Method for detecting calligraphy and paintings based on textural features |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344815A (en) * | 2018-12-13 | 2019-02-15 | 深源恒际科技有限公司 | A kind of file and picture classification method |
CN109344815B (en) * | 2018-12-13 | 2021-08-13 | 深源恒际科技有限公司 | Document image classification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Singh | Optical character recognition techniques: a survey | |
US8494273B2 (en) | Adaptive optical character recognition on a document with distorted characters | |
CN102542660B (en) | Bill anti-counterfeiting identification method based on bill watermark distribution characteristics | |
Moghaddam et al. | Application of multi-level classifiers and clustering for automatic word spotting in historical document images | |
CN105825211B (en) | Business card identification method, apparatus and system | |
CN106570475B (en) | A kind of dark-red enameled pottery seal search method | |
CN108053545B (en) | Certificate verification method and device, server and storage medium | |
CN106503694B (en) | Digit recognition method based on eight neighborhood feature | |
CN103577818A (en) | Method and device for recognizing image characters | |
Tian et al. | Natural scene text detection with MC–MR candidate extraction and coarse-to-fine filtering | |
CN111860525B (en) | Bottom-up optical character recognition method suitable for terminal block | |
CN109002768A (en) | Medical bill class text extraction method based on the identification of neural network text detection | |
CN112508011A (en) | OCR (optical character recognition) method and device based on neural network | |
CN109255414A (en) | A kind of colour barcode made an inventory for books, books recognition methods, electronic equipment and storage medium | |
CN103996055A (en) | Identification method based on classifiers in image document electronic material identification system | |
CN116740723A (en) | PDF document identification method based on open source Paddle framework | |
CN109446997A (en) | Document code automatic identifying method | |
CN109034154A (en) | The extraction and recognition methods of Invoice Seal duty paragraph | |
CN115713772A (en) | Transformer substation panel character recognition method, system, equipment and storage medium | |
US20220215679A1 (en) | Method of determining a density of cells in a cell image, electronic device, and storage medium | |
CN105117704A (en) | Text image consistency comparison method based on multiple features | |
CN109508712A (en) | A kind of Chinese written language recognition methods based on image | |
CN105389557A (en) | Electronic official document classification method based on multi-region features | |
Ovodov | Optical Braille recognition using object detection CNN | |
CN116343237A (en) | Bill identification method based on deep learning and knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160309 |