CN105389557A - Electronic official document classification method based on multi-region features - Google Patents

Publication number
CN105389557A
Authority
CN
China
Legal status
Pending
Application number
CN201510761336.4A
Other languages
Chinese (zh)
Inventor
王东
李晓东
陈俊健
顾艳春
Current Assignee
Foshan University
Original Assignee
Foshan University
Priority date
2015-11-10
Filing date
2015-11-10
Publication date
2016-03-09
Application filed by Foshan University
Priority to CN201510761336.4A
Publication of CN105389557A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/243 Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The present invention provides an electronic official document classification method based on multi-region features, comprising: an image preprocessing step in which grayscale conversion, adaptive filtering, gray-level stretching, optimal threshold calculation, and image binarization are performed in succession; a region feature extraction step in which block-wise pixel distribution statistics, smoothed image histogram features, and image texture features are extracted; a standard document multi-region feature extraction and storage step in which standard document image preprocessing, key region selection, region feature extraction, and generation of a document type feature matrix are performed in succession; and a document type identification step in which the document type feature matrix and the corresponding feature regions are read from a database, the corresponding feature region images of the tested document image are obtained, a feature vector is calculated for each feature region of the tested document image, and a document type similarity is calculated from the correlation coefficient matrix of the two sets of features. The method can accurately classify or identify government official documents, and is simple to operate and convenient to implement.

Description

An electronic official document classification method based on multi-region features
Technical field
The present invention relates to an electronic official document classification method based on multi-region features, and in particular to identifying the type of government official document images.
Background
Government and administrative official documents (short for "public affairs documents") are practical documents with legal authority and a standardized format, used in the practice of administering society and governing the state. As the principal carrier for expressing the will of the state, implementing laws and regulations, regulating administrative enforcement, and conveying important information, official documents are to some extent a continuation of and supplement to national laws and regulations. Their types generally include: resolutions, decisions, orders (decrees), communiqués, announcements, public notices, opinions, notices, circulars, reports, requests for instructions, replies, proposals, letters, and minutes.
With the development of e-government, the networking, informatization, and digitization of government have become increasingly widespread. To improve government efficiency, automatic classification or identification of government electronic official documents has become a problem in urgent need of a solution.
At present, both in China and abroad, the classification of electronic official documents is mainly limited to classification by electronic file type; there is not yet a classification or identification system or method for electronic official documents based on image content features.
Because government official documents are formal documents, they have a relatively fixed format and layout requirements. For example, the format elements of an administrative document can be divided into three parts: the header, the main body, and the imprint. The elements above the red separator line are collectively called the header; the elements between the red separator line (exclusive) and the subject terms (exclusive) are collectively called the main body; and the elements below the subject terms are collectively called the imprint. The masthead consists of the issuing authority's full name or standardized abbreviation followed by the two characters "文件" ("document"), or followed by the document type indicated in parentheses, and is printed in large red characters, centered at the top of the first page. For jointly issued documents, the name of the lead authority may be used, optionally together with the names of the co-signing authorities. The document number consists of the issuing authority's code, the year, and the issue sequence number. These elements can therefore serve as key points for identifying the type of an official document.
Summary of the invention
The present invention overcomes the shortcomings of the prior art and provides an electronic official document classification method based on multi-region features. The method can identify the type of electronic official documents that follow an existing format template.
To solve the above technical problem, the present invention is implemented through the following technical solution:
An electronic official document classification method based on multi-region features, comprising the following steps:
1) Image preprocessing
(1) grayscale conversion: because the acquired electronic official document image is generally a color image, the color information must be converted to grayscale to simplify processing;
(2) adaptive filtering: noise in the captured electronic official document image is removed with an adaptive median filter;
(3) gray-level stretching: real images often exhibit large differences in illumination, so their gray levels are inconsistent, which causes large errors in subsequent processing. Stretching the gray levels adjusts the distribution of gray-level pixels and helps correct gray-level deviations caused by poor illumination;
(4) optimal threshold calculation: in practice, binarizing images captured under different illumination often yields markedly different results. The present invention computes an adaptive threshold for the image with an iterative algorithm, which reduces the influence of illumination on the binarization result, keeps binarized official documents consistent, and thereby ensures the accuracy of document identification;
(5) binarization: the image is converted into an image containing only black and white;
(6) skew correction: straight lines at angles within 0° to 5° are detected with the Hough transform, and the image skew is corrected accordingly.
2) Region feature extraction
(1) block-wise pixel distribution statistics feature;
(2) smoothed image histogram feature;
(3) image texture feature;
3) Standard document multi-region feature extraction and storage
(1) preprocess the standard document image;
(2) select the key regions of the standard document image;
(3) extract the features of each region of the standard document image to obtain a feature vector for each region;
(4) generate the document type feature matrix;
4) Document type identification
(1) read the document type feature matrix and the corresponding feature regions from the database;
(2) obtain the corresponding feature region images of the tested document image;
(3) calculate the feature vector of each feature region of the tested document image;
(4) calculate the feature matrix of the tested document;
(5) compute the correlation coefficient matrix of the two feature matrices, calculate the document type similarity, and use this value as the basis for judging whether the images match.
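The similarity calculation in step 4)(5) can be illustrated with a minimal Python sketch. It assumes that each feature matrix is a list of per-region feature vectors and that the document similarity is the mean Pearson correlation over corresponding regions. The patent does not state how the correlation coefficients are aggregated, so the averaging and the names `pearson` and `document_similarity` are illustrative assumptions, not the patented formula.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant vector has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def document_similarity(matrix_a, matrix_b):
    """Mean per-region correlation (aggregation by mean is an assumption)."""
    scores = [pearson(row_a, row_b) for row_a, row_b in zip(matrix_a, matrix_b)]
    return sum(scores) / len(scores)
```

Under this sketch, comparing a feature matrix with itself gives a similarity of 1 (up to floating-point rounding), as long as no region vector is constant.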
Further, the image binarization is performed as follows: the image is first corrected by gray-level stretching and gray-level smoothing, and is then binarized using the optimal threshold method.
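A minimal sketch of gray-level stretching followed by an iterative threshold is given below. The patent says only that the threshold is obtained "by an iterative algorithm"; the class-mean midpoint iteration used here is one common choice and is an assumption, as are the function names and the `eps` stopping tolerance. Images are plain lists of rows of 0..255 integers to keep the example dependency-free; a production implementation would more likely use an image library.

```python
def stretch_gray(image):
    """Linearly stretch gray levels so they span the full 0..255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image cannot be stretched; return a copy unchanged.
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def iterative_threshold(image, eps=0.5):
    """Repeat: split pixels at t, then move t to the midpoint of the class means."""
    flat = [p for row in image for p in row]
    t = sum(flat) / len(flat)
    while True:
        low = [p for p in flat if p <= t]
        high = [p for p in flat if p > t]
        if not low or not high:
            return t
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < eps:
            return new_t
        t = new_t


def binarize(image, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if p > threshold else 0 for p in row] for row in image]
```

A typical call chain would be `stretched = stretch_gray(gray)` followed by `binarize(stretched, iterative_threshold(stretched))`.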
Further, the block-wise pixel distribution statistics feature is obtained as follows: first, each region image is further divided into blocks; then, the number of pixels in each block is counted and its proportion within the region image is calculated; finally, a distribution statistics histogram is generated.
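The block-wise statistics can be sketched as follows. The sketch assumes a binarized region image in which foreground (ink) pixels have the value 0, and it normalizes each block's count by the total foreground count of the region. The patent does not say which pixels are counted or what the denominator is, so both choices, the 2 x 2 grid default, and the function name are illustrative assumptions.

```python
def block_pixel_distribution(binary, rows=2, cols=2, foreground=0):
    """Share of the region's foreground pixels that falls in each block."""
    height, width = len(binary), len(binary[0])
    counts = []
    for r in range(rows):
        for c in range(cols):
            # Integer block boundaries that cover the whole region.
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            counts.append(sum(
                1
                for y in range(y0, y1)
                for x in range(x0, x1)
                if binary[y][x] == foreground
            ))
    total = sum(counts)
    return [n / total if total else 0.0 for n in counts]
```

The returned list is ordered row by row (top-left block first) and can be used directly as one region's feature vector.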
Further, the smoothed image histogram feature is obtained as follows: first, Gaussian smoothing is applied to the region image; then, the gray-level distribution histogram of each region image is calculated.
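As a rough illustration of this feature, the sketch below applies a separable Gaussian blur and then builds a normalized gray-level histogram. The kernel radius, `sigma`, the border handling (edge replication), and the choice of 16 bins are all assumptions made for the example; the patent does not specify them.

```python
import math


def gaussian_kernel(sigma=1.0, radius=2):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(d * d) / (2 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian blur; border pixels are replicated."""
    kernel = gaussian_kernel(sigma, radius)
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)

    def clamp(i, n):
        return max(0, min(n - 1, i))

    # Horizontal pass, then vertical pass over the intermediate result.
    rows = [[sum(k * row[clamp(x + d, width)] for d, k in zip(offsets, kernel))
             for x in range(width)] for row in image]
    return [[sum(k * rows[clamp(y + d, height)][x] for d, k in zip(offsets, kernel))
             for x in range(width)] for y in range(height)]


def gray_histogram(image, bins=16):
    """Normalized gray-level histogram for pixel values in 0..255."""
    hist = [0] * bins
    for row in image:
        for p in row:
            index = min(bins - 1, max(0, int(p * bins / 256)))
            hist[index] += 1
    total = sum(hist)
    return [h / total for h in hist]
```

Calling `gray_histogram(gaussian_smooth(region))` yields one histogram vector per region image.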
Further, the image texture feature is obtained as follows: first, Gaussian smoothing is applied to the region image; then, the SURF keypoints and feature vectors of each region image are calculated.
Further, the standard document multi-region features are obtained as follows: each key region of the document image is set as a feature extraction region for document classification, and the statistical features of each region image are extracted.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The electronic official document classification method based on multi-region features of the present invention can accurately classify or identify government official documents, is simple to operate, and is convenient to implement. It also has broad applicability: it works with many popular image file formats, supports color and grayscale images alike, and can identify any official document type that has already been stored in the database. The method adapts to a variety of illumination conditions, including different levels of brightness and exposure. It can automatically analyze the tonal range of the background and effectively eliminate the influence of the background image on document classification. It is robust to rotation and noise, and resists environmental noise well. It offers good accuracy and speed, with a low error rate.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and, together with the embodiments, serve to explain the invention; they do not limit it. In the drawings:
Fig. 1 is a flowchart of the image preprocessing of the present invention.
Fig. 2 is a flowchart of the region image feature extraction of the present invention.
Fig. 3 is a flowchart of the document feature extraction and feature storage of the present invention.
Fig. 4 is a flowchart of the electronic official document type identification of the present invention.
Figs. 5 to 7 show document identification results.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Figs. 1 to 3 are flowcharts of the electronic official document classification method based on multi-region features of the present invention.
The inputs of the method are the electronic official document image to be identified and the standard electronic official document template images; the output is the similarity result of the document identification. See Fig. 4.
1. Implementation process
1) Standard official document image entry
(1) Read the electronic official document image. The image may be a JPG, BMP, or other common image file format.
(2) Preprocess the image: grayscale conversion of the original image, gray-level stretching, filtering and denoising, binarization, skew correction, and so on.
(3) Set the image regions. Feature regions are set according to the type and characteristics of the electronic official document.
(4) Extract the features of each region and calculate the feature matrix of the electronic official document.
(5) Store the feature regions and the feature matrix in the database.
2) Tested official document type identification
(1) Read the feature regions and the feature matrix of each standard official document image from the database.
(2) Calculate the feature matrix for the corresponding feature regions of the tested official document image.
(3) Compare the document feature matrices for similarity.
(4) Obtain the official document type number.
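The identification loop in part 2) can be sketched in Python as follows. The sketch assumes the stored templates are available as a dictionary from document type number to a flattened feature vector and that a similarity function is supplied by the caller. The `min_score` cut-off of 0.8, the toy `overlap_similarity` function, and the sample vectors are hypothetical values chosen for the example; the patent does not give a decision threshold.

```python
def identify_document_type(tested, templates, similarity, min_score=0.8):
    """Return (type_number, score) for the best template, or (None, score)."""
    best_type, best_score = None, float("-inf")
    for type_number, template in templates.items():
        score = similarity(tested, template)
        if score > best_score:
            best_type, best_score = type_number, score
    if best_score < min_score:
        # No stored template is similar enough to accept.
        return None, best_score
    return best_type, best_score


def overlap_similarity(a, b):
    """Toy similarity for the demo: 1 minus the mean absolute difference."""
    return 1 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical stored templates keyed by document type number.
templates = {1: [0.75, 0.0, 0.0, 0.25], 2: [0.1, 0.4, 0.4, 0.1]}
result = identify_document_type([0.7, 0.05, 0.0, 0.25], templates, overlap_similarity)
```

Passing the similarity as a parameter keeps the loop independent of how the feature matrices are compared, so a correlation-based measure can be substituted without changing the matching logic.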
2. Embodiments
[Embodiment 1] As shown in Fig. 5, when the tested image is identical to the standard image, the document type is identified accurately: the similarity result is 1, meaning the tested official document image is the same as the standard official document image.
[Embodiment 2] As shown in Fig. 6, when the tested image differs from the standard image, the document type is still identified accurately: the similarity result is 0.17, meaning the tested official document image is not the same as the standard official document image.
[Embodiment 3] As shown in Fig. 7, when the tested image differs from the standard image, the document type is again identified accurately: the similarity result is 0.2, meaning the tested official document image is not the same as the standard official document image.
Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. An electronic official document classification method based on multi-region features, characterized in that it comprises the following steps:
1) Image preprocessing
(1) grayscale conversion;
(2) adaptive filtering;
(3) gray-level stretching;
(4) optimal threshold calculation;
(5) binarization;
(6) skew correction;
2) Region feature extraction
(1) block-wise pixel distribution statistics feature;
(2) smoothed image histogram feature;
(3) image texture feature;
3) Standard document multi-region feature extraction and storage
(1) preprocess the standard document image;
(2) select the key regions of the standard document image;
(3) extract the features of each region of the standard document image to obtain a feature vector for each region;
(4) generate the document type feature matrix;
4) Document type identification
(1) read the document type feature matrix and the corresponding feature regions from the database;
(2) obtain the corresponding feature region images of the tested document image;
(3) calculate the feature vector of each feature region of the tested document image;
(4) calculate the feature matrix of the tested document;
(5) compute the correlation coefficient matrix of the two feature matrices and calculate the document type similarity.
2. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the image binarization is performed as follows: the image is first corrected by gray-level stretching and gray-level smoothing, and is then binarized using the optimal threshold method.
3. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the block-wise pixel distribution statistics feature is obtained as follows: first, each region image is further divided into blocks; then, the number of pixels in each block is counted and its proportion within the region image is calculated; finally, a distribution statistics histogram is generated.
4. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the smoothed image histogram feature is obtained as follows: first, Gaussian smoothing is applied to the region image; then, the gray-level distribution histogram of each region image is calculated.
5. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the image texture feature is obtained as follows: first, Gaussian smoothing is applied to the region image; then, the SURF keypoints and feature vectors of each region image are calculated.
6. The electronic official document classification method based on multi-region features according to claim 1, characterized in that the standard document multi-region features are obtained as follows: each key region of the document image is set as a feature extraction region for document classification, and the statistical features of each region image are extracted.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510761336.4A CN105389557A (en) 2015-11-10 2015-11-10 Electronic official document classification method based on multi-region features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510761336.4A CN105389557A (en) 2015-11-10 2015-11-10 Electronic official document classification method based on multi-region features

Publications (1)

Publication Number Publication Date
CN105389557A (en) 2016-03-09

Family

ID=55421829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510761336.4A Pending CN105389557A (en) 2015-11-10 2015-11-10 Electronic official document classification method based on multi-region features

Country Status (1)

Country Link
CN (1) CN105389557A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1973299A (en) * 2004-08-19 2007-05-30 三菱电机株式会社 Image search method and image search device
CN101447017A (en) * 2008-11-27 2009-06-03 浙江工业大学 Method and system for quickly identifying and counting votes on the basis of layout analysis
CN102663403A (en) * 2012-04-26 2012-09-12 北京工业大学 System and method used for extracting lane information in express way intelligent vehicle-navigation and based on vision
CN103208001A (en) * 2013-02-06 2013-07-17 华南师范大学 Remote sensing image processing method combined with shape self-adaption neighborhood and texture feature extraction
CN103226616A (en) * 2013-05-16 2013-07-31 南京龙渊微电子科技有限公司 Image content retrieval system and image content sparse learning method thereof
CN104217221A (en) * 2014-08-27 2014-12-17 重庆大学 Method for detecting calligraphy and paintings based on textural features

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344815A (en) * 2018-12-13 2019-02-15 深源恒际科技有限公司 Document image classification method
CN109344815B (en) * 2018-12-13 2021-08-13 深源恒际科技有限公司 Document image classification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160309