CN105426884A - Fast document type recognition method based on full-sized feature extraction

Info

Publication number: CN105426884A
Application number: CN201510761290.6A
Authority: CN
Inventors: 王东; 陈俊健; 李晓东; 顾艳春
Original assignee: Foshan University
Current assignee: Foshan University
Priority date: 2015-11-10
Filing date: 2015-11-10
Publication date: 2016-03-23

Abstract

The invention provides a fast document type recognition method based on full-sized feature extraction. The method comprises the following steps of document image preprocessing, including image zooming, graying and noise filtering; document image feature extraction, including Hessian matrix building, scale space generation, primary determination of feature points, precise positioning of the feature points, main direction determination of selected feature points, feature point descriptor construction and feature value string generation; document image feature value comparison, including document similarity calculation and comparison algorithm optimization. According to the method, an image is preprocessed through software, and additional hardware equipment does not need to be added. According to the method, the scale-invariant feature is creatively introduced for improving a typical SURF feature extraction algorithm, so that the problem of matching failure due to error amplification of the SURF algorithm caused by scale variations is fundamentally solved. The method has the advantage that a multi-thread technology and a large cache are used for solving the problems of large data volume calculation during comparison and the harsh time requirement of a user on an electronic government affair platform.

Description

A fast document type recognition method based on full-sized feature extraction

Technical field

The invention belongs to the field of computer image recognition and specifically relates to a fast document type recognition method based on full-sized feature extraction.

Background art

E-government platforms are the front line of government departments' public-facing digital services. Every day, government agencies receive a large number of forms and photocopied materials submitted by users. Identifying the document type of these materials manually costs considerable labor, delays processing, and makes it hard to classify and manage the submitted materials effectively. More advanced recognition technology is therefore urgently needed.

The document recognition software currently running on e-government platforms mainly performs text recognition, using OCR to reduce the data-entry workload of staff. However, no software exists to determine whether unidentified material submitted by a user meets the requirements specified for a given document, which hinders paperless, fast, one-stop service in government departments. If a government platform could not only capture documents but also quickly recognize the documents users submit and tell users whether the submitted materials are correct or incomplete, it would improve the efficiency and image of government departments.

Mathematically, document type recognition is a mapping from pattern space to class space. Three kinds of recognition methods are mainly used at home and abroad: statistical pattern recognition, structural pattern recognition, and fuzzy pattern recognition. Research in this area dates back to the 1970s, has received sustained attention, and has produced thousands of algorithms based on various theories.

One shortcoming of existing document type recognition techniques is poor adaptability. Document images captured by the imaging device of an e-government platform usually suffer from uneven lighting, noise, and similar problems. Once the target document is heavily corrupted by noise or has a large color deviation, existing methods often give unsatisfactory results or fail to recognize it at all. Existing methods are also generally slow and inefficient. It is therefore important to develop a document type recognition method that is fast, accurate, and adaptable to a variety of environments.

Summary of the invention

The present invention addresses the problem that existing document type recognition methods perform poorly or fail completely when the document image to be recognized contains noise or uneven illumination, or is rotated or curled. It proposes a fast document type recognition method based on full-sized feature extraction that effectively handles the influence of illumination, noise, and distortion on document type recognition, is robust to rotation and curling, and recognizes accurately under a variety of lighting conditions. The method is also computationally fast and suits applications with strict real-time requirements.

To solve the above technical problem, the invention is implemented through the following technical solution:

A fast document type recognition method based on full-sized feature extraction, comprising the following steps:

1) Document image preprocessing

(1) image scaling;

(2) image graying;

(3) image brightness equalization;

(4) image noise filtering;

2) Document image feature extraction

(1) Hessian matrix construction;

(2) scale space generation;

(3) preliminary determination and precise positioning of feature points;

(4) determination of the main direction of the selected feature points;

(5) construction of the feature point descriptor;

(6) generation of the feature value string;

3) Document image feature value comparison

(1) document similarity calculation;

(2) optimization of the comparison algorithm.

Further, the document image preprocessing is as follows: the color image is preprocessed, including image scaling, graying, brightness equalization, and noise filtering, so that its size, chromaticity, and contrast meet the basic conditions for document recognition.

Images captured by the imaging device of an e-government platform are high-resolution. If such an image were used directly as the input to feature extraction, more than 1000 feature points might be extracted, which would greatly increase recognition time while contributing little to recognition accuracy. To speed up feature extraction, the original image needs to be scaled.

Because users place documents in different ways, ambient lighting varies, and the paper may be curled, the images captured by the imaging device may show unbalanced illumination and brightness that varies across the image. The image brightness equalization algorithm handles these cases.

Further, the document image feature extraction uses SURF as the feature extraction algorithm, and the SURF extraction algorithm adopts a scale-invariant feature.

Many feature extraction algorithms exist. For document classification and recognition, this method tried both SURF and SIFT. The two are similar: SIFT is more stable and detects more feature points but has higher complexity, whereas SURF is computationally simpler, more efficient, and faster. Because e-government platforms have strict real-time requirements, SURF is selected as the main feature extraction algorithm.

The technical difficulty here is that the SURF algorithm is sensitive to changes in scale. If feature extraction does not achieve scale invariance, even a small difference in size can cause a large number of document classification errors. The SURF algorithm therefore needs to be improved for the specific scenario of document recognition.

This method improves the classical SURF algorithm by introducing a scale-invariant feature. The main idea is that each detected feature point carries a corresponding scale factor. When matching different images, the image scales often differ: distances between feature points change from image to image and objects appear at different sizes, and trying to compensate by simply adjusting the feature point size leads to intensity mismatches. To address this, the method proposes a scale-invariant SURF feature detection in which the scale factor is included when feature points are computed.

Further, the document image feature value comparison is as follows: based on the characteristics of document images, Euclidean distance is used as the basic algorithm for comparing document image feature values, and multithreading and a large cache are used to multiply the comparison speed.

Further, the preliminary determination and precise positioning of feature points comprises:

Each pixel processed by the Hessian matrix is compared with the 26 points in its three-dimensional neighborhood. If it is the maximum or minimum among these 26 points, it is retained as a preliminary feature point. Detection uses a filter whose size corresponds to the resolution of the image at that scale layer. Three-dimensional linear interpolation is then used to obtain sub-pixel feature points, and points whose values fall below a certain threshold are removed. Raising this threshold reduces the number of detected feature points, so that in the end only the few strongest feature points are detected.

Further, taking a 3 × 3 filter as an example, the candidate feature point, which is one of the 9 pixels in its scale layer, is compared with the remaining 8 points in its own scale layer and with the 9 points in each of the two scale layers above and below it, 26 points in total. If the feature value of the pixel is greater than that of the surrounding pixels, the point is determined to be the feature point of that region.

Further, the main direction of the selected feature points is determined as follows:

To ensure rotation invariance, SURF gathers statistics of the Haar wavelet features in the neighborhood of the feature point. Centered on the feature point, within a neighborhood of radius 6S (S is the scale value of the feature point), the horizontal and vertical Haar wavelet responses of all points inside a 60-degree sector are summed. Gaussian weights are applied to these responses so that responses near the feature point contribute more and those far from it contribute less. The responses within the 60-degree range are added to form a new vector. The sector is swept over the whole circular region, and the direction of the longest vector is selected as the main direction of the feature point. Repeating this for each feature point gives the main direction of every feature point.

Further, the feature point descriptor is constructed as follows:

In SURF, a square window is likewise taken around the feature point, with side length 20S (S is the scale at which the feature point was detected). The window is oriented along the main direction detected in step (4). The window is divided into 16 sub-regions, and in each sub-region the horizontal and vertical Haar wavelet features of 25 pixels are accumulated, where horizontal and vertical are defined relative to the main direction. The Haar wavelet features are the sum of the horizontal values, the sum of the absolute horizontal values, the sum of the vertical values, and the sum of the absolute vertical values.

Further, the feature value string is generated as follows: the feature values that the SURF algorithm extracts from a document image form a two-dimensional vector group. Each row of the vector group represents one feature point, and the columns hold the feature values of that feature point. These feature values all consist of pairs of floating-point numbers.

Further, the document similarity calculation uses Euclidean distance, which is converted into a similarity. The similarity of every attribute is defined to lie in [0, 1]: the maximum distance between an attribute of the query image and that of a standard image in the database maps to similarity 0, the minimum distance maps to similarity 1, and similarity is a strictly decreasing function of distance. Through one-by-one comparison, the standard image in the database with the highest similarity to the query image is found, which indicates a successful match: the document the user provided to the e-government platform is valid and is assigned by the system to the corresponding type. If all standard images in the database have very low similarity to the query image, the document the user provided is wrong, and the system prompts the user that an incorrect document was submitted.

Euclidean distance is the most common distance metric and cosine similarity the most common similarity measure; many other distance and similarity measures are variants or derivatives of these two. Because they are calculated differently and measure different characteristics, they suit different data analysis models.

The technical difficulty here is how to speed up feature value comparison. Based on the characteristics of document images, this method uses Euclidean distance as the basic comparison algorithm. An e-government platform serves many users, yet each comparison requires computing the Euclidean distance against every standard document image, so the total comparison time directly determines how long a user waits for a system response. This method uses multithreading and a large cache to multiply the comparison speed, which resolves the comparison speed bottleneck.

Compared with the prior art, the invention has the following beneficial effects:

(1) The system requires no additional hardware on the e-government platform to recognize document types and does not require a high-end imaging device; it improves recognition accuracy by preprocessing images in software.

(2) A scale-invariant feature is introduced to improve the classical SURF feature extraction algorithm: the scale factor is included when feature points are computed. This fundamentally solves the problem of matching failure caused by the SURF algorithm amplifying errors under scale changes.

(3) The CPU and memory resources of the computer system are fully used: multithreading and a large cache handle the large volume of computation produced during comparison and meet users' strict response-time requirements on the e-government platform.

Brief description of the drawings

The accompanying drawings provide a further understanding of the invention and, together with the embodiments, serve to explain the invention; they do not limit it. In the drawings:

Fig. 1 shows the positioning result of Embodiment 1;

Fig. 2 shows the positioning result of Embodiment 2;

Fig. 3 shows the positioning result of Embodiment 3;

Fig. 4 shows the positioning result of Embodiment 4.

Detailed description of the embodiments

Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

The method first preprocesses the document image and then applies a feature algorithm to extract image features. To achieve scale-invariant feature point detection and matching, the SURF algorithm first uses the Hessian matrix to determine candidate points and then applies non-maximum suppression, which greatly reduces computational complexity. After feature points are extracted, the image features are compared; after these steps the document type can be recognized accurately. The document type recognition method comprises three parts, as shown in Figure 1: document image preprocessing, document image feature extraction, and document image feature comparison.

This patent proposes to solve the problems of blurred original images and inconsistent brightness by preprocessing the document image, while analyzing the image regions step by step, and achieves good results.

1. Implementation process

1) Document image preprocessing

(1) Image scaling. The original image captured by the imaging device of the e-government platform is scaled proportionally so that its resolution is brought to about 680*480, which is sufficient for recognition.
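
The proportional scaling step can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `fit_within` is hypothetical, and the choices to preserve the aspect ratio and never enlarge a smaller image are assumptions. Only the 680*480 target comes from the text.

```python
def fit_within(width, height, max_w=680, max_h=480):
    """Return a (width, height) scaled proportionally to fit inside max_w x max_h.

    Assumption: images already smaller than the target are left unchanged.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(1360, 960))
    print(fit_within(2720, 960))
```

The resulting size would then be passed to whatever resampling routine the platform's imaging library provides.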

(2) Image graying. The color information of the original image does not help feature extraction, so to speed up feature extraction the original image is converted to grayscale. Graying is implemented as a weighted sum of the red, green, and blue components.
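
A weighted-sum graying step might look like the following sketch. The patent does not state the weights; the values 0.299, 0.587, and 0.114 used here are the common ITU-R BT.601 luma weights and are an assumption, as is the function name `to_gray`.

```python
def to_gray(pixels, weights=(0.299, 0.587, 0.114)):
    """Convert rows of (R, G, B) tuples to rows of gray levels by a weighted sum.

    The default weights are the BT.601 luma weights (an assumption; the
    source only says that a weighted sum of R, G, and B is used).
    """
    wr, wg, wb = weights
    return [[round(wr * r + wg * g + wb * b) for (r, g, b) in row] for row in pixels]


if __name__ == "__main__":
    image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 255, 0)]]
    print(to_gray(image))
```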

(3) Image brightness equalization. Based on the relationship between the global and local means and variances of the image gray levels, the image is divided into a bright part and a dark part. Adaptive contrast enhancement and constant multiplication are then applied to the two parts respectively, and the enhanced image is finally obtained in combination with background subtraction. Brightness equalization corrects uneven image brightness, enhances image sharpness, and improves document classification accuracy.

(4) Image noise filtering. The imaging device of the e-government platform inevitably introduces irregular interference signals during capture. Some noise is barely noticeable to the human eye but is fatal for computer recognition. Noise filtering reduces the interference of noise while preserving the edges and sharp details of the document image well, which significantly improves recognition accuracy.
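
The text does not name the noise filter. As one illustration of a filter that suppresses isolated noise while keeping edges reasonably sharp, the sketch below applies a 3 × 3 median filter to a grayscale image stored as a list of rows. The choice of a median filter and the name `median_filter_3x3` are assumptions, not the patent's stated method.

```python
def median_filter_3x3(img):
    """Replace each interior pixel by the median of its 3 x 3 neighbourhood.

    Border pixels are copied unchanged to keep the sketch short.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(
                img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out


if __name__ == "__main__":
    noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
    print(median_filter_3x3(noisy))
```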

2) Document image feature extraction

(1) Hessian matrix construction

The SURF algorithm uses the Hessian matrix to extract feature points, so the Hessian matrix is the core of the SURF algorithm. For a function f(x, y), the Hessian matrix H consists of the second-order partial derivatives of the function. The discriminant of H is its determinant, which equals the product of the eigenvalues of H. Points can be classified by the sign of the discriminant: depending on whether its value is positive or negative, a point is judged to be or not to be an extremum. In the SURF algorithm, the function value f(x, y) is usually replaced by the image pixel I(x, y), and a second-order standard Gaussian function is then selected as the filter.
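
For reference, the Hessian matrix and its discriminant described above can be written out as follows. The first line follows directly from the definitions in the text. The second and third lines are taken from the standard published SURF formulation rather than from this document: the symbols L_xx, L_xy, L_yy (image convolved with second-order Gaussian derivatives at scale σ), the box-filter approximations D_xx, D_xy, D_yy, and the weight 0.9 are assumptions about what the method uses.

```latex
H\bigl(f(x,y)\bigr) =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[2ex]
\dfrac{\partial^2 f}{\partial x\,\partial y} & \dfrac{\partial^2 f}{\partial y^2}
\end{bmatrix},
\qquad
\det(H) = \frac{\partial^2 f}{\partial x^2}\,\frac{\partial^2 f}{\partial y^2}
        - \left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^{2}

H(x, y, \sigma) =
\begin{bmatrix}
L_{xx}(x, y, \sigma) & L_{xy}(x, y, \sigma) \\
L_{xy}(x, y, \sigma) & L_{yy}(x, y, \sigma)
\end{bmatrix}

\det\bigl(H_{\text{approx}}\bigr) \approx D_{xx}\,D_{yy} - \bigl(0.9\,D_{xy}\bigr)^{2}
```

A positive determinant means both eigenvalues have the same sign, which is the condition for a local extremum; a negative determinant indicates a saddle point.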

To speed up computation, this method uses an approximation and computes with an integral image, which greatly increases speed.
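
The reason an integral image speeds things up is that the sum over any rectangle can be read off with four lookups, regardless of the rectangle's size. The sketch below shows the idea on a grayscale image stored as a list of rows; the function names `integral_image` and `box_sum` are illustrative, not from the source.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros.

    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img over rows top..bottom-1 and columns left..right-1 in O(1)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


if __name__ == "__main__":
    table = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(box_sum(table, 1, 1, 3, 3))
```

Box-shaped filters of any size can then be evaluated at constant cost per pixel, which is what makes changing the filter size cheaper than resizing the image.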

(2) Scale space generation

The scale space of an image is the representation of the image at different resolutions. In the SURF algorithm the image size always stays the same: the images to be detected in different octaves are obtained by changing the size of the Gaussian blur, and the individual images within one octave also use Gaussian templates of different scales. The algorithm allows multiple layers of the scale space to be processed simultaneously without subsampling the image, which improves performance.

The traditional algorithm builds a pyramid structure in which the image size changes and the Gaussian function is applied repeatedly to smooth the sub-layers. This method keeps the original image unchanged and changes only the filter size. Omitting the downsampling step in this way naturally increases processing speed.
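
To make "change only the filter size" concrete, the sketch below lists the filter side lengths per octave used in the standard published SURF scheme (9, 15, 21, 27 in the first octave, and so on) together with the scale each corresponds to. These specific sizes and the 9-pixel filter corresponding to scale 1.2 come from the standard SURF formulation, not from this document, so they should be read as assumptions about the layers this method uses.

```python
def filter_size(octave, interval):
    """Side length of the box filter for a 0-based octave and interval."""
    return 3 * (2 ** (octave + 1) * (interval + 1) + 1)


def filter_scale(size, base_size=9, base_sigma=1.2):
    """Scale (sigma) that a filter of the given size approximates."""
    return base_sigma * size / base_size


def scale_space_filters(octaves=3, intervals=4):
    """Filter sizes for every layer; the image itself is never resized."""
    return [[filter_size(o, i) for i in range(intervals)] for o in range(octaves)]


if __name__ == "__main__":
    for sizes in scale_space_filters():
        print([(s, round(filter_scale(s), 2)) for s in sizes])
```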

(3) Preliminary determination and precise positioning of feature points

Each pixel processed by the Hessian matrix is compared with the 26 points in its three-dimensional neighborhood. If it is the maximum or minimum among these 26 points, it is retained as a preliminary feature point. Detection uses a filter whose size corresponds to the resolution of the image at that scale layer. Taking a 3 × 3 filter as an example, the candidate feature point, which is one of the 9 pixels in its scale layer, is compared with the remaining 8 points in its own scale layer and with the 9 points in each of the two scale layers above and below it, 26 points in total. If the feature value of the pixel is greater than that of the surrounding pixels, the point is determined to be the feature point of that region.

Three-dimensional linear interpolation is then used to obtain sub-pixel feature points, and points whose values fall below a certain threshold are removed. Raising this threshold reduces the number of detected feature points, so that in the end only the few strongest feature points are detected.
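
The 26-neighbor comparison can be sketched as follows, with the Hessian responses held in a nested list indexed as `stack[layer][row][column]`. The function names are hypothetical, and applying the threshold to the absolute value of the response is an assumption; the source only says that points below a threshold are removed. Sub-pixel interpolation is not shown.

```python
def neighbours_26(stack, s, y, x):
    """Collect the 26 responses around (s, y, x): 8 in the same layer,
    9 in the layer above, and 9 in the layer below."""
    return [
        stack[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]


def is_preliminary_feature(stack, s, y, x, threshold=0.0):
    """True if the response is a strict maximum or minimum of its 26 neighbours
    and its magnitude reaches the threshold (threshold handling is an assumption)."""
    value = stack[s][y][x]
    if abs(value) < threshold:
        return False
    around = neighbours_26(stack, s, y, x)
    return value > max(around) or value < min(around)


if __name__ == "__main__":
    responses = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    responses[1][1][1] = 5.0
    print(is_preliminary_feature(responses, 1, 1, 1))
```

The caller is expected to pass only interior positions, since border points do not have a full set of 26 neighbors.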

(4) Determination of the main direction of the selected feature points

To ensure rotation invariance, SURF does not compute a gradient histogram but instead gathers statistics of the Haar wavelet features in the neighborhood of the feature point. Centered on the feature point, within a neighborhood of radius 6S (S is the scale value of the feature point), the horizontal and vertical Haar wavelet responses of all points inside a 60-degree sector are summed. Gaussian weights are applied to these responses so that responses near the feature point contribute more and those far from it contribute less. The responses within the 60-degree range are added to form a new vector. The sector is swept over the whole circular region, and the direction of the longest vector is selected as the main direction of the feature point. Repeating this for each feature point gives the main direction of every feature point.
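
The sliding 60-degree sector can be sketched as below. The sketch follows the standard SURF convention, in which each sample's Haar responses (dx, dy) are treated as a point in response space and the sector slides over the angle of that point; this reading, the Gaussian sigma of 2S, the 5-degree sweep step, and all function names are assumptions rather than details given in the text.

```python
import math


def gaussian_weight(offset_x, offset_y, scale, sigma_factor=2.0):
    """Weight for a sample at the given offset from the feature point.

    sigma = sigma_factor * scale; the factor 2.0 is an assumption.
    """
    sigma = sigma_factor * scale
    return math.exp(-(offset_x ** 2 + offset_y ** 2) / (2.0 * sigma ** 2))


def main_direction(responses, window=math.pi / 3, steps=72):
    """Return the angle (radians) of the longest summed response vector.

    responses: list of (dx, dy) Haar responses, already Gaussian-weighted.
    """
    two_pi = 2.0 * math.pi
    angles = [math.atan2(dy, dx) % two_pi for dx, dy in responses]
    best_len2, best_angle = 0.0, 0.0
    for k in range(steps):
        start = k * two_pi / steps
        sum_x = sum_y = 0.0
        for (dx, dy), angle in zip(responses, angles):
            if (angle - start) % two_pi < window:  # inside the 60-degree sector
                sum_x += dx
                sum_y += dy
        len2 = sum_x ** 2 + sum_y ** 2
        if len2 > best_len2:
            best_len2, best_angle = len2, math.atan2(sum_y, sum_x)
    return best_angle


if __name__ == "__main__":
    samples = [(1.0, 0.0)] * 5 + [(0.0, 0.2)]
    print(main_direction(samples))
```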

(5) Construction of the feature point descriptor

In SURF, a square window is likewise taken around the feature point, with side length 20S (S is the scale at which the feature point was detected). The window is oriented along the main direction detected in step (4). The window is divided into 16 sub-regions, and in each sub-region the horizontal and vertical Haar wavelet features of 25 pixels are accumulated, where horizontal and vertical are defined relative to the main direction. The Haar wavelet features are the sum of the horizontal values, the sum of the absolute horizontal values, the sum of the vertical values, and the sum of the absolute vertical values.
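
With 16 sub-regions and 4 values per sub-region, the descriptor has 16 × 4 = 64 components. The sketch below accumulates them from two 20 × 20 grids of Haar responses that are assumed to be already sampled along the main direction; the grid layout, the absence of normalization, and the name `build_descriptor` are assumptions made for illustration.

```python
def build_descriptor(dx, dy):
    """Build a 64-value descriptor from 20 x 20 grids of Haar responses.

    dx, dy: horizontal and vertical responses relative to the main direction.
    For each of the 4 x 4 sub-regions (5 x 5 samples each) the descriptor
    stores: sum(dx), sum(|dx|), sum(dy), sum(|dy|).
    """
    descriptor = []
    for block_y in range(4):
        for block_x in range(4):
            sum_dx = abs_dx = sum_dy = abs_dy = 0.0
            for y in range(block_y * 5, block_y * 5 + 5):
                for x in range(block_x * 5, block_x * 5 + 5):
                    sum_dx += dx[y][x]
                    abs_dx += abs(dx[y][x])
                    sum_dy += dy[y][x]
                    abs_dy += abs(dy[y][x])
            descriptor.extend([sum_dx, abs_dx, sum_dy, abs_dy])
    return descriptor


if __name__ == "__main__":
    flat_dx = [[1.0] * 20 for _ in range(20)]
    flat_dy = [[-2.0] * 20 for _ in range(20)]
    print(len(build_descriptor(flat_dx, flat_dy)))
```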

SURF's use of the Hessian matrix to find local extrema of the image is quite stable, but the main-direction stage relies too heavily on the gradient directions of pixels in the local region, which may make the main direction inaccurate. The subsequent feature vector extraction and matching depend heavily on the main direction, so even a small angular deviation is amplified in the subsequent feature matching and causes the match to fail.

This method improves the classical SURF algorithm with a compromise: an appropriate number of layers is taken and interpolation is then performed. This solves the problem in the classical SURF algorithm whereby the layers of the image pyramid are not spaced closely enough, which introduces scale error; because the subsequent feature vector extraction relies on that same scale, the error is amplified and document type matching ultimately fails.

(6) Generation of the feature value string

The feature values that the SURF algorithm extracts from a document image form a two-dimensional vector group. Each row of the vector group represents one feature point, and the columns hold the feature values of that feature point. These feature values all consist of pairs of floating-point numbers. The features of the standard images ultimately need to be stored in a database to await comparison, so the two-dimensional vector group is converted to a character string for storage. If the e-government platform has storage space or security requirements, the string can additionally be compressed or encrypted.
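
One possible way to turn the two-dimensional vector group into a string and back is sketched below. The patent does not specify the string format; the comma and semicolon separators, the optional zlib plus Base64 compression, and all function names are assumptions chosen for illustration. Encryption is not shown.

```python
import base64
import zlib


def features_to_string(features):
    """Serialize rows of floats: values joined by ',', rows joined by ';'."""
    return ";".join(",".join(repr(float(v)) for v in row) for row in features)


def string_to_features(text):
    """Parse the string produced by features_to_string back into rows of floats."""
    if not text:
        return []
    return [[float(v) for v in row.split(",")] for row in text.split(";")]


def compress_string(text):
    """Optional step: compress the string and make it safe to store as text."""
    return base64.b64encode(zlib.compress(text.encode("ascii"))).decode("ascii")


def decompress_string(blob):
    """Inverse of compress_string."""
    return zlib.decompress(base64.b64decode(blob)).decode("ascii")


if __name__ == "__main__":
    rows = [[0.5, -1.25], [3.0, 0.1]]
    text = features_to_string(rows)
    print(text)
    print(string_to_features(decompress_string(compress_string(text))) == rows)
```

Using `repr` for each value keeps the round trip exact, so a feature vector read back from the database compares identically to the one that was stored.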

3) Document image feature value comparison

(1) Document similarity calculation

Euclidean distance reflects the absolute difference between the numerical features of individuals, so it is used more in analyses that need to capture differences in the magnitude of each dimension, for example using user behavior indicators to analyze similarity or difference in user value. Cosine similarity distinguishes differences mainly by direction and is insensitive to absolute values, so it is used more to distinguish similarity and difference in user interests from users' content ratings; it also corrects for possibly inconsistent rating scales between users (because cosine similarity is insensitive to absolute values). Because the features that SURF extracts from a document image form a two-dimensional vector group, and Euclidean distance is well suited to operating on such a vector group, this method uses Euclidean distance as the algorithm for comparing document features.

After the Euclidean distance between a standard image and the query image is calculated, the distance is converted into a similarity. The similarity of every attribute is defined to lie in [0, 1]: the maximum distance between an attribute of the query image and that of a standard image in the database maps to similarity 0, the minimum distance maps to similarity 1, and similarity is a strictly decreasing function of distance. Through one-by-one comparison, the standard image in the database with the highest similarity to the query image is found, which indicates a successful match: the document the user provided to the e-government platform is valid and is assigned by the system to the corresponding type. If all standard images in the database have very low similarity to the query image, the document the user provided is wrong, and the system prompts the user that an incorrect document was submitted.
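
The distance-to-similarity mapping and the one-by-one comparison can be sketched as follows. Several details are not given in the text and are assumptions here: the image-level distance is taken as the mean nearest-neighbor Euclidean distance between descriptors, the mapping is linear with fixed bounds `d_min` and `d_max`, and a `min_similarity` cutoff decides when every match counts as "very low". All names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def image_distance(query, standard):
    """Mean distance from each query descriptor to its nearest standard descriptor
    (an assumed way to compare two vector groups)."""
    return sum(min(euclidean(q, s) for s in standard) for q in query) / len(query)


def to_similarity(distance, d_min, d_max):
    """Linear, strictly decreasing map: d_min -> 1, d_max -> 0, clamped to [0, 1]."""
    if d_max <= d_min:
        return 1.0
    clamped = min(max(distance, d_min), d_max)
    return 1.0 - (clamped - d_min) / (d_max - d_min)


def classify(query, database, d_max, d_min=0.0, min_similarity=0.5):
    """Return (label, similarity) of the best standard image, or (None, similarity)
    when even the best match is below min_similarity."""
    best_label, best_sim = None, -1.0
    for label, standard in database.items():
        sim = to_similarity(image_distance(query, standard), d_min, d_max)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return (best_label if best_sim >= min_similarity else None), best_sim


if __name__ == "__main__":
    db = {"form_a": [[0.0, 0.0], [1.0, 1.0]], "form_b": [[5.0, 5.0], [6.0, 6.0]]}
    print(classify([[0.0, 0.1], [1.0, 0.9]], db, d_max=10.0))
    print(classify([[100.0, 100.0]], db, d_max=10.0))
```

A `None` label corresponds to the case where the system would prompt the user that an incorrect document was submitted.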

(2) Optimization of the comparison algorithm

At present no index can be built on document feature values in this two-dimensional vector group form, and no good optimization algorithm exists to speed up the one-by-one comparison. In the whole e-government document classification and recognition process, each standard image needs its feature values extracted only once, converted to the corresponding string, and stored in the database. However, because the query image must undergo a Euclidean distance computation with every standard image, the comparison is very time-consuming. This method uses multithreading and a large cache to reduce the impact on e-government platform efficiency of the large volume of computation and of data storage involved in feature value comparison.
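
The combination of a cache and a thread pool can be sketched as below. This shows structure only and is not the patent's implementation: the in-memory dictionary `STORED` is a hypothetical stand-in for the database of feature strings, `lru_cache` plays the role of the large cache so that each string is parsed once, and a `ThreadPoolExecutor` spreads the per-image comparisons over worker threads. In CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so a real speedup would need process-based workers or native numeric code.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical stand-in for the database of stored feature value strings.
STORED = {"form_a": "0.0,0.0;1.0,1.0", "form_b": "5.0,5.0;6.0,6.0"}


@lru_cache(maxsize=None)
def load_features(label):
    """Parse a stored feature string once and keep the result in memory."""
    text = STORED[label]
    return tuple(tuple(float(v) for v in row.split(",")) for row in text.split(";"))


def image_distance(query, standard):
    """Mean nearest-neighbour Euclidean distance (an assumed image-level distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(min(dist(q, s) for s in standard) for q in query) / len(query)


def best_match(query, labels, workers=4):
    """Compare the query against every standard image in parallel.

    Returns (distance, label) for the closest standard image.
    """
    def score(label):
        return image_distance(query, load_features(label)), label

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(score, labels))


if __name__ == "__main__":
    print(best_match([(0.0, 0.1), (1.0, 0.9)], list(STORED)))
    print(load_features.cache_info())
```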

2. Embodiments

[Embodiment 1] As shown in Figure 2. In Embodiment 1 the query document is enlarged by 15% relative to the standard document. The experimental result shows that it is recognized accurately.

[Embodiment 2] As shown in Figure 3. In Embodiment 2 the brightness of the query document is increased by 15% relative to the standard document. The experimental result shows that it is recognized accurately.

[Embodiment 3] As shown in Figure 4. In Embodiment 3 the tilt angle of the query document is increased by 15% relative to the standard document. The experimental result shows that it is recognized accurately.

Finally, it should be noted that the above are only preferred embodiments of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, a person skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A fast document type recognition method based on full-sized feature extraction, characterized in that it comprises the following steps:

1) Document image preprocessing

(1) image scaling;

(2) image graying;

(3) image brightness equalization;

(4) image noise filtering;

2) Document image feature extraction

(1) Hessian matrix construction;

(2) scale space generation;

(3) preliminary determination and precise positioning of feature points;

(4) determination of the main direction of the selected feature points;

(5) construction of the feature point descriptor;

(6) generation of the feature value string;

3) Document image feature value comparison

(1) document similarity calculation;

(2) optimization of the comparison algorithm.

2. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the document image preprocessing is: the color image is preprocessed, including image scaling, graying, brightness equalization, and noise filtering, so that its size, chromaticity, and contrast meet the basic conditions for document recognition.

3. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the document image feature extraction uses SURF as the feature extraction algorithm, and the SURF extraction algorithm adopts a scale-invariant feature.

4. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the document image feature value comparison is: based on the characteristics of document images, Euclidean distance is used as the basic algorithm for comparing document image feature values, and multithreading and a large cache are used to multiply the comparison speed.

5. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the preliminary determination and precise positioning of feature points comprises:

Each pixel processed by the Hessian matrix is compared with the 26 points in its three-dimensional neighborhood; if it is the maximum or minimum among these 26 points, it is retained as a preliminary feature point; detection uses a filter whose size corresponds to the resolution of the image at that scale layer;

6. The fast document type recognition method based on full-sized feature extraction according to claim 5, characterized in that, for a 3 × 3 filter, the candidate feature point, which is one of the 9 pixels in its scale layer, is compared with the remaining 8 points in its own scale layer and with the 9 points in each of the two scale layers above and below it, 26 points in total; if the feature value of the pixel is greater than that of the surrounding pixels, the point is determined to be the feature point of that region.

7. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the main direction of the selected feature points is determined as follows:

8. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the feature point descriptor is constructed as follows:

9. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the feature value string is generated as follows: the feature values that the SURF algorithm extracts from a document image form a two-dimensional vector group; each row of the vector group represents one feature point, and the columns hold the feature values of that feature point. These feature values all consist of pairs of floating-point numbers.

10. The fast document type recognition method based on full-sized feature extraction according to claim 1, characterized in that the document similarity calculation uses Euclidean distance, which is converted into a similarity; the similarity of every attribute is defined to lie in [0, 1]; the maximum distance between an attribute of the query image and that of a standard image in the database maps to similarity 0, the minimum distance maps to similarity 1, and similarity is a strictly decreasing function of distance; through one-by-one comparison, the standard image in the database with the highest similarity to the query image is found, which indicates a successful match, meaning the document the user provided to the e-government platform is valid and is assigned by the system to the corresponding type; if all standard images in the database have very low similarity to the query image, the document the user provided is wrong, and the system prompts the user that an incorrect document was submitted.