CN104866822B - A coarse classification method for document images based on SIVV features - Google Patents

A coarse classification method for document images based on SIVV features

Info

Publication number
CN104866822B
Authority
CN
China
Prior art keywords
formula
sivv
file
image
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510227324.3A
Other languages
Chinese (zh)
Other versions
CN104866822A (en)
Inventor
马廷淮
赵波
张正宇
霍晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Discipline Network Beijing Co ltd
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201510227324.3A
Publication of CN104866822A
Application granted
Publication of CN104866822B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The present invention discloses a coarse classification method for document images based on SIVV features, comprising the following steps: acquire the document images; preprocess them; apply a window; compute the SIVV feature of each document image; and, for each document, compute the correlation coefficient between its SIVV feature and those of the other documents. If the pairwise correlation coefficients among K documents all exceed a given threshold, the K documents are considered to belong to the same class. The purpose of the invention is to provide a new coarse classification method for document images that automatically divides document images into several classes according to the correlation coefficients of their SIVV features, without needing to extract the document text accurately. The method is robust and classifies quickly.

Description

A coarse classification method for document images based on SIVV features
Technical field
The invention belongs to the field of document classification within document processing, and in particular relates to a coarse classification method for document images based on SIVV (Spectral Image Validation and Verification) features.
Background
In the Internet era, the volume of document information on the network is enormous. Manual classification cannot cope with large-scale document classification, so automatic classification of documents by computer is of great significance. A large share of the document resources on the Internet are in image or PDF format, and document content is no longer limited to plain text. Accurately extracting text from images or PDFs is difficult, so traditional classification based on text content performs poorly on image or PDF documents.
Current mainstream document image classification techniques fall into three categories: methods based on text features, methods based on image features, and methods based on hybrid features.
Automatic document classification based on text content can be divided into two main classes: knowledge-based classification and statistics-based classification (Sun Bin. A survey of information extraction technology (II) [J]. Terminology Standardization and Information Technology, 2002, 4: 008.). Knowledge-based text classification requires a large number of classification rules, and the number of rules required grows exponentially with system complexity, so it cannot classify large volumes of data accurately. Statistics-based text classification methods include K-nearest neighbors (Guo G, Wang H, Bell D, et al. KNN model-based approach in classification [M]// On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. Springer Berlin Heidelberg, 2003: 986-996.), support vector machines (SVM), naive Bayes, decision trees, and neural networks. These methods depend heavily on the accuracy of the extracted text content. For documents in image or PDF format, especially low-quality document images, accurate text is hard to obtain, so accurate classification is not possible.
Classification methods based on image features (Shin C, Doermann D, Rosenfeld A. Classification of document pages using structure-based features [J]. International Journal on Document Analysis and Recognition, 2001, 3(4): 232-247.) mainly use image characteristics of the document, such as gray-level histograms, region color descriptions, texture features, and shape features. Common image classification methods include decision trees, support vector machines, genetic algorithms, Bayesian methods, and neural networks. The SIVV feature used in the present invention (Libert J M, Orandi S, Grantham J. A 1D Spectral Image Validation/Verification Metric for Fingerprints (NIST IR 7599), National Institute of Standards and Technology, Gaithersburg, MD, 2009 [J]) is an image feature; documents are coarsely classified using the correlation coefficient of their SIVV features.
Methods based on hybrid features (Chen F, Girgensohn A, Cooper M, et al. Genre identification for office document search and browsing [J]. International Journal on Document Analysis and Recognition (IJDAR), 2012, 15(3): 167-182.) combine the image, structure, and text features of a document for classification. Hybrid methods often suffer from high time complexity and slow classification.
Summary of the invention
The present invention discloses a coarse classification method for document images based on SIVV features. It automatically divides document images into several classes according to the correlation coefficients of their SIVV features, and is robust, accurate, and fast. The details are as follows:
A coarse classification method for document images based on SIVV features, comprising the following steps:
(1) preprocess the document image;
(2) apply a window to each image using a 2D Blackman window of the same size as the document image (as shown in Fig. 1);
(3) compute the SIVV feature of each windowed image (as shown in Fig. 2);
(4) compute the correlation coefficient between the SIVV features of every pair of images;
(5) if the pairwise correlation coefficients r among K documents all exceed a given threshold, the K documents are considered to belong to the same class.
In step (2), the 2D Blackman window is obtained as follows:
The one-dimensional Blackman window of length Q is given by formula (1),
where Q = M, q is the index of a pixel within the one-dimensional Blackman window, q = 1, 2, ..., Q, and Q is the length of the one-dimensional Blackman window;
Taking the outer product of formula (1) with itself yields the matrix form of the 2D Blackman window. In the two factors, Q takes the values M and N respectively, i.e. the dimensions of the windowed image; M and N here are the same as M and N in step (31) and in formula (4);
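For reference, the textbook Blackman window and its separable two-dimensional extension can be written as below. This is the standard definition adapted to the 1-based index q used in the text, offered only as an assumption about what formula (1) denotes; the coefficients and index convention of the originally published formula may differ. The symbols w_Q and W are introduced here for illustration.

```latex
w_Q(q) = 0.42 - 0.5 \cos\!\left( \frac{2\pi (q-1)}{Q-1} \right) + 0.08 \cos\!\left( \frac{4\pi (q-1)}{Q-1} \right),
\qquad q = 1, 2, \ldots, Q

W(x, y) = w_M(x)\, w_N(y), \qquad x = 1, \ldots, M, \quad y = 1, \ldots, N
```

Under this form the window is close to zero at the borders and reaches 1 at the centre when Q is odd, which suppresses the edge discontinuities that would otherwise leak into the spectrum.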
The details of step (3) are:
(31) compute the spectrum of the windowed image using formula (2),
where h(x, y) is the pixel value at coordinates (x, y) in the windowed image; M and N are the dimensions of the windowed image; and i is the imaginary unit;
(32) compute the two-dimensional normalized log power spectrum of the windowed image using formula (3):
$P(u, v) = |H(u, v)|^2 \quad (3)$
(33) transform formula (3) into polar coordinates using formula (4);
the power spectrum in polar coordinates is denoted P(ρ, θ), where ρ is obtained by dividing the largest dimension of the windowed image into equal parts;
(34) sum P over all angles θ using formula (5) to obtain the power spectrum as a function of ρ;
the normalized power spectrum as a function of ρ is then defined over ρ ∈ [0, 0.5] cycles/pixel.
In step (32), the normalization methods used include 10*lg P(u, v).
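Steps (31) to (34) can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the function name `sivv_feature`, the parameter `n_bins`, the binning of ρ into equal-width intervals, and taking the 10*lg normalization after the angular sum (relative to the peak) are all assumptions made for the example.

```python
import numpy as np

def sivv_feature(windowed, n_bins=64):
    """Radial log power spectrum of a windowed grayscale image (hypothetical helper)."""
    M, N = windowed.shape
    H = np.fft.fft2(windowed)            # step (31): 2D discrete Fourier transform
    P = np.abs(H) ** 2                   # step (32): power spectrum, formula (3)

    # Step (33): radial frequency of every DFT sample, in cycles/pixel.
    fu = np.fft.fftfreq(M)[:, None]      # vertical frequencies in [-0.5, 0.5)
    fv = np.fft.fftfreq(N)[None, :]      # horizontal frequencies in [-0.5, 0.5)
    rho = np.hypot(fu, fv).ravel()

    # Step (34): sum the power over all angles inside each radial bin on [0, 0.5].
    keep = rho <= 0.5                    # discard corners beyond 0.5 cycles/pixel
    idx = np.minimum((rho[keep] * 2 * n_bins).astype(int), n_bins - 1)
    radial = np.bincount(idx, weights=P.ravel()[keep], minlength=n_bins)

    # Assumed normalization: decibels relative to the largest bin (peak = 0 dB).
    eps = np.finfo(float).tiny
    return 10.0 * np.log10((radial + eps) / (radial.max() + eps))
```

The result is a one-dimensional curve over ρ, comparable in shape to the feature plotted in Fig. 2, that can be compared between documents regardless of their pixel dimensions because every image is reduced to the same number of radial bins.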
The method computes the correlation coefficient of two SIVV features using formula (7); r ∈ [0, 1], and the closer r is to 1, the more likely the two documents belong to the same class.
The classification threshold can be set according to the specific task; if the pairwise correlation coefficients r among K documents all exceed this threshold, the K documents are considered to belong to the same class.
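The pairwise rule can be sketched as below. The example assumes Pearson correlation (via `numpy.corrcoef`) as a stand-in for formula (7) and uses a simple greedy assignment to form groups; the function names `correlation_matrix`, `is_same_class` and `greedy_groups` are hypothetical, and the text does not specify how groups are enumerated.

```python
import numpy as np
from itertools import combinations

def correlation_matrix(features):
    """Pairwise Pearson correlation between feature vectors (one per row)."""
    return np.corrcoef(np.asarray(features, dtype=float))

def is_same_class(r, members, threshold):
    """True if every pair of documents in `members` correlates above `threshold`."""
    return all(r[i, j] > threshold for i, j in combinations(members, 2))

def greedy_groups(r, threshold):
    """Put each document in the first group whose members all correlate with it."""
    groups = []
    for doc in range(r.shape[0]):
        for group in groups:
            if all(r[doc, m] > threshold for m in group):
                group.append(doc)
                break
        else:
            groups.append([doc])         # no compatible group: start a new class
    return groups
```

Because a document joins a group only if it exceeds the threshold against every existing member, each resulting group satisfies the all-pairs condition stated in the text. The greedy order can affect which groups are formed, so this is one possible grouping strategy, not the only one.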
Beneficial effects:
The present invention proposes a new coarse classification method for document images that automatically divides document images into several classes according to the correlation coefficients of their SIVV features. The method is robust and fast, and can help Internet document providers coarsely classify large document collections accurately and quickly.
Description of the drawings
Fig. 1 is a schematic diagram of the 2D Blackman window used in the present invention;
Fig. 2 is a plot of the SIVV feature of a document image computed by the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the drawings.
A coarse classification method for document images based on SIVV features, comprising the following steps:
(1) preprocess the document image;
Before classification, the original document image is first converted to a grayscale image, the effective region of the document is segmented, and the effective region is denoised. The preprocessed image is the input to the subsequent steps;
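A minimal preprocessing sketch in NumPy is given below. The text does not say how the effective region is segmented or which denoising filter is used, so the choices here (BT.601 luminance weights, a bounding box around pixels that differ from a near-white background, and a 3x3 median filter) and all function names are assumptions for illustration.

```python
import numpy as np

def to_gray(rgb):
    """Luminance-weighted grayscale conversion (ITU-R BT.601 weights)."""
    return np.asarray(rgb, dtype=float)[..., :3] @ np.array([0.299, 0.587, 0.114])

def crop_effective_region(gray, background=255.0, tol=10.0):
    """Crop to the bounding box of pixels that differ from the background."""
    mask = np.abs(gray - background) > tol
    if not mask.any():
        return gray                      # blank page: nothing to crop
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def median3x3(gray):
    """3x3 median filter with edge replication, a simple denoiser."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    shifts = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(shifts), axis=0)

def preprocess(rgb):
    """Grayscale conversion, effective-region crop, then denoising."""
    return median3x3(crop_effective_region(to_gray(rgb)))
```

The order follows the text: grayscale conversion first, then segmentation of the effective region, then denoising of that region only.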
(2) apply a window to each image using a 2D Blackman window of the same size as the document image (as shown in Fig. 1);
The preprocessed document image is multiplied element-wise by a 2D Blackman window of the same size to obtain the windowed document image. The 2D Blackman window is obtained as follows:
The one-dimensional Blackman window of length Q is given by formula (1),
where Q = M, q is the index of a pixel within the one-dimensional Blackman window, and q = 1, 2, ..., Q.
Taking the outer product of formula (1) with itself yields the matrix form of the 2D Blackman window, shown in Fig. 1.
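The windowing step can be sketched with NumPy's built-in `numpy.blackman`, which implements the standard Blackman window with 0-based indexing. The helper names `blackman_2d` and `apply_window` are hypothetical, and reading the combination of the two one-dimensional windows as an outer product is an interpretation of the text.

```python
import numpy as np

def blackman_2d(M, N):
    """M x N separable Blackman window: outer product of two 1D windows."""
    return np.outer(np.blackman(M), np.blackman(N))

def apply_window(gray):
    """Element-wise product of a grayscale image with a window of the same size."""
    M, N = gray.shape
    return gray * blackman_2d(M, N)
```

For example, `apply_window(np.ones((5, 7)))` returns the window itself, with values near zero along the border and 1.0 at the centre pixel.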
(3) compute the SIVV feature of each windowed image; an SIVV feature curve is shown in Fig. 2. The computation is as follows:
As in formula (2), the spectrum of the windowed image is computed using the discrete Fourier transform,
where h(x, y) is the pixel value at coordinates (x, y) in the windowed image, and M and N are the dimensions of the windowed image.
The two-dimensional normalized log power spectrum of the windowed image is computed using formula (3):
$P(u, v) = |H(u, v)|^2 \quad (3)$
The normalization methods that can be used include 10*lg P(u, v).
The 2D power spectrum of formula (3) in rectangular coordinates is transformed into polar coordinates using formula (4).
The power spectrum in polar coordinates is denoted P(ρ, θ), where ρ is obtained by dividing the largest dimension of the windowed image into equal parts and ranges over [0, 0.5] cycles/pixel.
Finally, P is summed over all angles θ using formula (5) to obtain the power spectrum as a function of ρ.
The normalized power spectrum as a function of ρ is then defined over ρ ∈ [0, 0.5] cycles/pixel.
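As a reference for the first part of step (3), the standard textbook forms of the two-dimensional discrete Fourier transform and of the decibel power spectrum are written out below, using the symbols h, H, P, M, N and i from the text. The unnormalized DFT convention, the zero-based index ranges, the symbol P_dB, and reading lg as the base-10 logarithm are assumptions; formula (2) as originally published may differ, for example by a scale factor such as 1/(MN).

```latex
H(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} h(x, y)\, e^{-i 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)},
\qquad u = 0, \ldots, M-1, \quad v = 0, \ldots, N-1

P(u, v) = \left| H(u, v) \right|^{2}, \qquad
P_{\mathrm{dB}}(u, v) = 10 \log_{10} P(u, v)
```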
(4) compute the correlation coefficient of two SIVV features using formula (7); r ∈ [0, 1], and the closer r is to 1, the more likely the two documents belong to the same class.
(5) if the pairwise correlation coefficients r among K documents all exceed a given threshold, the K documents are considered to belong to the same class.
The classification threshold is set according to the specific classification task and can generally be set between 0.7 and 0.9.
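For two SIVV feature vectors a and b sampled at the same radial frequencies, one widely used correlation coefficient is Pearson's, sketched below. The symbols a_k, b_k and the index k are illustrative. Pearson's r ranges over [-1, 1], whereas the text states r ∈ [0, 1], so formula (7) may well be defined differently; this form is shown only as a plausible candidate.

```latex
r = \frac{\sum_{k} \left( a_k - \bar{a} \right) \left( b_k - \bar{b} \right)}
         {\sqrt{\sum_{k} \left( a_k - \bar{a} \right)^{2}} \; \sqrt{\sum_{k} \left( b_k - \bar{b} \right)^{2}}}
```

With a threshold of 0.8, for instance, two documents whose feature curves give r = 0.93 would be placed in the same class, while a pair with r = 0.65 would not.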
The above embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (1)

1. A coarse classification method for document images based on SIVV features, characterized by comprising the following steps:
(1) preprocess the document image;
(2) apply a window to each image using a 2D Blackman window of the same size as the document image;
(3) compute the SIVV feature of each windowed image;
(4) compute the correlation coefficient between the SIVV features of every pair of images;
(5) if the pairwise correlation coefficients r among K documents all exceed a given threshold, the K documents are considered to belong to the same class;
In step (2), the 2D Blackman window is obtained as follows:
The one-dimensional Blackman window of length Q is given by formula (1),
where q is the index of a pixel within the one-dimensional Blackman window, q = 1, 2, ..., Q, and Q is the length of the one-dimensional Blackman window;
Taking the outer product of formula (1) with itself yields the matrix form of the two-dimensional Blackman window; in the two factors, Q takes the values M and N respectively, i.e. the dimensions of the windowed image, and M and N here are the same as M and N in step (31) and in formula (4);
The details of step (3) are:
(31) compute the spectrum of the windowed image using formula (2),
where h(x, y) is the pixel value at coordinates (x, y) in the windowed image; M and N are the dimensions of the windowed image; and i is the imaginary unit;
(32) compute the two-dimensional normalized log power spectrum of the windowed image using formula (3):
$P(u, v) = |H(u, v)|^2 \quad (3)$
(33) transform formula (3) into polar coordinates using formula (4);
the power spectrum in polar coordinates is denoted P(ρ, θ), where ρ is obtained by dividing the largest dimension of the windowed image into equal parts;
(34) sum P over all angles θ using formula (5) to obtain the power spectrum as a function of ρ;
the normalized power spectrum as a function of ρ is then defined over ρ ∈ [0, 0.5] cycles/pixel;
In step (32), the normalization methods used include 10*lg P(u, v);
The correlation coefficient of two SIVV features is computed using formula (7); r ∈ [0, 1], and the closer r is to 1, the more likely the two documents belong to the same class;
The classification threshold is set according to the specific task; if the pairwise correlation coefficients r among K documents all exceed this threshold, the K documents are considered to belong to the same class.
CN201510227324.3A 2015-05-06 2015-05-06 A coarse classification method for document images based on SIVV features Active CN104866822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510227324.3A CN104866822B (en) 2015-05-06 2015-05-06 A coarse classification method for document images based on SIVV features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510227324.3A CN104866822B (en) 2015-05-06 2015-05-06 A coarse classification method for document images based on SIVV features

Publications (2)

Publication Number Publication Date
CN104866822A CN104866822A (en) 2015-08-26
CN104866822B true CN104866822B (en) 2018-08-24

Family

ID=53912643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510227324.3A Active CN104866822B (en) 2015-05-06 2015-05-06 A coarse classification method for document images based on SIVV features

Country Status (1)

Country Link
CN (1) CN104866822B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5563403A (en) * 1993-12-27 1996-10-08 Ricoh Co., Ltd. Method and apparatus for detection of a skew angle of a document image using a regression coefficient
CN101136981A (en) * 2006-08-24 2008-03-05 夏普株式会社 Image processing method, image processing apparatus, document reading apparatus, image forming apparatus
CN102750541A (en) * 2011-04-22 2012-10-24 北京文通科技有限公司 Document image classifying distinguishing method and device
CN102831244A (en) * 2012-09-13 2012-12-19 重庆立鼎科技有限公司 Method for classified search of house property file image
CN104036273A (en) * 2014-05-22 2014-09-10 南京信息工程大学 Fingerprint image segmentation method based on compositing window SIVV features

Also Published As

Publication number Publication date
CN104866822A (en) 2015-08-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200702

Address after: 223600 Tenth Floor of Building A of Shuyang Software Industrial Park, Suqian City, Jiangsu Province

Patentee after: Jiangsu Fenghuang Xueyi Education Technology Co.,Ltd.

Address before: 210044 Nanjing Ning Road, Jiangsu, No. six, No. 219

Patentee before: Nanjing University of Information Science and Technology

TR01 Transfer of patent right

Effective date of registration: 20200727

Address after: Room 02214, 2nd floor, building 2, No.68 yard, Beiqing Road, Haidian District, Beijing 100089

Patentee after: BEIJING PHOENIX E-LEARNING TECHNOLOGY CO.,LTD.

Address before: 210044 Nanjing Ning Road, Jiangsu, No. six, No. 219

Patentee before: Nanjing University of Information Science and Technology

CP01 Change in the name or title of a patent holder

Address after: 100089 room 02114, 2nd floor, building 2, No.68 courtyard, Beiqing Road, Haidian District, Beijing

Patentee after: Discipline network (Beijing) Co.,Ltd.

Address before: 100089 room 02114, 2nd floor, building 2, No.68 courtyard, Beiqing Road, Haidian District, Beijing

Patentee before: BEIJING PHOENIX E-LEARNING TECHNOLOGY CO.,LTD.