CN103810484A

CN103810484A - Print file identification method based on print font library analysis

Info

Publication number: CN103810484A
Application number: CN201310538041.1A
Authority: CN
Inventors: 姚勇; 王韦桦; 张东方; 郭红艳
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2013-10-29
Filing date: 2013-10-29
Publication date: 2014-05-21
Anticipated expiration: 2033-10-29
Also published as: CN103810484B

Abstract

The invention relates to a print file identification method based on print font library analysis and belongs to the technical field of printed document examination. The method extracts features from Chinese character images printed by sample printers of various models and, through a learning process, trains the samples into a feature-value library that the system uses for the same Chinese character across the printer models. These features are matched against each feature character library in turn to complete a coarse classification of the character libraries. Building on that step, Hu moment feature values characterize the character in greater depth, until the feature information uniquely identifies the Chinese character in the image. This computer-based method, which identifies the printer type from character images using common font libraries, is theoretically sound, simple in procedure, easy to operate, fast, and accurate, and is effective and workable. The method integrates several techniques and sources of useful information to improve identification accuracy and provides public security and physical evidence examination departments with an automated system for examining computer-printed documents.

Description

Print file identification method based on print font library analysis

Technical field

The present invention relates to a print file identification method based on print font library analysis, and belongs to the technical field of printed document examination.

Background technology

With the spread of office automation, printers are widely used in daily life and work. Printed documents are the principal form of written record, and the number submitted for examination has risen sharply in criminal, civil, and administrative litigation alike. Examination requests fall into three main categories: authenticating the document, tracing its source (identifying the make and model of the device), and determining when it was produced. A computer-based method for quickly determining the printer type would supply important leads for case investigation, yet no such method has been reported so far, and research material on the topic is scarce.

Because printers differ in their operating principles and in the font libraries they print with, the characters they produce should differ as well. The printed typeface can therefore be extracted and analyzed, the characters matched against font libraries, and a classifier used to identify the printer that produced the document under examination. Font recognition is thus the key technology. Among existing font recognition methods, some use wavelet analysis to extract frequency-domain features of individual Chinese characters as training samples and apply a modified quadratic classifier to recognize single characters. Others combine wavelet features based on the proportional distribution of wavelet energy with a BP neural network to achieve text-independent recognition. The recognition performance of both approaches drops markedly as the number of fonts grows. Another approach, based on texture analysis, extracts Chinese font features with Gabor filters; it is fast and has a very high recognition rate, but its feature dimension is high and its computational cost is large. A further approach performs font recognition on gray-scale images using empirical mode decomposition; its feature dimension is low (only 9), its computational cost is small, and its recognition rate is high. These methods, however, cannot extract features from a single character. Finally, there is a font recognition technique designed for single characters, which mainly uses the wavelet transform to extract character features and recognizes seven single-character Chinese fonts. Its recognition rate is high, but its feature dimension reaches 256, which seriously slows the recognizer and increases the computational cost.

Chinese characters are logographic: they are numerous, vary richly in typeface, and have complex structures, with an average stroke count more than ten times that of an English letter. They also come in many typefaces. The main printed Chinese typefaces are Songti (Song), Fangsong (imitation Song), Kaiti (regular script), Heiti (boldface), Lishu (clerical script), and Youyuan (rounded). They differ in the following ways. First, overall shape: Songti is square; Fangsong imitates the typeface of Song-dynasty editions and is slightly elongated; Kaiti resembles handwriting and is square; Heiti is square and upright; Lishu is basically square but flattened; Youyuan is rounded and slightly larger. Second, the strokes of different typefaces vary differently in thickness. Third, character size differs considerably between typefaces. Fourth, stroke decoration and orientation differ: the same basic stroke starts and ends in visibly different ways in different typefaces, and the angle at which basic strokes are drawn also differs.

Chinese characters are built by arranging and combining basic strokes: horizontal, vertical, left-falling, right-falling, dot, turn, and hook. A character can therefore be characterized by its horizontal, vertical, left-falling, and right-falling strokes. The characteristics of a typeface are likewise embodied in its strokes, so a typeface can be characterized by the same four stroke types.

Summary of the invention

The object of the invention is to identify the printer type by analyzing the font library behind the printed output, making full use of the available information while keeping the workload as low as possible. Common typefaces and font sizes are used to identify common printer types, laying the groundwork for extending the scope of the research.

To achieve this object, the technical solution of the invention is as follows.

A print file identification method based on print font library analysis comprises the following main steps. Features are extracted from Chinese character images printed by sample printers of different models. Through a learning process, the samples are trained into the feature-value library that the system uses for the same Chinese character across the different printer models; the features are the stroke features and the simplified Hu moment feature values. Statistical information about the Chinese character image, namely the total pixel count and the crossing counts, is obtained and matched against each feature character library in turn to complete a coarse classification of the character libraries, and the result serves as the set of matching candidates for the next recognition step. Building on that step, the Hu moment feature values are then used to characterize the character in greater depth, until the feature information uniquely identifies the Chinese character in the image.

The specific steps are: (1) font feature extraction and font library construction; (2) classifier design. In detail:

(1) Font feature extraction and font library construction: extracting the features of a Chinese character image means deriving, from those features, a code word that represents the image. Each code word corresponds to one Chinese character; the code words of different character images must differ, so the character a code word represents is unique. Then, through training, the same feature representation is used to build a Chinese character library for that type of feature.

(1a) Extracting the stroke feature sequence: analysis of Chinese character features shows that stroke direction cues reflect the composition of a character comprehensively, accurately, and stably. By collecting stroke statistics from the Chinese character images in the document under examination, different Chinese typefaces can be distinguished and the printer type inferred. The steps are as follows:

Step 1: divide the Chinese character image into eight equal regions and, from left to right and top to bottom, count the black pixels (pixels with value 1) in each region. The black-pixel counts of the eight regions give eight feature values.

Step 2: obtain feature values by stroke crossing. Two horizontal and two vertical crossings are used: the horizontal scan lines pass at 1/3 and 2/3 of the image, and the number of black points each line passes through is recorded; the vertical direction is handled the same way. This gives four more feature values.

Step 3: count all black pixels in the image, which gives one more feature value. Together with the eight feature values from step 1 and the four from step 2, there are 13 feature values in total.
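The three steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. The function name `stroke_features` is hypothetical, and two points are assumptions because the text does not fix them: the eight regions are taken as a grid of 2 rows by 4 columns, and each crossing value is taken as the number of black pixels on the scan line.

```python
import numpy as np

def stroke_features(img, rows=2, cols=4):
    """Return 13 stroke statistics for a binary glyph image (1 = black).

    Assumptions (not stated in the source): the eight regions form a
    rows x cols grid, and a crossing value counts black pixels on the line.
    """
    img = (np.asarray(img) > 0).astype(int)
    h, w = img.shape

    # Step 1: black-pixel counts of the 8 regions, left to right, top to bottom.
    row_edges = np.linspace(0, h, rows + 1).astype(int)
    col_edges = np.linspace(0, w, cols + 1).astype(int)
    regions = [
        int(img[row_edges[r]:row_edges[r + 1], col_edges[c]:col_edges[c + 1]].sum())
        for r in range(rows)
        for c in range(cols)
    ]

    # Step 2: scan lines at 1/3 and 2/3 of the height and of the width.
    horizontal = [int(img[h * k // 3, :].sum()) for k in (1, 2)]
    vertical = [int(img[:, w * k // 3].sum()) for k in (1, 2)]

    # Step 3: total number of black pixels.
    total = int(img.sum())

    return regions + horizontal + vertical + [total]


# Usage: a 6 x 8 image containing a single horizontal bar in row 2.
glyph = np.zeros((6, 8), dtype=int)
glyph[2, :] = 1
features = stroke_features(glyph)
```

If the crossing value is instead meant to count the number of strokes intersected, the two scan-line sums would be replaced by a count of 0-to-1 transitions along each line.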

(1b) Extracting moment features of the stroke sequence: having extracted the stroke feature sequence, we find that it can in fact be regarded as N independent, identically distributed random variables. Once the stroke sequence is available, recognition can therefore be carried out by extracting moment features of the image. Image recognition with moment invariants is an important method in pattern recognition. In statistics, moments characterize the distribution of a random variable; in mechanics, they describe the spatial distribution of mass. If a binary or gray-scale image is regarded as a two-dimensional density function, moment techniques can be applied to image analysis: moments describe the features of an image and yield features analogous to those in statistics and mechanics. In recent years the invariance of moment values computed from two- and three-dimensional images has attracted attention in the imaging community. There are many kinds of moment techniques, and they have been applied to many aspects of image classification and recognition. For each stroke feature sequence, taking feature dimension and computation speed into account, we extract the first- and second-order moments of the discrete Hu moments as feature values. In practice, images are usually processed as discrete functions, so studying moment invariants in the discrete case is the more meaningful. Let f(x, y) be a two-dimensional image function. Its (p+q)-th order origin moment and central moment are defined as:

$$ m_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} x^{p} y^{q} f(x, y) \tag{1} $$

$$ \mu_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \tag{2} $$

where $(\bar{x}, \bar{y}) = (m_{10}/m_{00},\ m_{01}/m_{00})$ is the centroid of the region. The normalized central moment, denoted $\eta_{pq}$, is defined as:

$$ \eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma} \tag{3} $$

where $\gamma = (p+q)/2 + 1$ and $p + q = 2, 3, \ldots$

From the second- and third-order normalized central moments, the following seven moment invariants can be derived. The higher the order of a central moment, the more shape detail it reflects, but the more sensitive it is to noise and the more costly it is to compute. In the discrete case M₁ still has rotational invariance, and it can be shown that the other six moment invariants have it as well. In this embodiment the invariants with lower computational cost, M₁, M₂, M₃, and M₄, are selected. The moment invariants of an image are unchanged under transformations such as rotation, translation, and uniform scaling, and M₁ to M₄ are not too costly to compute, so they are suitable as invariant parameters for recognition. We take φ₁ = M₁, φ₂ = M₂, φ₃ = M₃, φ₄ = M₄ as the first four feature quantities.

$$
\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
M_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\tag{4}
$$
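As a hedged illustration of how the four selected invariants might be computed, the sketch below evaluates the origin moments, central moments, and normalized central moments directly with NumPy and then forms M₁ to M₄ in their standard form. The function name `hu_first_four` is hypothetical, and the normalization exponent (p+q)/2 + 1 is the standard one for Hu moments, which is an assumption about what the text intends.

```python
import numpy as np

def hu_first_four(img):
    """Return the invariants M1..M4 of a binary or gray-scale image."""
    f = np.asarray(img, dtype=float)
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]   # pixel coordinate grids

    m00 = f.sum()                                # zeroth-order moment (mass)
    x_bar = (x * f).sum() / m00                  # centroid, m10 / m00
    y_bar = (y * f).sum() / m00                  # centroid, m01 / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by mu_00 ** ((p + q) / 2 + 1).
        mu = ((x - x_bar) ** p * (y - y_bar) ** q * f).sum()
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)

    m1 = n20 + n02
    m2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    m3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    m4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    return np.array([m1, m2, m3, m4])


# Usage: an L-shaped test glyph, then the same glyph rotated by 90 degrees.
glyph = np.zeros((8, 8))
glyph[1:7, 2] = 1
glyph[6, 2:6] = 1
original = hu_first_four(glyph)
rotated = hu_first_four(np.rot90(glyph))
```

For a 90-degree rotation or a whole-pixel shift the pixel grid maps onto itself, so the four values agree up to floating-point error; for arbitrary angles or scale factors, resampling makes the invariance only approximate.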

(1c) Building the standard font library:

Features are extracted from Chinese character images printed by sample printers of different models and, through a learning process, the samples are trained into the feature-value library that the system uses for the same Chinese character across the printer models. The most commonly used standard Chinese characters are taken as the objects; the typefaces are the common Songti, Fangsong, Kaiti, Heiti, Lishu, and Youyuan, and the font sizes range from No. 1 to No. 6. The simplified Hu moment feature values are chosen. A Chinese character to be recognized is encoded in two stages. First, statistical information about the character image (total pixel count and crossing counts) is obtained and matched against each feature character library in turn to complete the coarse classification of the character libraries; the result serves as the set of matching candidates for the next step. Second, building on the first stage, the Hu moment feature values characterize the character in greater depth, until the feature information uniquely identifies the Chinese character in the image.

(2) Classifier design: the classifier determines the printer type by comparing the feature values of the characters under examination with those in the standard font library. Constrained by many factors, large-character-set recognition still commonly relies on a minimum distance classifier. A two-stage strategy of coarse and fine classification, driven by confidence analysis, is used to decide the class of the Chinese character to be recognized.

(2a) Coarse classification: the purpose of coarse classification is to quickly select, from a large character set, a comparatively small subset of candidate characters while keeping the probability that the candidate set contains the correct class as high as possible. This requires a coarse classifier that is structurally simple and fast. We therefore designed a Euclidean distance classifier. Let $M_i$ be the i-th Hu moment feature value of the font to be recognized and $M_i^k$ the mean of the i-th standard Hu moment feature for the k-th font. The font to be recognized is assigned to the $k_0$-th font when the following condition holds, where G is the number of font classes:

$$ k_0 = \underset{1 \le k \le G}{\arg\min} \sum_{i=1}^{4} (M_i - M_i^k)^2 \tag{5} $$
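Equation (5) amounts to a nearest-mean search over the G font classes. The sketch below is one possible reading of it: the function name `coarse_classify` and the `n_candidates` parameter are hypothetical, and returning a ranked candidate list rather than only the single best class is an assumption made so that the output can feed a later fine-classification stage.

```python
import numpy as np

def coarse_classify(features, class_means, n_candidates=3):
    """Rank font classes by squared Euclidean distance to their mean features.

    features:     the 4 Hu moment values of the character to recognize.
    class_means:  array of shape (G, 4), one mean feature vector per font.
    Returns a list of (class_index, squared_distance), nearest first.
    """
    features = np.asarray(features, dtype=float)
    means = np.asarray(class_means, dtype=float)
    dists = ((means - features) ** 2).sum(axis=1)   # the sum in equation (5)
    order = np.argsort(dists)[:n_candidates]
    return [(int(k), float(dists[k])) for k in order]


# Usage with three made-up font classes (illustrative numbers only).
means = [[0, 0, 0, 0], [1, 1, 1, 1], [3, 3, 3, 3]]
candidates = coarse_classify([0.9, 1.0, 1.0, 1.2], means)
best_class = candidates[0][0]
```

The first entry of the returned list corresponds to k₀ in equation (5).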

(2b) Fine classification: the Bayes classifier is the theoretically optimal statistical classifier, and in practical problems one tries to approximate it as closely as possible. When the character features are Gaussian distributed and the classes have equal prior probabilities, the Bayes classifier reduces to the Mahalanobis distance classifier. In practice, however, this condition is rarely met, and the performance of the Mahalanobis distance classifier degrades severely as the estimation error of the covariance matrix grows. We use the modified quadratic discriminant function (MQDF) as the fine-classification metric. It is a variant of the Mahalanobis distance, with the functional form:

$$ g_j(X) = \frac{1}{h^2} \left\{ \sum_{i=1}^{d} (x_i - m_{ij})^2 - \sum_{i=1}^{K} \left(1 - \frac{h^2}{\lambda_{ij}}\right) \left[(X - M_j)^T \varphi_{ij}\right]^2 \right\} + \ln\left( h^{2(d-K)} \prod_{i=1}^{K} \lambda_{ij} \right) \tag{6} $$

where X = (x₁, …, x_d) is the d-dimensional feature vector and M_j = (m_{1j}, …, m_{dj}) is the mean vector of class j; λ_ij and φ_ij are the i-th eigenvalue and eigenvector of the covariance matrix of the class-j samples; K is the number of principal eigenvectors retained, that is, the dimension of the principal subspace of the pattern class, whose optimal value is determined experimentally; and h² is an experimental estimate of the minor eigenvalues. MQDF produces a quadratic decision surface. Because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative effect of estimation error in the minor eigenvalues is avoided. The MQDF distance can be viewed as a weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d − K)-dimensional space, with weighting factor 1/h².
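A minimal sketch of the MQDF distance in its standard form follows, assuming the class statistics are estimated from a sample matrix with NumPy. The names `mqdf_fit` and `mqdf_distance` are hypothetical, and the value of `h2` is simply passed in, since the text says it is estimated experimentally.

```python
import numpy as np

def mqdf_fit(samples, K, h2):
    """Estimate one class model: mean, top-K eigenpairs, and the constant h2."""
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:K]           # indices of the K largest
    return mean, eigvals[top], eigvecs[:, top], h2

def mqdf_distance(x, model):
    """Evaluate the MQDF discriminant of x for one class (smaller is better)."""
    mean, lam, phi, h2 = model
    diff = np.asarray(x, dtype=float) - mean
    d, K = diff.size, lam.size
    proj = phi.T @ diff                           # projections on principal axes
    quad = (diff @ diff - ((1 - h2 / lam) * proj ** 2).sum()) / h2
    return quad + (d - K) * np.log(h2) + np.log(lam).sum()


# Usage on synthetic data (illustrative only, not the patent's samples).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])
model = mqdf_fit(train, K=2, h2=0.25)
score = mqdf_distance([1.0, 2.0, 3.0], model)
```

When K equals d, the expression reduces to the Mahalanobis distance plus the log-determinant of the covariance matrix, which is a convenient sanity check on an implementation.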

(2c) Confidence calculation: let the output candidate set of the coarse classifier be {(c₁, d₁), (c₂, d₂), …, (c_n, d_n)}, listed in order of increasing distance, where n is the size of the candidate set and c_k and d_k are the k-th candidate character and its coarse-classification distance.

The fine classifier re-ranks the coarse candidate set according to recomputed distances to find the most likely class of the input character. If the coarse result is sufficiently reliable, in other words if c₁ is the correct class of the input character, fine classification is unnecessary. Whether fine classification is needed is decided by the confidence f_con of the coarse result, which uses the output distances as its measure and is computed as:

$$ f_{con} = (d_2 - d_1) / d_1 \tag{7} $$

When the confidence is below a given threshold, the coarse candidate set is passed to the fine classifier; otherwise the coarse result is output directly.
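The gating rule of equation (7) can be sketched as below. The function names, the `fine_distance` callback, and the handling of a zero best distance are assumptions made for illustration; the source gives no threshold value, so it is left as a parameter.

```python
def rough_confidence(candidates):
    """Confidence (d2 - d1) / d1 of the coarse result, as in equation (7).

    candidates: list of (label, distance) pairs with at least two entries.
    """
    ranked = sorted(candidates, key=lambda c: c[1])
    d1, d2 = ranked[0][1], ranked[1][1]
    if d1 == 0:
        # Assumption: an exact match is treated as fully confident.
        return float("inf")
    return (d2 - d1) / d1

def two_stage_decision(candidates, fine_distance, threshold):
    """Return the coarse winner if confident, else re-rank with fine_distance."""
    ranked = sorted(candidates, key=lambda c: c[1])
    if rough_confidence(ranked) >= threshold:
        return ranked[0][0]
    # Low confidence: let the fine classifier score every coarse candidate.
    return min((label for label, _ in ranked), key=fine_distance)


# Usage with made-up distances (illustrative only).
fine_scores = {"Songti": 5.0, "Kaiti": 2.0}
clear = two_stage_decision([("Songti", 1.0), ("Kaiti", 3.0)], fine_scores.get, 1.0)
close = two_stage_decision([("Songti", 1.0), ("Kaiti", 1.1)], fine_scores.get, 1.0)
```

In the first call the gap between the two best distances is large, so the coarse winner is returned without consulting the fine scores; in the second the gap is small, so the candidate with the smaller fine score is chosen.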

The beneficial effects of the invention are as follows. The invention proposes a computer-based method for recognizing character images that identifies the printer type using common font libraries. The method is theoretically sound, simple in procedure, easy to operate, fast, and accurate, and is effective and feasible. Although the invention is limited to printers, the method can be extended to devices such as fax machines and copiers and has broad application prospects. In addition, printed documents also carry text-independent features such as toner accumulation, edge roughness, and toner spatter. Methods based on such features are not restricted by the printed character content and apply more widely, so to improve identification accuracy further, more text-independent features should be used and multiple features combined. The invention integrates multiple techniques and useful information to improve identification accuracy and provides public security and physical evidence examination departments with an automated system for examining computer-printed documents.

Brief description of the drawings

Fig. 1 is a block diagram of the intelligent printer identification structure in an embodiment of the invention.

Fig. 2 is a schematic diagram of obtaining feature values by horizontal stroke crossing in an embodiment of the invention.

Fig. 3 is a schematic diagram of obtaining feature values by vertical stroke crossing in an embodiment of the invention.

Detailed description

Specific embodiments of the invention are described below with reference to the drawings, for a better understanding of the invention.

Embodiment

The block diagram of intelligent printer identification in this embodiment is shown in Fig. 1. The main steps are as follows. Features are extracted from Chinese character images printed by sample printers of different models. Through a learning process, the samples are trained into the feature-value library that the system uses for the same Chinese character across the different printer models; the features are the stroke features and the simplified Hu moment feature values. Statistical information about the Chinese character image, namely the total pixel count and the crossing counts, is obtained and matched against each feature character library in turn to complete a coarse classification of the character libraries, and the result serves as the set of matching candidates for the next recognition step. Building on that step, the Hu moment feature values are then used to characterize the character in greater depth, until the feature information uniquely identifies the Chinese character in the image.

The specific steps are: (1) font feature extraction and font library construction; (2) classifier design. In detail:

Step 2: obtain feature values by stroke crossing. Two horizontal and two vertical crossings are used: the horizontal scan lines pass at 1/3 and 2/3 of the image, and the number of black points each line passes through is recorded; the vertical direction is handled the same way. This gives four more feature values, as shown in Fig. 2 and Fig. 3.

Equations (1) to (4), that is, the origin moments, central moments, normalized central moments, and the seven moment invariants, are the same as in the summary of the invention.

(1c) Building the standard font library:

(2) Classifier design:

In this embodiment, the classifier determines the printer type by comparing the feature values of the characters under examination with those in the standard font library. Constrained by many factors, large-character-set recognition still commonly relies on a minimum distance classifier. This embodiment uses a two-stage strategy of coarse and fine classification, driven by confidence analysis, to decide the class of the Chinese character to be recognized.

Equations (5) to (7), that is, the coarse-classification rule, the MQDF discriminant, and the confidence measure, are the same as in the summary of the invention.

Results of this embodiment:

To verify the effectiveness of the method, we chose six common Chinese typefaces: Songti (Song), Kaiti (regular script), Heiti (boldface), Fangsong (imitation Song), Lishu (clerical script), and Youyuan (rounded). Each typeface has four styles (regular, bold, italic, and bold italic), giving 24 fonts in total. The training and test samples fall into two classes: document images generated by computer and document images obtained with a scanner. The computer-generated images were produced with Photoshop CS4.0 at a resolution of 72 pixels/inch in grayscale mode. The scanned images were obtained with an HP scanner at a scanning resolution of 96 dpi in grayscale mode. The number of sample sets used for training and testing for each typeface is listed in Table 1; each set contains the 3755 GB level-1 Chinese characters. We recognized the fonts with the two classifiers, using sequence lengths of 30000 and 50000. The results are given in Tables 2 and 3, where rows give the true class of the samples and columns the recognition result.

Table 1. Number of sample sets used for training and testing

	Songti	Kaiti	Heiti	Fangsong	Lishu	Youyuan
Training	180	60	100	50	70	90
Test	20	35	40	20	10	30

Table 2. Source printer identification results for printed documents (sequence length 30000)

Table 3. Source printer identification results for printed documents (sequence length 50000)

In this embodiment, the proposed method shows a certain advantage in overall recognition results. All experiments here rely on the average statistical characteristics of the random distribution of stroke features: the longer the sequence, the better these statistics. Once the sequence length reaches a certain value, however, the recognition rate changes little. In addition, the Euclidean distance classifier performs considerably worse than the MQDF classifier, because the reference pattern of each class is simply the mean of that class's samples and information such as the variance is not used. The MQDF classifier emphasizes the principal information by truncating the minor eigenvalues, which raises the recognition rate substantially.

The above is the preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also regarded as falling within the scope of protection of the invention.

Claims

1. A print file identification method based on print font library analysis, characterized in that the main steps comprise: extracting features from Chinese character images printed by sample printers of different models; through a learning process, training the samples into the feature-value library that the system uses for the same Chinese character across the different printer models, the features being the stroke features and the simplified Hu moment feature values; obtaining statistical information about the Chinese character image, namely the total pixel count and the crossing counts, and matching it against each feature character library in turn to complete a coarse classification of the character libraries, the result serving as the set of matching candidates for the next recognition step; then, building on the previous step, using the Hu moment feature values to characterize the character in greater depth, until the feature information uniquely identifies the Chinese character in the image;

The detailed process is:

(1) Font feature extraction and font library construction: extracting the features of a Chinese character image means deriving, from those features, a code word that represents the image, each code word corresponding to one Chinese character; through training, the same feature representation is used to build a Chinese character library for that type of feature, the steps being:

Step 1: divide the Chinese character image into eight equal regions and, from left to right and top to bottom, count the black pixels (pixels with value 1) in each region, so that the black-pixel counts of the eight regions give eight feature values;

Step 2: obtain feature values by stroke crossing, using two horizontal and two vertical crossings; the horizontal scan lines pass at 1/3 and 2/3 of the image and the number of black points each line passes through is recorded, and the vertical direction is handled the same way, giving four more feature values;

Step 3: count all black pixels in the image, which gives one more feature value; together with the eight feature values from step 1 and the four from step 2, there are 13 feature values in total;

(1b) Extracting moment features of the stroke sequence: for each stroke feature sequence, taking feature dimension and computation speed into account, the first- and second-order moments of the discrete Hu moments are extracted as feature values; images are processed as discrete functions; let f(x, y) be a two-dimensional image function; its (p+q)-th order origin moment and central moment are defined as:

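$$ m_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} x^{p} y^{q} f(x, y) \tag{1} $$

$$ \mu_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \tag{2} $$

where $(\bar{x}, \bar{y}) = (m_{10}/m_{00},\ m_{01}/m_{00})$ is the centroid of the region, and the normalized central moment $\eta_{pq}$ is defined as:

$$ \eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma} \tag{3} $$

where $\gamma = (p+q)/2 + 1$ and $p + q = 2, 3, \ldots$;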

From the second- and third-order normalized central moments, seven moment invariants can be derived; the higher the order of a central moment, the more shape detail it reflects, but the more sensitive it is to noise and the more costly it is to compute, and in the discrete case only M₁ still has rotational invariance; the invariants with lower computational cost, M₁, M₂, M₃, and M₄, are selected; the moment invariants of an image are unchanged under transformations such as rotation, translation, and uniform scaling, and M₁ to M₄ are not too costly to compute, so they are suitable as invariant parameters for recognition; φ₁ = M₁, φ₂ = M₂, φ₃ = M₃, φ₄ = M₄ are taken as the first four feature quantities:

$$
\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
M_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\tag{4}
$$

(1c) Establishing the standard character library: features are extracted from Chinese character images produced by sample printers of different models and, through a learning process, the samples are trained into the feature value library used by the system, in which the same Chinese character has a corresponding entry for each printer model. The most commonly used standard Chinese characters are taken as objects; the fonts are the commonly used Song, Fangsong (imitation Song), Kai (regular script), Hei (boldface), Li (clerical script) and Youyuan typefaces, and the font sizes range from No. 1 to No. 6. Simplified Hu moment feature values are chosen. For a Chinese character to be identified, a two-stage coding scheme is adopted: first, statistics such as the total pixel count and the number of hits of the character image are obtained and matched in turn against the feature Chinese character library, which completes the coarse classification over the library, and the result serves as the set of match candidates for the next identification step; second, on the basis of the previous step, the Hu moment feature values are used to characterize the character in greater depth, until the characterized feature information uniquely identifies the Chinese character in the image;
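The two-stage lookup described in step (1c) can be illustrated with the following sketch. It rests on several assumptions that are not in the text: library entries are modelled as dictionaries with hypothetical keys `label`, `pixels`, `hits` and `hu`, the first stage accepts an entry when both statistics lie within a relative tolerance `rel_tol`, and the second stage ranks the survivors by squared Euclidean distance between Hu feature values. All data values are made up.

```python
def coarse_filter(library, pixels, hits, rel_tol=0.10):
    """Stage 1: keep entries whose simple statistics are close to the query's."""
    def close(query, ref):
        return abs(query - ref) <= rel_tol * abs(ref)
    return [e for e in library
            if close(pixels, e["pixels"]) and close(hits, e["hits"])]


def refine_by_hu(candidates, hu):
    """Stage 2: among the survivors, pick the entry with the nearest Hu features."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(hu, entry["hu"]))
    return min(candidates, key=sq_dist, default=None)


# Hypothetical library entries (illustrative numbers only).
library = [
    {"label": "Song", "pixels": 400, "hits": 12, "hu": (0.20, 0.010)},
    {"label": "Hei",  "pixels": 620, "hits": 12, "hu": (0.25, 0.020)},
    {"label": "Kai",  "pixels": 410, "hits": 11, "hu": (0.30, 0.015)},
]

survivors = coarse_filter(library, pixels=405, hits=12)
best = refine_by_hu(survivors, hu=(0.29, 0.016))
print([e["label"] for e in survivors], best["label"])
```

The cheap statistics prune the library before the more expensive moment comparison, which is the motivation the text gives for coding in two stages.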

(2) Classifier design: the printer type of a document is identified by comparing the feature values of the characters under examination with those in the standard character library. Given the various constraints that apply when handling a large-character-set recognition problem, a minimum distance classifier is selected, and a two-stage coarse and fine classification strategy based on confidence analysis is adopted to decide the class to which the Chinese character to be identified belongs:

(2a) Coarse classification: a Euclidean distance classifier is designed. Let M_i be the i-th Hu moment feature value of the font to be identified, and let M_i^k be the mean of the i-th standard Hu moment feature of the k-th font. When the following condition is satisfied, the font to be identified is regarded as the k₀-th font, where G is the number of font classes:

$$ k_{0} = \underset{1 \leq k \leq G}{\arg\min} \sum_{i=1}^{4} \left(M_{i} - M_{i}^{k}\right)^{2} \tag{5} $$
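Formula (5) amounts to a nearest-mean rule over the four Hu feature values. A minimal sketch follows, assuming the per-font means are held in a dictionary keyed by font name; the function name `nearest_font` and all numeric values are hypothetical.

```python
def nearest_font(features, font_means):
    """Return the font k0 that minimises sum_i (M_i - M_i^k)^2."""
    def sq_dist(font):
        return sum((m - mk) ** 2 for m, mk in zip(features, font_means[font]))
    return min(font_means, key=sq_dist)


# Hypothetical standard means (M1..M4) for G = 3 fonts; illustrative only.
font_means = {
    "Song": (0.20, 0.010, 0.0010, 0.0020),
    "Hei":  (0.26, 0.020, 0.0030, 0.0010),
    "Kai":  (0.31, 0.015, 0.0020, 0.0040),
}

print(nearest_font((0.25, 0.019, 0.0030, 0.0012), font_means))
```

Because the four invariants can differ by orders of magnitude, a practical system might rescale them before taking distances; the formula as given applies no such scaling, and neither does this sketch.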

(2b) Fine classification: the modified quadratic discriminant function (MQDF) is adopted as the fine classification measure. It is a variant of the Mahalanobis distance, and its functional form is:

$$ g_{j}(X) = \frac{1}{h^{2}} \left\{ \sum_{i=1}^{d} (x_{i} - m_{ij})^{2} - \sum_{i=1}^{K} \left(1 - \frac{h^{2}}{\lambda_{ij}}\right) \left[(X - M_{j})^{T} \varphi_{ij}\right]^{2} \right\} + \ln\left(h^{2(d-K)} \prod_{i=1}^{K} \lambda_{ij}\right) \tag{6} $$

where λ_ij and φ_ij are, respectively, the i-th eigenvalue and the i-th eigenvector of the covariance matrix of the class-j samples; K is the number of principal eigenvectors retained, i.e. the dimension of the principal subspace of the pattern class, whose optimal value is determined by experiment; and h² is an experimentally estimated value for the minor eigenvalues. The MQDF produces a quadratic decision surface. Because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative effect of estimation errors in the minor eigenvalues is avoided. The MQDF discriminant distance can be regarded as a weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d-K)-dimensional space, with weighting factor 1/h²;
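The discriminant can be evaluated directly once the class mean and the leading K eigenpairs of the class covariance matrix are available. The sketch below assumes these are supplied by the caller (the eigendecomposition itself is not shown) and that the eigenvectors are orthonormal. It uses the commonly cited form of the MQDF, in which the factor (1 - h²/λ) is not squared and the logarithmic term is added outside the braces; this form matches the weighted-sum interpretation given in the text. The function name `mqdf` and the numeric values are hypothetical.

```python
import math


def mqdf(x, mean, eigvals, eigvecs, h2):
    """Evaluate the MQDF distance of x to one class.

    eigvals / eigvecs: the K largest eigenvalues and matching orthonormal
    eigenvectors of the class covariance matrix; h2: the constant that
    replaces the remaining (d - K) minor eigenvalues.
    """
    d, k = len(x), len(eigvals)
    diff = [a - b for a, b in zip(x, mean)]
    sq_norm = sum(v * v for v in diff)              # squared Euclidean distance to the mean

    correction = 0.0
    for lam, phi in zip(eigvals, eigvecs):
        proj = sum(v * p for v, p in zip(diff, phi))  # projection onto eigenvector phi
        correction += (1.0 - h2 / lam) * proj * proj

    log_term = (d - k) * math.log(h2) + sum(math.log(lam) for lam in eigvals)
    return (sq_norm - correction) / h2 + log_term


# Illustrative 2-D class with one retained principal axis (K = 1).
g = mqdf(x=(3.0, 2.0), mean=(1.0, 1.0),
         eigvals=[4.0], eigvecs=[(1.0, 0.0)], h2=0.5)
print(g)
```

In this example the offset from the mean is (2, 1): the component along the principal axis contributes 2²/4 = 1 as a Mahalanobis term, and the residual component contributes 1²/0.5 = 2 as a Euclidean term weighted by 1/h². The class with the smallest value of the function is selected.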

(2c) Confidence calculation: let the output candidate set of the coarse classifier be {(c₁, d₁), (c₂, d₂), …, (c_n, d_n)}, where n is the size of the candidate set and c_n and d_n are a candidate character and its corresponding coarse classification distance. If the reliability of the coarse classification result is high enough, in other words if c₁ is the correct class of the input character, fine classification need not be performed. Whether fine classification is required is decided by the magnitude of the confidence f_con of the coarse classification result; taking the output distances as the measure, the confidence is computed as follows:

$$ f_{con} = \frac{d_{2} - d_{1}}{d_{1}} \tag{7} $$

When the confidence is below the set threshold, the coarse classification candidate set is passed to the fine classifier for processing; otherwise the coarse classification result is output directly.
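The confidence gate between the two classifiers can be sketched as below. Several details are assumptions rather than statements from the text: the candidate set is sorted by ascending distance so that d₁ is the smallest, an exact match (d₁ = 0) or a single candidate is treated as fully confident, and the fine classifier is passed in as a callable. The names `classify_with_confidence` and `fine_classifier`, and the threshold value, are hypothetical.

```python
def classify_with_confidence(candidates, threshold, fine_classifier):
    """Return the coarse result if f_con >= threshold, else defer to the fine classifier.

    candidates: list of (character, coarse_distance) pairs from the coarse classifier.
    """
    ranked = sorted(candidates, key=lambda cd: cd[1])   # ascending distance
    c1, d1 = ranked[0]
    if len(ranked) < 2 or d1 == 0:
        return c1                       # ASSUMPTION: nothing to compare, or exact match
    d2 = ranked[1][1]
    f_con = (d2 - d1) / d1              # formula (7)
    if f_con < threshold:
        return fine_classifier(ranked)  # low confidence: run the fine stage
    return c1                           # high confidence: output the coarse result


# Stand-in fine classifier for illustration; a real one would re-rank with the MQDF.
def fine_classifier(ranked):
    return "fine:" + ranked[0][0]


print(classify_with_confidence([("A", 1.0), ("B", 3.0)], 0.5, fine_classifier))
print(classify_with_confidence([("A", 1.0), ("B", 1.1)], 0.5, fine_classifier))
```

In the first call the runner-up is three times as far away as the best candidate, so the coarse result is accepted; in the second the two distances are nearly equal, so the candidate set goes on to the fine stage.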