CN103810484A - Print file identification method based on print font library analysis - Google Patents

Publication number
CN103810484A
Authority
CN
China
Legal status
Granted
Application number
CN201310538041.1A
Other languages
Chinese (zh)
Other versions
CN103810484B (en
Inventor
姚勇
王韦桦
张东方
郭红艳
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN201310538041.1A
Publication of CN103810484A
Application granted
Publication of CN103810484B
Active

Abstract

The invention relates to a print file identification method based on print font library analysis and belongs to the technical field of printed-document examination. The method extracts features from Chinese character images produced by sample printers of different models and, through a learning process, trains the samples into a library of feature values for the same Chinese characters as printed by each printer model used by the system. A character under examination is matched against the feature libraries in turn to complete a coarse classification; on that basis, Hu moment feature values characterize the character in more depth until the feature information uniquely identifies the Chinese character in the image. This computer method for identifying printer type from character images in commonly used font libraries is theoretically clear, simple in process, convenient to operate, fast and accurate, and is effective and workable. It combines several techniques and sources of useful information to improve identification accuracy, and provides an automated computer system for examining printed documents for public security and physical evidence examination departments.

Description

Print file identification method based on print font library analysis
Technical field
The present invention relates to a print file identification method based on print font library analysis, and belongs to the technical field of printed-document examination.
Background
With the spread of office automation, printers are widely used in daily life and work. Printed documents are the principal form of written record, and the number submitted for examination has risen sharply in criminal, civil, and administrative litigation alike. Examination requests fall mainly into three areas: authenticity of the document, source examination (identification of the originating device), and determination of the time of formation. Among these, a computer-based method for quickly determining the printer type would provide important leads for case investigation. No such method has been reported so far, and research material in this area is very scarce.
Because printers differ in working principle and in the font libraries they print with, the printed characters differ as well. Printed glyphs can therefore be extracted and analyzed, matched against font libraries, and passed to a classifier that identifies the printer that produced the document under investigation. Font recognition is thus the key technique. Among existing font recognition techniques, some use wavelet analysis to extract frequency-domain features of individual Chinese characters as training samples and apply a modified quadratic classifier to recognize single characters. Others use wavelet analysis based on the proportion of wavelet energy distribution, together with a BP neural network, to achieve text-independent recognition. The recognition performance of these methods drops markedly as the number of fonts increases. Another approach is based on texture analysis and uses Gabor filters to extract Chinese font features; it is fast and achieves a high recognition rate, but its feature dimension is high and its computational cost is large. There is also font recognition based on empirical mode decomposition of gray-level images, which has a low feature dimension (only 9), a small computational cost, and a high recognition rate. These methods, however, cannot be applied to feature extraction from a single character. Finally, there are font recognition techniques aimed at single characters, which mainly use the wavelet transform to extract character features and recognize seven single-character Chinese fonts. Their recognition rate is high, but the feature dimension reaches 256, which seriously slows the recognizer and increases the computational cost.
Chinese characters are pictographic: they are numerous, rich in font variation, and structurally complex, with an average stroke count more than ten times that of the letters of the English alphabet. Many fonts exist; the main printed Chinese fonts are SongTi (Song), FangSong (imitation Song), KaiTi (regular script), HeiTi (boldface), LiShu (clerical script), and YouYuan (rounded). They differ in several ways. The overall shape differs: SongTi characters are square; FangSong imitates the type of Song-dynasty printed books and is slightly elongated; KaiTi resembles handwriting and is square; HeiTi is square and upright; LiShu has a basically square structure but a flattened body; YouYuan is rounded and slightly larger. The variation in stroke thickness differs between fonts. The size of individual characters differs considerably. Stroke decoration and orientation also differ: the same basic stroke looks clearly different at its starting point and ending point in different fonts, and the angle at which basic strokes are written differs as well.
Chinese characters are formed by arranging and combining basic strokes: horizontal, vertical, left-falling, right-falling, dot, turning, and hook. A Chinese character can therefore be characterized by its horizontal, vertical, left-falling, and right-falling strokes. The features of a font are likewise embodied in its strokes, so a font can be characterized in the same way.
Summary of the invention
The object of the invention is to identify the printer type by analyzing the font library of the printed output. With the workload kept as small as possible and the available information used as fully as possible, commonly used fonts and font sizes are used to determine commonly used printer types, laying the groundwork for widening the scope of research.
To achieve this object, the technical solution of the invention is as follows.
A print file identification method based on print font library analysis, whose main steps are as follows. Features are extracted from Chinese character images produced by sample printers of different models and, through a learning process, the samples are trained into a library of feature values for the same Chinese characters as printed by each printer model used by the system; the feature values are stroke features and simplified Hu moment features. Statistical information about the Chinese character image (total pixel count and intersection counts) is obtained and matched against the feature libraries in turn to complete a coarse classification, whose result serves as the set of matching candidates for the next recognition step. Then, on the basis of the previous step, Hu moment feature values characterize the character in more depth, until the feature information uniquely identifies the Chinese character in the image.
The specific steps are: (1) font feature extraction and font library construction; (2) classifier design. In detail:
(1) Font feature extraction and font library construction: extracting the features of a Chinese character image means deriving, from the features of the image, a code word that represents it. Each code word corresponds to one Chinese character; the code words of different character images must differ, so that the character a code word represents is unique. Then, through training, a Chinese character library for each printer type is built using the same feature representation.
(1a) Extracting the stroke feature sequence: analysis of Chinese character features shows that stroke direction information reflects the composition of a character comprehensively, accurately, and stably. By collecting stroke statistics from the Chinese character images in the document under examination, different Chinese fonts can be distinguished and the printer type they belong to inferred. The specific steps are as follows:
Step 1: divide the Chinese character image into eight equal regions and count the black pixels (pixels with value 1) in each region in order, from left to right and from top to bottom. The black-pixel counts of the eight regions give eight feature values.
Step 2: obtain feature values by stroke crossing. Two horizontal and two vertical scan lines are used. The horizontal lines cross the image at 1/3 and 2/3 of its extent, and the number of black pixels each line passes through is recorded; the vertical lines are handled in the same way. This gives four more feature values.
Step 3: count all black pixels in the image, which gives one more feature value. Together with the eight values from Step 1 and the four from Step 2, this yields 13 feature values in total.
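The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the text does not say how the eight regions are laid out, so a 2-row by 4-column grid is assumed here, and the "crossing" value is taken to be the number of black pixels lying on each scan line. The function names `region_counts`, `crossing_counts`, and `stroke_features` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = black (ink), 0 = white


def region_counts(img: Image, rows: int = 2, cols: int = 4) -> List[int]:
    """Step 1: black-pixel count per region, left to right, top to bottom.

    The 2 x 4 grid is an assumption; the source only says "eight regions".
    """
    h, w = len(img), len(img[0])
    counts = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            counts.append(sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1)))
    return counts


def crossing_counts(img: Image) -> List[int]:
    """Step 2: black pixels on scan lines at 1/3 and 2/3, horizontal then vertical."""
    h, w = len(img), len(img[0])
    horizontal = [sum(img[y]) for y in (h // 3, 2 * h // 3)]
    vertical = [sum(img[y][x] for y in range(h)) for x in (w // 3, 2 * w // 3)]
    return horizontal + vertical


def stroke_features(img: Image) -> List[int]:
    """Steps 1 to 3 combined: 8 region counts + 4 crossing counts + 1 total = 13."""
    total = sum(sum(row) for row in img)
    return region_counts(img) + crossing_counts(img) + [total]
```

For a 6 x 8 test image containing one full horizontal stroke and one full vertical stroke, `stroke_features` returns a list of 13 integers whose last entry is the total number of black pixels.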
(1b) Extracting moment features of the stroke sequence: the stroke feature sequence extracted above can in fact be regarded as N independent, identically distributed random variables. Once the stroke sequence is available, recognition can therefore be carried out by extracting moment features of the image. Image recognition using moment invariants is an important method in pattern recognition. In statistics, moments characterize the distribution of a random variable; in mechanics, they represent the spatial distribution of mass. If a binary or gray-level image is regarded as a two-dimensional density function, moment techniques can be applied to image analysis, and moments can then describe an image with features analogous to those used in statistics and mechanics. In recent years the invariance of moment values computed from two- and three-dimensional images has attracted attention in the image processing community. Many types of moments exist and have been applied to many aspects of image classification and recognition. For each stroke feature sequence, taking feature dimension and computation speed into account, we extract the first- and second-order moments of the discrete Hu moments as feature values. Because practical image processing usually works with discrete functions, studying invariant moments in the discrete case is the more meaningful. Let f(x, y) be a two-dimensional image function of size M × N. Its (p+q)-th order origin moment and central moment are defined as:

$$ m_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} x^{p} y^{q} f(x, y) \tag{1} $$

$$ \mu_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \tag{2} $$

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$ are the centroid coordinates of the region. The normalized central moment, denoted $\eta_{pq}$, is defined as:

$$ \eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma} \tag{3} $$

where $\gamma = (p+q)/2$, $p+q = 2, 3, \ldots$
From the second- and third-order normalized central moments, the following seven invariant moments can be derived. The higher the order of a central moment, the more shape detail it reflects, but the more sensitive it is to noise and the more costly it is to compute. In the discrete case $M_1$ remains rotation invariant, and it can be shown that the other six invariant moments are rotation invariant as well. In this embodiment the invariants $M_1$, $M_2$, $M_3$, $M_4$, which are comparatively cheap to compute, are selected. The invariant moments of an image do not change when the image undergoes a geometric transformation such as rotation, translation, or uniform scaling, and since $M_1$ to $M_4$ are not too expensive to compute, they are suitable as invariant parameters for recognition. We take $\varphi_1 = M_1$, $\varphi_2 = M_2$, $\varphi_3 = M_3$, $\varphi_4 = M_4$ as the first four feature quantities.

$$
\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
M_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\tag{4}
$$
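As an illustration of equations (1) to (4), the sketch below computes the first four invariants for a small binary image using only the Python standard library. It is an assumption-laden example, not the patent's code: the function names are hypothetical, and the normalization exponent used is the commonly published one, (p+q)/2 + 1, which differs from the exponent as printed in the text above.

```python
from typing import List, Tuple

Image = List[List[float]]  # img[row][col]; pixel value is f(x, y)


def raw_moment(img: Image, p: int, q: int) -> float:
    """Origin moment m_pq, with x = column index and y = row index, both from 1."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img, start=1)
               for x, v in enumerate(row, start=1))


def central_moment(img: Image, p: int, q: int, xc: float, yc: float) -> float:
    """Central moment mu_pq about the centroid (xc, yc)."""
    return sum(((x - xc) ** p) * ((y - yc) ** q) * v
               for y, row in enumerate(img, start=1)
               for x, v in enumerate(row, start=1))


def hu_first_four(img: Image) -> Tuple[float, float, float, float]:
    """Return (M1, M2, M3, M4) for an image with at least one non-zero pixel."""
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00
    yc = raw_moment(img, 0, 1) / m00

    def eta(p: int, q: int) -> float:
        # Assumption: exponent (p+q)/2 + 1, the usual choice for scale invariance.
        gamma = (p + q) / 2 + 1
        return central_moment(img, p, q, xc, yc) / (m00 ** gamma)  # mu_00 == m_00

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    m1 = n20 + n02
    m2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    m3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    m4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    return m1, m2, m3, m4
```

Shifting a shape inside the image, or turning it by 90 degrees, leaves the four returned values unchanged up to floating-point rounding, which is the property the text relies on.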
(1c) Building the standard font library:
Features are extracted from Chinese character images produced by sample printers of different models and, through a learning process, the samples are trained into a library of feature values for the same Chinese characters as printed by each printer model used by the system. The most commonly used standard Chinese characters are taken as the objects, in the common fonts SongTi, FangSong, KaiTi, HeiTi, LiShu, and YouYuan, at font sizes No. 1 through No. 6. Simplified Hu moment feature values are chosen, and a character to be identified is handled by two-level coding. First, statistical information about the Chinese character image (total pixel count and intersection counts) is obtained and matched against the feature libraries in turn to complete a coarse classification, whose result serves as the set of matching candidates for the next recognition step. Second, on the basis of the previous step, Hu moment feature values characterize the character in more depth, until the feature information uniquely identifies the Chinese character in the image.
(2) Classifier design: the classifier is designed to identify the printer type; that is, the printer that produced a document is identified by comparing the feature values of the character under examination with those in the standard font library. Given the many practical constraints, a minimum-distance classifier is still commonly chosen for large character set recognition problems. A two-stage strategy of coarse and fine classification, based on confidence analysis, is adopted to decide the class of the Chinese character to be identified.
(2a) Coarse classification: the purpose of coarse classification is to quickly select, from a large character set, a comparatively small subset of candidates while keeping the probability that the candidate set contains the correct class of the character as high as possible. This requires a coarse classifier that is simple in structure and fast. To this end we designed a Euclidean distance classifier. Let $M_i$ be the i-th Hu moment feature value of the font to be identified and $M_i^{k}$ the mean of the i-th standard Hu moment feature of the k-th font. The font to be identified is assigned to font $k_0$ when the following condition is met, where G is the number of font classes:

$$ k_0 = \arg\min_{1 \le k \le G} \left\{ \sum_{i=1}^{4} \left( M_i - M_i^{k} \right)^2 \right\} \tag{5} $$
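Equation (5) is a nearest-mean rule over four features. The sketch below ranks all font classes by squared Euclidean distance so that the same output can serve as a candidate set; the function name `coarse_classify`, the font labels, and the numeric values in the usage example are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def coarse_classify(features: Sequence[float],
                    font_means: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank font classes by squared Euclidean distance to their mean feature vectors.

    The first entry of the returned list is the class k_0 of equation (5);
    the full list is ordered from nearest to farthest.
    """
    scored = [
        (sum((f - m) ** 2 for f, m in zip(features, means)), name)
        for name, means in font_means.items()
    ]
    scored.sort()
    return [(name, dist) for dist, name in scored]


# Usage with made-up feature means for two fonts.
means = {"SongTi": [0.20, 0.02, 0.0, 0.0], "HeiTi": [0.30, 0.01, 0.0, 0.0]}
ranking = coarse_classify([0.20, 0.01, 0.0, 0.0], means)
best_font = ranking[0][0]
```

Returning the whole ranking instead of only the minimum keeps the distances of the runner-up classes available for a later confidence check.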
(2b) Fine classification: the Bayes classifier is the theoretically optimal statistical classifier, and in practical problems one tries to approximate it as closely as possible. When the character features are Gaussian distributed and the prior probabilities of all classes are equal, the Bayes classifier reduces to a Mahalanobis distance classifier. In practice, however, this condition is usually hard to satisfy, and the performance of the Mahalanobis distance classifier degrades severely as estimation errors in the covariance matrix arise. We adopt the modified quadratic discriminant function (MQDF), a variant of the Mahalanobis distance, as the fine classification measure. Its functional form is:

$$ g_j(X) = \frac{1}{h^2} \left\{ \sum_{i=1}^{d} (x_i - m_{ij})^2 - \sum_{i=1}^{K} \left( 1 - \frac{h^2}{\lambda_{ij}} \right) \left[ (X - \mathbf{m}_j)^{T} \varphi_{ij} \right]^2 \right\} + \ln \left( h^{2(d-K)} \prod_{i=1}^{K} \lambda_{ij} \right) \tag{6} $$

where $X = (x_1, \ldots, x_d)$ is the d-dimensional feature vector, $\mathbf{m}_j = (m_{1j}, \ldots, m_{dj})$ is the mean vector of class j, $\lambda_{ij}$ and $\varphi_{ij}$ are the i-th eigenvalue and eigenvector of the covariance matrix of the class-j samples, K is the number of principal eigenvectors retained (the dimension of the principal subspace of the pattern class, whose optimal value is determined experimentally), and $h^2$ is an experimental estimate of the small eigenvalues. MQDF produces a quadratic decision surface. Because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative effect of estimation errors in the small eigenvalues is avoided. The MQDF distance can be regarded as a weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d-K)-dimensional space, with weighting factor $1/h^2$.
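A small sketch of the MQDF distance for one class follows. It assumes the commonly published form of the function, in which the logarithmic term sits outside the braces, and it assumes that the class mean, the K leading eigenvalues, and the matching unit-length eigenvectors have already been estimated elsewhere. The name `mqdf` and its parameters are hypothetical.

```python
import math
from typing import Sequence


def mqdf(x: Sequence[float],
         mean: Sequence[float],
         eigvals: Sequence[float],
         eigvecs: Sequence[Sequence[float]],
         h2: float) -> float:
    """MQDF distance of feature vector x to one class (smaller means closer).

    eigvals and eigvecs hold the K leading eigenvalues and unit eigenvectors
    of the class covariance matrix; h2 replaces the remaining d - K eigenvalues.
    """
    d, k = len(x), len(eigvals)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    sq_norm = sum(v * v for v in diff)          # squared Euclidean distance
    correction = 0.0
    log_det = (d - k) * math.log(h2)            # ln h^(2(d-K))
    for lam, phi in zip(eigvals, eigvecs):
        proj = sum(a * b for a, b in zip(diff, phi))
        correction += (1.0 - h2 / lam) * proj * proj
        log_det += math.log(lam)
    return (sq_norm - correction) / h2 + log_det
```

With d = 2, K = 1, eigenvalue 4 along the first axis and h2 = 1, the point (2, 1) measured from the origin has a Mahalanobis part of 4/4 = 1 in the principal subspace and a Euclidean part of 1 in the residual space, so the function returns 2 + ln 4.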
(2c) Confidence calculation: let the candidate set output by the coarse classifier be $\{(c_1, d_1), (c_2, d_2), \ldots, (c_n, d_n)\}$, ordered by increasing distance, where n is the size of the candidate set and $c_i$ and $d_i$ are the i-th candidate character and its coarse classification distance.
The fine classifier re-ranks the coarse candidate set according to recomputed distances in order to find the most probable class of the input character. If the coarse result is sufficiently reliable, in other words if $c_1$ is already the correct class of the input character, fine classification is unnecessary. Whether to run fine classification is decided by the confidence $f_{con}$ of the coarse result, which is computed from the output distances as follows:

$$ f_{con} = (d_2 - d_1) / d_1 \tag{7} $$

When the confidence is below a given threshold, the coarse candidate set is passed to the fine classifier; otherwise the coarse classification result is output directly.
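The gating rule of equation (7) can be sketched as follows. The sketch assumes that a fine distance function (for example an MQDF score per class) is supplied by the caller; `confidence`, `two_stage_decision`, the threshold value, and the handling of the edge cases (a single candidate, or a zero best distance) are assumptions that the text does not spell out.

```python
import math
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (class label, coarse distance)


def confidence(cands: List[Candidate]) -> float:
    """f_con = (d2 - d1) / d1 for candidates sorted by increasing distance."""
    if len(cands) < 2:
        return math.inf            # assumption: a lone candidate is accepted
    d1, d2 = cands[0][1], cands[1][1]
    return math.inf if d1 == 0 else (d2 - d1) / d1


def two_stage_decision(cands: List[Candidate],
                       fine_distance: Callable[[str], float],
                       threshold: float) -> str:
    """Return the coarse winner if it is confident enough, else re-rank finely."""
    cands = sorted(cands, key=lambda c: c[1])
    if confidence(cands) >= threshold:
        return cands[0][0]
    # Low confidence: re-rank the whole candidate set with the fine classifier.
    return min(cands, key=lambda c: fine_distance(c[0]))[0]
```

A large gap between the two best coarse distances gives a high confidence and skips the more expensive fine stage; near-ties fall through to it.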
The beneficial effects of the invention are as follows. The invention proposes a computer method for identifying printer type from character images in commonly used font libraries; it is theoretically clear, simple in process, convenient to operate, fast and accurate, and is effective and feasible. Although the scope of the invention is limited to printers, the method can be extended to fax machines, photocopiers, and similar devices, and has broad application prospects. Printed documents also carry text-independent features such as toner accumulation, edge roughness, and toner splatter. Methods based on such features are not restricted by the printed character content and so apply more widely; to improve identification accuracy, more text-independent features should therefore be used and multiple features combined. The invention combines several techniques and sources of useful information to improve identification accuracy, and provides an automated computer system for examining printed documents for public security and physical evidence examination departments.
Brief description of the drawings
Fig. 1 is a structural block diagram of intelligent printer identification in an embodiment of the invention.
Fig. 2 is a schematic diagram of obtaining feature values by horizontal stroke crossing in an embodiment of the invention.
Fig. 3 is a schematic diagram of obtaining feature values by vertical stroke crossing in an embodiment of the invention.
Detailed description
Specific embodiments of the invention are described below with reference to the drawings, for a better understanding of the invention.
Embodiment
The structural block diagram of intelligent printer identification in this embodiment is shown in Fig. 1. The main steps are as follows. Features are extracted from Chinese character images produced by sample printers of different models and, through a learning process, the samples are trained into a library of feature values for the same Chinese characters as printed by each printer model used by the system; the feature values are stroke features and simplified Hu moment features. Statistical information about the Chinese character image (total pixel count and intersection counts) is obtained and matched against the feature libraries in turn to complete a coarse classification, whose result serves as the set of matching candidates for the next recognition step. Then, on the basis of the previous step, Hu moment feature values characterize the character in more depth, until the feature information uniquely identifies the Chinese character in the image.
The specific steps are: (1) font feature extraction and font library construction; (2) classifier design. In detail:
(1) Font feature extraction and font library construction: extracting the features of a Chinese character image means deriving, from the features of the image, a code word that represents it. Each code word corresponds to one Chinese character; the code words of different character images must differ, so that the character a code word represents is unique. Then, through training, a Chinese character library for each printer type is built using the same feature representation.
(1a) Extracting the stroke feature sequence: analysis of Chinese character features shows that stroke direction information reflects the composition of a character comprehensively, accurately, and stably. By collecting stroke statistics from the Chinese character images in the document under examination, different Chinese fonts can be distinguished and the printer type they belong to inferred. The specific steps are as follows:
Step 1: divide the Chinese character image into eight equal regions and count the black pixels (pixels with value 1) in each region in order, from left to right and from top to bottom. The black-pixel counts of the eight regions give eight feature values.
Step 2: obtain feature values by stroke crossing. Two horizontal and two vertical scan lines are used. The horizontal lines cross the image at 1/3 and 2/3 of its extent, and the number of black pixels each line passes through is recorded; the vertical lines are handled in the same way. This gives four more feature values, as shown in Fig. 2 and Fig. 3.
Step 3: count all black pixels in the image, which gives one more feature value. Together with the eight values from Step 1 and the four from Step 2, this yields 13 feature values in total.
(1b) Extracting moment features of the stroke sequence: the stroke feature sequence extracted above can in fact be regarded as N independent, identically distributed random variables. Once the stroke sequence is available, recognition can therefore be carried out by extracting moment features of the image. Image recognition using moment invariants is an important method in pattern recognition. In statistics, moments characterize the distribution of a random variable; in mechanics, they represent the spatial distribution of mass. If a binary or gray-level image is regarded as a two-dimensional density function, moment techniques can be applied to image analysis, and moments can then describe an image with features analogous to those used in statistics and mechanics. In recent years the invariance of moment values computed from two- and three-dimensional images has attracted attention in the image processing community. Many types of moments exist and have been applied to many aspects of image classification and recognition. For each stroke feature sequence, taking feature dimension and computation speed into account, we extract the first- and second-order moments of the discrete Hu moments as feature values. Because practical image processing usually works with discrete functions, studying invariant moments in the discrete case is the more meaningful. Let f(x, y) be a two-dimensional image function of size M × N. Its (p+q)-th order origin moment and central moment are defined as:

$$ m_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} x^{p} y^{q} f(x, y) \tag{1} $$

$$ \mu_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \tag{2} $$

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$ are the centroid coordinates of the region. The normalized central moment, denoted $\eta_{pq}$, is defined as:

$$ \eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma} \tag{3} $$

where $\gamma = (p+q)/2$, $p+q = 2, 3, \ldots$
From the second- and third-order normalized central moments, the following seven invariant moments can be derived. The higher the order of a central moment, the more shape detail it reflects, but the more sensitive it is to noise and the more costly it is to compute. In the discrete case $M_1$ remains rotation invariant, and it can be shown that the other six invariant moments are rotation invariant as well. In this embodiment the invariants $M_1$, $M_2$, $M_3$, $M_4$, which are comparatively cheap to compute, are selected. The invariant moments of an image do not change when the image undergoes a geometric transformation such as rotation, translation, or uniform scaling, and since $M_1$ to $M_4$ are not too expensive to compute, they are suitable as invariant parameters for recognition. We take $\varphi_1 = M_1$, $\varphi_2 = M_2$, $\varphi_3 = M_3$, $\varphi_4 = M_4$ as the first four feature quantities.

$$
\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
M_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\tag{4}
$$
(1c) Building the standard font library:
Features are extracted from Chinese character images produced by sample printers of different models and, through a learning process, the samples are trained into a library of feature values for the same Chinese characters as printed by each printer model used by the system. The most commonly used standard Chinese characters are taken as the objects, in the common fonts SongTi, FangSong, KaiTi, HeiTi, LiShu, and YouYuan, at font sizes No. 1 through No. 6. Simplified Hu moment feature values are chosen, and a character to be identified is handled by two-level coding. First, statistical information about the Chinese character image (total pixel count and intersection counts) is obtained and matched against the feature libraries in turn to complete a coarse classification, whose result serves as the set of matching candidates for the next recognition step. Second, on the basis of the previous step, Hu moment feature values characterize the character in more depth, until the feature information uniquely identifies the Chinese character in the image.
(2) Classifier design:
In this embodiment the classifier is designed to identify the printer type; that is, the printer that produced a document is identified by comparing the feature values of the character under examination with those in the standard font library. Given the many practical constraints, a minimum-distance classifier is still commonly chosen for large character set recognition problems. This embodiment adopts a two-stage strategy of coarse and fine classification, based on confidence analysis, to decide the class of the Chinese character to be identified.
(2a) Coarse classification: the purpose of coarse classification is to quickly select, from a large character set, a comparatively small subset of candidates while keeping the probability that the candidate set contains the correct class of the character as high as possible. This requires a coarse classifier that is simple in structure and fast. To this end we designed a Euclidean distance classifier. Let $M_i$ be the i-th Hu moment feature value of the font to be identified and $M_i^{k}$ the mean of the i-th standard Hu moment feature of the k-th font. The font to be identified is assigned to font $k_0$ when the following condition is met, where G is the number of font classes:

$$ k_0 = \arg\min_{1 \le k \le G} \left\{ \sum_{i=1}^{4} \left( M_i - M_i^{k} \right)^2 \right\} \tag{5} $$
(2b) Fine classification: the Bayes classifier is, in theory, the optimal statistical classifier, and in practical problems one tries to approximate it as closely as possible. When the character features follow a Gaussian distribution and the prior probabilities of the class feature distributions are equal, the Bayes classifier reduces to the Mahalanobis distance classifier. In practice, however, this condition is usually hard to satisfy, and the performance of the Mahalanobis distance classifier degrades severely as estimation errors arise in the covariance matrix. The modified quadratic discriminant function (MQDF) is therefore adopted as the fine-classification measure. It is a variant of the Mahalanobis distance, and its functional form is:
$$g_i(X) = \frac{1}{h^2} \left\{ \sum_{i=1}^{d} (x_i - m_{ij})^2 - \sum_{i=1}^{K} \left( 1 - \frac{h^2}{\lambda_{ij}} \right)^2 \left[ (X - M_i)^T \varphi_{ij} \right]^2 + \ln \left( h^{2(d-K)} \prod_{j=1}^{K} \lambda_{ij} \right) \right\} \tag{6}$$
where $\lambda_{ij}$ and $\varphi_{ij}$ are, respectively, the i-th eigenvalue and eigenvector of the covariance matrix of the samples of class j; K is the number of retained principal eigenvectors, i.e. the dimension of the principal subspace of the pattern class, whose optimal value is determined by experiment; and $h^2$ is an experimental estimate of the small eigenvalues. MQDF produces a quadratic decision surface. Because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative effect of estimation errors in the small eigenvalues is avoided. The MQDF discriminant distance can be regarded as a weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d-K)-dimensional space, with weighting factor $1/h^2$.
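As a rough illustration of the idea behind MQDF, the sketch below evaluates the discriminant in the form commonly given in the character-recognition literature, which may differ in detail from equation (6) as printed. It assumes the class mean, the K retained eigenvalues and their unit-length eigenvectors have already been estimated; the function name `mqdf` and its argument layout are hypothetical.

```python
import math


def mqdf(x, mean, eigvals, eigvecs, h2):
    """Commonly published MQDF form (an assumption, see the note above).

    x, mean -- feature vector and class mean, both of length d
    eigvals -- the K largest eigenvalues of the class covariance matrix
    eigvecs -- the K matching unit eigenvectors, each of length d
    h2      -- constant that replaces the (d - K) small eigenvalues
    """
    d, k = len(x), len(eigvals)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    sq_norm = sum(v * v for v in diff)  # squared Euclidean distance to the mean
    correction = 0.0
    for lam, phi in zip(eigvals, eigvecs):
        proj = sum(a * b for a, b in zip(diff, phi))  # projection on eigenvector
        correction += (1.0 - h2 / lam) * proj * proj
    log_det = sum(math.log(lam) for lam in eigvals) + (d - k) * math.log(h2)
    return (sq_norm - correction) / h2 + log_det


# Toy class: large variance along the first axis, small variance elsewhere.
along = mqdf([2.0, 0.0], [0.0, 0.0], [4.0], [[1.0, 0.0]], h2=1.0)
across = mqdf([0.0, 2.0], [0.0, 0.0], [4.0], [[1.0, 0.0]], h2=1.0)
```

In this toy case a displacement along the principal axis yields a smaller discriminant value than the same displacement across it, which is the behaviour the weighted-sum interpretation describes.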
(2c) Confidence calculation: let the output candidate set of the coarse classifier be $\{(c_1, d_1), (c_2, d_2), \ldots, (c_n, d_n)\}$, where n is the size of the candidate set and $c_n$ and $d_n$ are a candidate character and its coarse classification distance.
The role of the fine classifier is to re-rank the coarse candidate set using recomputed distances and so find the most probable class of the input character. If the coarse classification result is sufficiently reliable, in other words if $c_1$ is the correct class of the input character, fine classification need not be carried out at all. Whether fine classification is needed is decided from the confidence $f_{con}$ of the coarse result, which is computed from the output distances as follows:
$$f_{con} = \frac{d_2 - d_1}{d_1} \tag{7}$$
When the confidence is below a certain threshold, the coarse candidate set is passed to the fine classifier for processing; otherwise the coarse classification result is output directly.
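The confidence test of equation (7) and the resulting two-stage decision can be sketched as below. The threshold value, the function names and the `fine_distance` callback are assumptions made for the example; the text does not specify a threshold.

```python
def confidence(candidates):
    """Equation (7): relative gap between the two nearest coarse distances.

    candidates -- list of (label, distance) pairs sorted by ascending distance
    """
    d1, d2 = candidates[0][1], candidates[1][1]
    if d1 == 0:
        return float("inf")  # exact match on the first candidate
    return (d2 - d1) / d1


def two_stage_decide(candidates, fine_distance, threshold=0.5):
    """Return the coarse winner if it is confident enough, else re-rank.

    fine_distance -- hypothetical callable: label -> fine-classifier distance
    threshold     -- illustrative value only
    """
    if len(candidates) < 2 or confidence(candidates) >= threshold:
        return candidates[0][0]
    # Low confidence: let the fine classifier choose among the candidates.
    return min((label for label, _ in candidates), key=fine_distance)


fine = {"A": 5.0, "B": 2.0}  # made-up fine distances
clear_case = two_stage_decide([("A", 1.0), ("B", 3.0)], fine.get)
close_case = two_stage_decide([("A", 1.0), ("B", 1.1)], fine.get)
```

In the first call the gap is large, so the coarse result is returned directly; in the second the two distances are nearly equal and the fine distances decide.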
The effect of this embodiment is described below:
To verify the effectiveness of the method, six commonly used Chinese typefaces were chosen: Song, regular script (Kai), Hei, imitation Song (Fangsong), Lishu and Youyuan. Each typeface comes in four styles (regular, bold, italic and bold italic), giving 24 fonts in total. The training and test samples fall into two classes: document images generated by computer and document images obtained with a scanner. The computer-generated document images were produced with Photoshop CS4 at a resolution of 72 pixels/inch in grayscale mode. The scanned images were obtained with an HP scanner at a scanning resolution of 96 dpi in grayscale mode. The number of sample sets used for training and testing for each typeface is given in Table 1; each sample set contains the 3755 level-1 Chinese characters of the GB standard. The fonts of the Chinese characters were identified with the two classifiers at sequence lengths of 30000 and 50000; the results are given in Tables 2 and 3, where each row corresponds to the true class of the samples and each column to the recognition result.
Table 1. Number of sample sets used for training and testing
            Song    Regular script    Hei    Imitation Song    Lishu    Youyuan
Training    180     60                100    50                70       90
Test        20      35                40     20                10       30
Table 2. Source-printer identification results for printed documents (sequence length 30000)
Figure BSA0000097112550000101
Table 3. Source-printer identification results for printed documents (sequence length 50000)
Figure BSA0000097112550000102
In this embodiment, the proposed method shows a clear advantage in overall recognition results. The experiments are all based on the average statistical characteristics of the random distribution of stroke features: the longer the sequence, the better the averaged statistics. Once the sequence length reaches a certain value, however, the recognition rate changes little. In addition, the Euclidean distance classifier performs much worse than the MQDF classifier. The reason is that its standard pattern for each class is simply the mean of that class's samples, and information such as the variance is not used. The MQDF classifier emphasizes the principal information by truncating the small eigenvalues, and thereby raises the recognition rate considerably.
The above is a preferred embodiment of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also regarded as falling within the scope of protection of the invention.

Claims (1)

1. A print file identification method based on print font library analysis, characterized in that the main steps comprise: extracting features of Chinese character images from sample printers of different models; through a learning process, training the samples into the library of feature values, used by the system, that corresponds to the same Chinese characters printed by different printer models, namely stroke features and simplified Hu moment feature values; obtaining statistical information such as the total pixel count and crossing counts of the Chinese character image and matching it against the feature Chinese character library in turn to complete the coarse classification of the character library, the result serving as the matching object for the next recognition step; then, on the basis of the previous step, characterizing the Chinese character more deeply with the Hu moment feature values until the characterized feature information can uniquely identify the Chinese character in the image;
The detailed process is:
(1) Font feature extraction and font library establishment: extracting the features of a Chinese character image means deriving, from the features of the image, a codeword that can represent it, each codeword corresponding to one Chinese character; through training and learning with the same feature representation, a Chinese character library belonging to this type of feature is established; the steps are:
(1a) Extracting the stroke feature sequence: analysis of Chinese character features shows that stroke-direction cues reflect the structural information of a Chinese character comprehensively, accurately and stably; by collecting statistics on the stroke features of the Chinese characters in the document image under examination, different Chinese fonts are distinguished and the printer type to which they belong is thereby determined; the specific steps are as follows:
Step 1: the Chinese character image is divided equally into eight regions and, proceeding from left to right and from top to bottom, the black pixels (pixels with value 1) in each region are counted in turn; the black-pixel counts of the eight regions give eight feature values;
Step 2: feature values are obtained by stroke crossing: two horizontal crossings and two vertical crossings are used; the horizontal crossings are made at the 1/3 and 2/3 positions and the number of black points passed through is recorded; the vertical crossings are made in the same way; this gives four more feature values;
Step 3: all black pixels in the image are counted, giving one more feature value; together with the eight feature values from Step 1 and the four from Step 2, there are 13 feature values in total;
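The three steps above can be sketched as follows. This is only an illustration under assumptions: the text does not say how the eight regions are laid out, so a grid of 2 rows by 4 columns is assumed here, and the function name `stroke_features` and the list-of-lists image format are hypothetical.

```python
def stroke_features(img, rows=2, cols=4):
    """Return the 13 stroke features of a binary character image.

    img -- list of equal-length rows holding 0 (white) or 1 (black)
    The 2 x 4 region grid is an assumption, not taken from the patent.
    """
    h, w = len(img), len(img[0])
    feats = []
    # Step 1: black-pixel count in each of the eight regions,
    # visited left to right, then top to bottom.
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            feats.append(sum(img[y][x] for y in range(y0, y1)
                             for x in range(x0, x1)))
    # Step 2: black pixels met along two horizontal scan lines ...
    for third in (1, 2):
        feats.append(sum(img[third * h // 3]))
    # ... and along two vertical scan lines at 1/3 and 2/3 of the width.
    for third in (1, 2):
        x = third * w // 3
        feats.append(sum(row[x] for row in img))
    # Step 3: total number of black pixels in the image.
    feats.append(sum(sum(row) for row in img))
    return feats


solid = [[1] * 8 for _ in range(6)]  # toy 6 x 8 all-black image
features = stroke_features(solid)
```

For the all-black toy image each region holds 6 pixels, each horizontal line crosses 8 black pixels, each vertical line crosses 6, and the total is 48.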
(1b) Extracting the moment features of the stroke sequence: for each stroke feature sequence, taking feature dimensionality and computation speed into account, the first- and second-order discrete Hu moments are extracted as feature values. Image processing uses discrete functions: let $f(x, y)$ be a two-dimensional image function; its $(p+q)$-order origin moment $m_{pq}$ and central moment $\mu_{pq}$ are defined as:
$$m_{pq} = \sum_{m=1}^{M} \sum_{n=1}^{N} x^p y^q f(x, y) \tag{1}$$
$$\mu_{pq} = \sum_{m=1}^{M} \sum_{n=1}^{N} (x - \bar{x})^p (y - \bar{y})^q f(x, y) \tag{2}$$
where the sums run over the pixels of the $M \times N$ image and $(\bar{x}, \bar{y})$, with $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$, are the centroid coordinates of the region. The normalized central moment, denoted $\eta_{pq}$, is defined as:
$$\eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma} \tag{3}$$
where $\gamma = (p+q)/2$, $p + q = 2, 3, \ldots$
Using the second- and third-order normalized central moments, seven invariant moments can be derived. The higher the order of a central moment, the more shape detail it reflects, but the more sensitive it is to noise and the more costly it is to compute; moreover, in the discrete case only $M_1$ remains rotation invariant. The invariants with the smaller computational cost, $M_1$, $M_2$, $M_3$ and $M_4$, are therefore selected. The invariant moments of an image remain unchanged when the image undergoes an affine transformation, that is, their values do not change when the image is rotated, translated or uniformly scaled; since $M_1$, $M_2$, $M_3$ and $M_4$ are not too costly to compute, they are suitable as invariant parameters for identifying the target, and $\varphi_1 = M_1$, $\varphi_2 = M_2$, $\varphi_3 = M_3$, $\varphi_4 = M_4$ are chosen as the first four feature quantities:
$$\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
M_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned} \tag{4}$$
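A minimal sketch of how the four selected invariants might be computed from a binary character image is given below. It is an illustration, not the patent's code: the function name `hu_first_four` and the list-of-lists image format are hypothetical, and the normalization uses the exponent (p+q)/2 + 1 that is conventional in the image-moment literature, which is an assumption wherever it differs from the exponent stated in the text.

```python
def hu_first_four(img):
    """Return [M1, M2, M3, M4] for an image given as rows of pixel values."""
    pts = [(x, y, v) for y, row in enumerate(img)
           for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pts)
    if m00 == 0:
        raise ValueError("image contains no foreground pixels")
    # Centroid from the first-order origin moments.
    xb = sum(x * v for x, _, v in pts) / m00
    yb = sum(y * v for _, y, v in pts) / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by mu_00 ** ((p + q) / 2 + 1)
        # (conventional exponent, assumed here); mu_00 equals m00.
        mu = sum((x - xb) ** p * (y - yb) ** q * v for x, y, v in pts)
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    m1 = n20 + n02
    m2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    m3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    m4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    return [m1, m2, m3, m4]


# The same L-shaped stroke at two positions (toy data).
shape = [[1, 0, 0],
         [1, 0, 0],
         [1, 1, 0]]
shifted = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 1, 0],
           [0, 0, 0, 0, 0]]
a, b = hu_first_four(shape), hu_first_four(shifted)
```

Because central moments are taken about the centroid, the two feature vectors agree up to rounding, which illustrates the translation invariance mentioned in the text.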
(1c) Establishing the standard font library: features of Chinese character images from sample printers of different models are extracted and, through a learning process, the samples are trained into the library of feature values, used by the system, that corresponds to the same Chinese characters printed by different printer models; the most commonly used standard Chinese characters are taken as the objects, in the common typefaces Song, imitation Song (Fangsong), regular script (Kai), Hei, Lishu and Youyuan, and in font sizes No. 1 to No. 6; simplified Hu moment feature values are selected; a Chinese character to be identified is handled by two-level coding: first, statistical information such as the total pixel count and crossing counts of the character image is obtained and matched against the feature Chinese character library in turn to complete the coarse classification of the character library, the result serving as the matching object for the next recognition step; second, on the basis of the previous step, the Chinese character is characterized more deeply with the Hu moment feature values until the characterized feature information can uniquely identify the Chinese character in the image;
(2) Classifier design: the feature values of the characters under examination are compared with those in the standard font library to identify the type of printer that produced the document; under various practical constraints, a minimum-distance classifier is selected for the large-character-set recognition problem; a two-stage strategy of coarse and fine classification based on confidence analysis is adopted to decide the class of the Chinese character to be identified:
(2a) Coarse classification: a Euclidean distance classifier is designed; let $M_i$ be the i-th Hu moment feature value of the font to be identified and $M_i^k$ the mean of the i-th standard Hu moment feature of the k-th font; the font to be identified is assigned to the $k_0$-th font when the following condition is met, where G is the number of font classes;
$$k_0 = \arg\min_{1 \le k \le G} \left\{ \sum_{i=1}^{4} \left( M_i - M_i^k \right)^2 \right\} \tag{5}$$
(2b) Fine classification: the modified quadratic discriminant function (MQDF) is adopted as the fine-classification measure; it is a variant of the Mahalanobis distance, and its functional form is:
$$g_i(X) = \frac{1}{h^2} \left\{ \sum_{i=1}^{d} (x_i - m_{ij})^2 - \sum_{i=1}^{K} \left( 1 - \frac{h^2}{\lambda_{ij}} \right)^2 \left[ (X - M_i)^T \varphi_{ij} \right]^2 + \ln \left( h^{2(d-K)} \prod_{j=1}^{K} \lambda_{ij} \right) \right\} \tag{6}$$
where $\lambda_{ij}$ and $\varphi_{ij}$ are, respectively, the i-th eigenvalue and eigenvector of the covariance matrix of the samples of class j; K is the number of retained principal eigenvectors, i.e. the dimension of the principal subspace of the pattern class, whose optimal value is determined by experiment; and $h^2$ is an experimental estimate of the small eigenvalues; MQDF produces a quadratic decision surface, and because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative effect of estimation errors in the small eigenvalues is avoided; the MQDF discriminant distance can be regarded as a weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d-K)-dimensional space, with weighting factor $1/h^2$;
(2c) Confidence calculation: let the output candidate set of the coarse classifier be $\{(c_1, d_1), (c_2, d_2), \ldots, (c_n, d_n)\}$, where n is the size of the candidate set and $c_n$ and $d_n$ are a candidate character and its coarse classification distance; if the coarse classification result is sufficiently reliable, in other words if $c_1$ is the correct class of the input character, fine classification need not be carried out; whether fine classification is needed is decided from the confidence $f_{con}$ of the coarse result, which is computed from the output distances as follows:
$$f_{con} = \frac{d_2 - d_1}{d_1} \tag{7}$$
When the confidence is below the set threshold, the coarse candidate set is passed to the fine classifier for processing; otherwise the coarse classification result is output directly.
CN201310538041.1A 2013-10-29 2013-10-29 The mimeograph documents discrimination method analyzed based on printing character library Active CN103810484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538041.1A CN103810484B (en) 2013-10-29 2013-10-29 The mimeograph documents discrimination method analyzed based on printing character library

Publications (2)

Publication Number Publication Date
CN103810484A (en) 2014-05-21
CN103810484B (en) 2017-10-10

Family

ID=50707225

Country Status (1)

Country Link
CN (1) CN103810484B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1110801A (en) * 1994-09-27 1995-10-25 张志国 Processing system for script characters
CN1210296A (en) * 1997-08-29 1999-03-10 王伟 Korean language database under UCDOS platform and its inputting method

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468090A (en) * 2014-11-12 2015-03-25 辽宁大学 Chinese character password encoding method based on image pixel coordinates
CN104468090B (en) * 2014-11-12 2017-07-28 辽宁大学 Character cipher coding method based on image pixel coordinates
CN105930763A (en) * 2015-02-27 2016-09-07 联想(新加坡)私人有限公司 Ink Stroke Grouping Method And Product Based On Stroke Attributes
CN105930763B (en) * 2015-02-27 2019-07-26 联想(新加坡)私人有限公司 The method and product of handwritten stroke grouping based on stroke property
CN104965928A (en) * 2015-07-24 2015-10-07 北京航空航天大学 Chinese character image retrieval method based on shape matching
CN104965928B (en) * 2015-07-24 2019-01-22 北京航空航天大学 One kind being based on the matched Chinese character image search method of shape
CN105825211B (en) * 2016-03-17 2019-05-31 世纪龙信息网络有限责任公司 Business card identification method, apparatus and system
CN105825211A (en) * 2016-03-17 2016-08-03 世纪龙信息网络有限责任公司 Method, device and system for recognizing name card
CN106530317A (en) * 2016-09-23 2017-03-22 南京凡豆信息科技有限公司 Stick figure computer scoring and auxiliary coloring method
CN106530317B (en) * 2016-09-23 2019-05-24 南京凡豆信息科技有限公司 A kind of scoring of simple picture computer and auxiliary painting methods
CN111027345A (en) * 2018-10-09 2020-04-17 北京金山办公软件股份有限公司 Font identification method and apparatus
CN110781727A (en) * 2019-09-12 2020-02-11 中国刑事警察学院 Laser printing file quantitative inspection method based on image physical measurement indexes
CN110781727B (en) * 2019-09-12 2022-06-17 中国刑事警察学院 Laser printing file quantitative inspection method based on image physical measurement indexes
CN110837326A (en) * 2019-10-24 2020-02-25 浙江大学 Three-dimensional target selection method based on object attribute progressive expression
CN111507332A (en) * 2020-04-17 2020-08-07 上海眼控科技股份有限公司 Vehicle VIN code detection method and equipment
CN113761231A (en) * 2021-09-07 2021-12-07 浙江传媒学院 Text character feature-based text data attribution description and generation method

Also Published As

Publication number Publication date
CN103810484B (en) 2017-10-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant