CN106503706B - The method of discrimination of Chinese character pattern cutting result correctness - Google Patents

Info

Publication number
CN106503706B
Authority
CN
China
Prior art keywords
font
stroke
component
value
cutting result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610847230.0A
Other languages
Chinese (zh)
Other versions
CN106503706A (en
Inventor
宋伟康
连宙辉
唐英敏
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method for discriminating the correctness of the results of a font cutting algorithm, and belongs to the field of automatic extraction of Chinese character strokes and components. The method applies, in sequence, a discrimination process based on font reconstruction, a discrimination process based on component classification, a discrimination process based on font attributes, and a discrimination process based on the glyph skeleton. When any one of these processes judges the font cutting result under test to be a wrong cutting result, that result is determined to be a font miscut. The method can identify more than 97% of wrong cutting results, so it discriminates miscuts effectively.

Description

Method for discriminating the correctness of Chinese character font cutting results
Technical field
The invention belongs to the field of automatic extraction of Chinese character strokes and components, and relates to a method for discriminating Chinese character font cutting results. Specifically, it uses four discrimination algorithms, based on font reconstruction, font attributes, component classification, and the glyph skeleton, to judge the correctness of the cutting results produced by a font cutting algorithm.
Background art
Font cutting techniques for Chinese characters include the automatic extraction of strokes and the automatic extraction of components. Font cutting was originally a preprocessing step in optical character recognition, where the cutting results assisted the recognition of Chinese characters. With the continuing development of font computing, Chinese character font cutting has become a core technique in research topics such as automatic font library generation, handwriting verification, assisted Chinese character writing, and digital ink, and related research has developed rapidly.
Document Sun 2014 (Sun, Hao, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. "Non-rigid point set registration for Chinese characters using structure-guided coherent point drift." 2014 IEEE International Conference on Image Processing (ICIP), pp. 4752-4756. IEEE, 2014.) describes a font cutting method based on a non-rigid point set registration algorithm guided by structural information. The method cuts a font in four steps. First, skeleton extraction is performed on the font to be cut and on its corresponding template font, which comes from a regular-script database annotated with component information. The two skeleton extraction results are called the data point set and the template point set. Second, the data point set, the template point set, and the component membership of the template point set are fed to the structure-guided non-rigid point set registration algorithm, which outputs the component membership of the data point set. Third, the membership of the data point set is converted into the membership of the contour segments of the data font. At this stage the contour segments belonging to each component may be broken and unclosed. Fourth, the contour segments generated in the previous step are correctly closed to obtain complete components.
The sub-fonts obtained by cutting a font are called "components". Because of the complexity of Chinese character shapes, the variability of handwriting, and the limited accuracy of the algorithms, the cutting results produced by a font cutting method are not always completely correct. For example, the prior art can hardly judge the correctness of the cutting results produced by the font cutting method described in Sun 2014. A technique for discriminating the correctness of the cutting results of a font cutting method is therefore currently lacking.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method for discriminating the correctness of the results of a Chinese character font cutting algorithm. It can judge the correctness of results produced by the font cutting algorithm proposed in Sun 2014, some of which contain wrong cuts.
For convenience of description, the following terms are defined:
Component: a result obtained by cutting a font.
Average stroke width: the number of all black pixels in the Chinese character image divided by the number of edge points of the Chinese character image.
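The definition of average stroke width can be illustrated with a minimal Python sketch. The text does not say how edge points are found, so the sketch assumes an edge point is a black pixel with at least one 4-connected neighbour that is background or lies outside the image. The function name `average_stroke_width` and the list-of-rows image format are illustrative choices, not part of the patent.

```python
def average_stroke_width(image):
    """image: list of rows; 1 = black (ink) pixel, 0 = background."""
    h, w = len(image), len(image[0])

    def is_ink(r, c):
        return 0 <= r < h and 0 <= c < w and image[r][c] == 1

    black = 0
    edge = 0
    for r in range(h):
        for c in range(w):
            if image[r][c] != 1:
                continue
            black += 1
            # Assumption: an edge point is an ink pixel with at least one
            # 4-connected neighbour that is background or outside the image.
            neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
            if not all(is_ink(r + dr, c + dc) for dr, dc in neighbours):
                edge += 1
    # Black pixel count divided by edge point count, as in the definition.
    return black / edge if edge else 0.0
```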
The technical solution provided by the present invention is:
A method for discriminating the correctness of Chinese character font cutting results, comprising in sequence a discrimination process based on font reconstruction, a discrimination process based on component classification, a discrimination process based on font attributes, and a discrimination process based on the glyph skeleton. Specifically:
1) The discrimination process based on font reconstruction:
Each Chinese character font is cut into components, and the components are spliced back together at their original positions. The spliced font is compared with the original font pixel by pixel, and the number of differing pixels is counted to give the difference pixel value. A difference pixel value threshold is set, and the font is judged to be a wrong or a correct cutting result according to this threshold;
2) The discrimination process based on component classification:
First, a component dataset consisting of correct components is built, and a component classifier is trained on it. The trained classifier is then used to classify the font cutting result under test, and the classification result indicates whether the component is correct or incorrect;
3) The discrimination process based on font attributes:
The font attributes that a correct component should have are set. When a component does not satisfy a font attribute, it is judged to be the product of a font miscut;
4) The discrimination process based on the glyph skeleton:
The middle section of the contour of each stroke in the Chinese character font is tested for smoothness. When the contour changes abruptly in the middle section of a stroke, the font cutting result is judged wrong.
For the method for discrimination of above-mentioned Chinese character pattern cutting result correctness, further, rebuild described based on font Differentiation process, the differentiation process based on part classification, the differentiation process based on font attribute and the differentiation based on glyph skeleton It in the process, is mistake cutting for the judgement result that font cutting result to be discriminated is differentiated when any differentiation process When as a result, the font cutting result to be discriminated is determined as font miscut.
For the method for discrimination of above-mentioned Chinese character pattern cutting result correctness, further, it is described based on font rebuild During differentiation, the difference pixel value threshold value is set as square of stroke width.
For the method for discrimination of above-mentioned Chinese character pattern cutting result correctness, further, it is described based on font rebuild Differentiation process specifically comprises the following steps:
11) according to the size of former font, a difference value matrix equal with former word size, the difference value matrix are generated All elements be initialized as 0;
12) each component that traversal cutting obtains as a result, according to the position in original image, the image of component is corresponding To one of difference value matrix and the equal sized region of image of component, then by the corresponding region of image of component and difference value matrix Carry out the cumulative of pixel scale;The cumulative of all components is completed, difference value matrix is obtained;
13) an absent region matrix and an extraneous region matrix equal with former word size are generated respectively;It is described The all elements of two matrixes are initialized as 0, while traversing each pixel of former font and the correspondence of the difference value matrix Position pixel;Former font pixel value is set in two kinds of situation;The first situation be former font pixel value be 0, another situation is that Former font pixel value is 1;For the first situation, when the difference value matrix respective pixel value is 1, by the extraneous region The value of matrix corresponding position is set as 1, is otherwise set as 0;It, will when the difference value matrix respective pixel is 0 for second situation The value of the absent region matrix corresponding position is set as 1, is otherwise set as 0;
14) connected region detection is carried out to the absent region matrix and extraneous region matrix respectively, obtains two matrixes The number of pixels of all connected regions;Connected region pixel threshold is set, when the sum of all pixels of any one connected region is super When crossing the connected region pixel threshold, which is determined as wrong cutting result, is otherwise determined as correct cutting result.
Further, in the discrimination process based on component classification, the training of the component classifier comprises the following steps:
21) Image preprocessing:
Component images are scaled by non-rigid scaling. Each component image is normalized to a square image whose side length is denoted L;
22) Select local sub-images and extract local features from them to obtain multiple local feature vectors. The number of local feature vectors is denoted num_lf;
23) Dictionary construction:
From the local feature vectors obtained, randomly sample a number of local features to form the total feature set. The number of sampled local features should be greater than 10000 and can be at most the total number of local feature vectors, num_lf. Use the K-means clustering algorithm to obtain num_k cluster centres, which serve as the sparse dictionary. The value of num_k ranges from 256 to the total number of local features, num_lf;
24) Sparse representation:
Using the sparse dictionary obtained in the previous step, encode all local features of a component with a sparse coding algorithm. Then combine all the encoded local features by max pooling to obtain a sparse representation feature of dimension num_k, so that the number of dimensions equals the number of cluster centres;
25) Classifier training: train on the sparse representation features with a linear support vector machine (SVM) to obtain the component classifier.
Further, in the discrimination process based on component classification, the classification uses the classifier to compute the component category of the font cutting result under test and checks whether it matches the category the component should belong to. When the classification result differs from the category the component should belong to, the font is judged to be a wrong cutting result. When the two are the same, the font is judged to be a correct cutting result.
Further, the local feature extraction of step 22) performs the following operations:
22a. Cut the local sub-image with a uniform n*n grid to obtain n*n regions. If the side length of the local sub-image is L_sub, the side length of each region is L_sub divided by n. The n*n regions divide the local sub-image evenly, do not overlap, and together make up exactly the local sub-image before cutting. Convolve each region with the Sobel operator to obtain magnitude and phase results;
22b. Divide the phase range uniformly into n*n intervals. For each interval, sum the magnitudes of the pixels in the region whose phase falls in that interval. This gives a local feature of n*n dimensions;
22c. Concatenate the n*n-dimensional local features of the n*n regions to obtain a multi-dimensional local feature.
Further, in the discrimination process based on font attributes, the font attributes include a component size attribute and a component area attribute.
Further, the discrimination process based on the glyph skeleton comprises the following steps:
41) Let N be the number of strokes in a component obtained from a font cutting result. For each point on the component contour, obtain N values, each being the nearest distance from that contour point to one of the N stroke skeletons;
42) Take the minimum of the nearest distances between a contour point and all stroke skeletons. The stroke skeleton that gives the minimum is the stroke the contour point belongs to, and the minimum is the distance between the contour point and that stroke skeleton;
43) Build the contour point sets: set up N contour point sets, initialized as empty, where the i-th set represents the contour of the i-th stroke. Traverse all contour points and add each contour point that belongs to the i-th stroke to the i-th set. After the sets are built, remove from each set the contour points whose nearest stroke skeleton point lies in the first M% or the last M% of the stroke. The remaining contour points are those of the middle section of the stroke;
44) For each set, compute the mode of the distances from its contour points to the stroke skeleton they belong to, and take it as the average stroke width of that stroke section. When the distance from a contour point to its stroke skeleton exceeds K times the average stroke width, the contour point is judged to be an abrupt contour point. When the number of abrupt contour points exceeds the preset abrupt contour point count threshold, the component is judged to be a wrongly cut component.
Further, in step 41) the nearest distance from a contour point to each of the N stroke skeletons is computed as follows: for a contour point, traverse the N stroke skeletons; for each stroke skeleton, traverse all of its skeleton points and compute the distance from the contour point to each; the minimum of these distances is the nearest distance from the contour point to the stroke skeleton currently being traversed. In step 43), M ranges from 0 to 50. In step 44), K ranges from 0.8 to 3. The abrupt contour point count threshold is X times the average stroke width, where X ranges from 0.7 to 3.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a method for discriminating the correctness of the results of a font cutting algorithm. The method applies, in sequence, a discrimination process based on font reconstruction, a discrimination process based on component classification, a discrimination process based on font attributes, and a discrimination process based on the glyph skeleton. When any one of these processes judges the font cutting result under test to be a wrong cutting result, that result is determined to be a font miscut. The method can identify more than 97% of wrong cutting results, so it discriminates miscuts effectively.
Brief description of the drawings
Fig. 1 shows examples of font cutting results;
where (a) is a correct font cutting result produced by the font cutting technique, and (b) is a wrong font cutting result produced by the font cutting technique.
Fig. 2 is a flow diagram of the method for discriminating font cutting result correctness provided by the invention.
Fig. 3 illustrates the glyph-skeleton-based discrimination of font cutting result correctness in an embodiment of the invention;
where (a) is the component image under test; (b) is the glyph skeleton of the component; (c) shows the edge points belonging to the fifth stroke of the component, a horizontal stroke; (d) shows the abrupt edge points that were detected.
Detailed description of the embodiments
The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
After the font cutting algorithm has run, the following are available: the original image of the Chinese character, the images of all components cut from the character, the category each component belongs to, and the strokes each component contains. The present invention provides a method for discriminating the correctness of the results of the font cutting algorithm. The method applies, in sequence, a discrimination process based on font reconstruction, a discrimination process based on component classification, a discrimination process based on font attributes, and a discrimination process based on the glyph skeleton. Fig. 2 is a flow diagram of the method. The detailed process is as follows:
1) The discrimination process based on font reconstruction
In this process, a Chinese character font is cut into components, and the components are spliced back together at their original positions. The spliced font is compared with the original font pixel by pixel, and the number of pixels that correspond to each other but have different values is counted. This count is called the difference pixel value. A difference pixel value threshold is set and the result is judged against it. The threshold can be set to the square of the stroke width. The process consists of the following four steps:
11) According to the size of the original font, generate a difference matrix of the same size as the original font, with all elements initialized to 0.
12) Traverse each component obtained by cutting. According to the component's position in the original image, map the component image to a region of the difference matrix that has the same size as the component image, then accumulate the component image into that region pixel by pixel. After all components have been accumulated, the final difference matrix is obtained.
13) According to the size of the original font, generate a missing-region matrix and an extra-region matrix, each of the same size as the original font and with all elements initialized to 0. Traverse each pixel of the original font together with the pixel at the corresponding position of the difference matrix. There are two cases: the original font pixel value is 0, or it is 1. In the first case, if the corresponding pixel of the difference matrix is 1, the value at the corresponding position of the extra-region matrix is set to 1, otherwise to 0. In the second case, if the corresponding pixel of the difference matrix is 0, the value at the corresponding position of the missing-region matrix is set to 1, otherwise to 0.
14) Perform connected region detection on the missing-region matrix and on the extra-region matrix to obtain the pixel count of every connected region in the two matrices. A connected region pixel threshold, which is a non-negative real number, is set. If the total pixel count of any connected region exceeds the threshold, the font is judged to be a wrong cutting result; otherwise it is judged to be a correct cutting result.
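Steps 11) to 14) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: images are lists of rows with 1 for ink and 0 for background, each component is a hypothetical `(top, left, image)` triple, connected regions use 4-connectivity (the text does not specify the connectivity), and any accumulated value of 1 or more is treated as covered so that overlapping components do not hide extra pixels.

```python
from collections import deque

def accumulate(shape, components):
    """Steps 11-12: sum all component images into a zero matrix."""
    h, w = shape
    acc = [[0] * w for _ in range(h)]
    for top, left, img in components:
        for r, row in enumerate(img):
            for c, v in enumerate(row):
                acc[top + r][left + c] += v
    return acc

def region_sizes(mask):
    """Pixel counts of the 4-connected regions of a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, size = deque([(r, c)]), 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

def reconstruction_is_wrong(original, components, threshold):
    """Steps 13-14: True if any missing or extra region exceeds the threshold."""
    h, w = len(original), len(original[0])
    acc = accumulate((h, w), components)
    # Missing: ink in the original that no component covers.
    missing = [[1 if original[r][c] == 1 and acc[r][c] == 0 else 0
                for c in range(w)] for r in range(h)]
    # Extra: component ink where the original is background.
    extra = [[1 if original[r][c] == 0 and acc[r][c] >= 1 else 0
              for c in range(w)] for r in range(h)]
    return any(size > threshold
               for size in region_sizes(missing) + region_sizes(extra))
```

The threshold argument would typically be the square of the stroke width, as suggested in the text.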
2) The discrimination process based on component classification
The purpose of the discrimination algorithm based on component classification is to measure how similar a font cutting result is to the component category it should belong to. The similarity is measured by the classification result of a classifier. If the classifier's result is the same as the component category the font cutting result should belong to, the similarity is considered high and the font cutting result correct. If the classifier's result differs from that category, the similarity is considered low and the font cutting result wrong. The basic idea of the algorithm is to first build a component dataset consisting of correct components, which records the category each component should belong to, then train a component classifier on this dataset, and finally use the trained classifier to classify the font cutting result under test. If the classification result differs from the category the component should belong to according to the component dataset, the component is judged incorrect. The training steps of the component classifier are as follows:
21) Image preprocessing
Component images are scaled by non-rigid scaling: each component image is normalized to a square image whose side length is denoted L. L is chosen so that the component image is not noticeably distorted; in practice it can be any integer between 64 and 256. One benefit of non-rigid scaling is that it indirectly corrects the component image.
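The normalization can be sketched as below. The sketch reads "non-rigid scaling" as scaling width and height independently so that any rectangle becomes an L by L square, and it uses nearest-neighbour sampling. Both points are assumptions, since the text names neither the interpolation nor the exact scaling rule, and `normalize_component` is a hypothetical name.

```python
def normalize_component(image, side):
    """Resize a binary image (list of rows) to side x side.

    Width and height are scaled independently (aspect ratio is not kept),
    using nearest-neighbour sampling.
    """
    h, w = len(image), len(image[0])
    return [[image[r * h // side][c * w // side] for c in range(side)]
            for r in range(side)]
```

For example, `normalize_component(img, 64)` would map a tall, narrow component and a short, wide one onto the same 64 by 64 grid.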
22) Local feature extraction
In local feature extraction, the local sub-images selected by this algorithm are square, with a side length of any value between one quarter and three quarters of the image size. In practice, L divided by 2 can be used. Overall, edge points of the component image are first randomly sampled, with 200 to 600 sample points selected per component. A local sub-image is extracted centred on each sample point, and a local feature is then extracted from each local sub-image as follows:
22a. Cut the local sub-image with a uniform 4 by 4 grid to obtain 16 regions. If the side length of the local sub-image is L_sub, the side length of each region is L_sub divided by 4. The 16 regions divide the local sub-image evenly, do not overlap, and together make up exactly the local sub-image before cutting. Then convolve each region with the Sobel operator to obtain magnitude and phase results;
22b. Divide the phase range uniformly into 16 intervals. For each interval, sum the magnitudes of the pixels in the region whose phase falls in that interval. After all pixels have been counted, a 16-dimensional local feature is obtained.
22c. Concatenate the 16-dimensional local features of the 16 regions to obtain a 256-dimensional local feature.
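Steps 22a to 22c amount to a grid of gradient-orientation histograms weighted by gradient magnitude. The sketch below is one possible reading in plain Python. As a simplification it applies the Sobel kernels (in correlation form) to the whole sub-image once and then splits the result into cells, whereas the text filters each region separately; pixels outside the sub-image are treated as 0. The names `sobel` and `local_feature` are illustrative.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel(patch):
    """Return (magnitude, phase) maps; pixels outside the patch count as 0."""
    h, w = len(patch), len(patch[0])
    mag = [[0.0] * w for _ in range(h)]
    pha = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        v = patch[rr][cc]
                        gx += SOBEL_X[dr + 1][dc + 1] * v
                        gy += SOBEL_Y[dr + 1][dc + 1] * v
            mag[r][c] = math.hypot(gx, gy)
            pha[r][c] = math.atan2(gy, gx)  # lies in [-pi, pi]
    return mag, pha

def local_feature(patch, grid=4, bins=16):
    """Concatenate one magnitude-weighted phase histogram per grid cell."""
    cell = len(patch) // grid  # assumes a square patch divisible by grid
    mag, pha = sobel(patch)
    feature = []
    for gr in range(grid):
        for gc in range(grid):
            hist = [0.0] * bins
            for r in range(gr * cell, (gr + 1) * cell):
                for c in range(gc * cell, (gc + 1) * cell):
                    # Map the phase to one of `bins` equal intervals.
                    b = int((pha[r][c] + math.pi) / (2 * math.pi) * bins) % bins
                    hist[b] += mag[r][c]
            feature.extend(hist)
    return feature  # grid * grid * bins values, 256 with the defaults
```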
23) Dictionary construction
After local feature extraction has been performed on all components in the component dataset, num_lf local feature vectors are available. A number of local features are randomly sampled from them to form the total feature set. The number of sampled local features should be greater than 10000 and can be at most the total number of local features. The K-means clustering algorithm is then used to obtain num_k cluster centres, which serve as the sparse dictionary. The value of num_k ranges from 256 to the total number of local features, num_lf.
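The dictionary step can be illustrated with a small K-means (Lloyd's algorithm) sketch in plain Python. The initialization by random sampling, the fixed iteration count, and the seed are assumptions made for the example; the text only states that K-means yields num_k cluster centres. A real dictionary with thousands of 256-dimensional features would normally use an optimized library instead.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Return k cluster centres for a list of equal-length vectors."""
    rng = random.Random(seed)
    # Assumption: initial centres are k distinct samples from the data.
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])),
            )
            groups[nearest].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers
```

Here `k` plays the role of num_k and the returned centres play the role of the sparse dictionary.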
24) Sparse representation
Using the sparse dictionary obtained in the previous step, all local features of a component are encoded with a sparse coding algorithm. All the encoded local features are then combined by max pooling to obtain a sparse representation feature of dimension num_k, so that the number of dimensions equals the number of cluster centres.
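The pooling half of this step is simple enough to sketch. The sparse coding algorithm itself is not named in the text, so the example takes the per-feature codes as given and only shows how max pooling collapses them into one num_k-dimensional vector. The name `max_pool` is illustrative.

```python
def max_pool(codes):
    """Element-wise maximum over the sparse codes of one component.

    codes: list of equal-length vectors, one num_k-dimensional code per
    local feature (produced by whatever sparse coder is used).
    """
    return [max(column) for column in zip(*codes)]
```

Whatever the number of local features sampled from a component, the pooled vector always has num_k entries, which is what lets a linear classifier consume it.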
25) Classifier training
The sparse representation features are used to train a linear support vector machine (SVM), which yields the component classifier.
3) The discrimination process based on font attributes
Based on the characteristics of Chinese characters and the rules by which components are defined, the discrimination algorithm based on font attributes sets a series of font attributes that a correct component should have. When a component does not satisfy one of these attributes, it is considered the product of a font miscut. The font attributes are defined as follows:
Component size attribute: the width and the height of a correct component must each be at least one times the average stroke width of the font the component belongs to. A component as defined by the font cutting technique contains at least one stroke, so the size of a component image obtained by cutting must be at least one times the average stroke width.
Component area attribute: the total pixel count of a connected region in the component image must be at least the square of one times the stroke width. The smallest stroke in a component is a dot, so a connected region of a component image obtained by cutting must be at least as large as one dot stroke. The size of the smallest dot is taken to be approximately the square of one times the stroke width.
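The two attributes can be checked with a short Python sketch. It makes several assumptions that the text leaves open: component width and height are measured on the bounding box of the ink pixels, "at least" is taken as greater than or equal, every 4-connected region must satisfy the area attribute, and an empty image fails. All function names are hypothetical.

```python
def bounding_box_size(image):
    """Return (width, height) of the ink bounding box, or (0, 0) if empty."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:
        return 0, 0
    return cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1

def region_sizes(image):
    """Pixel counts of the 4-connected ink regions (iterative flood fill)."""
    h, w = len(image), len(image[0])
    seen = set()
    sizes = []
    for r in range(h):
        for c in range(w):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            seen.add((r, c))
            stack, size = [(r, c)], 0
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and image[ny][nx] == 1 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes

def satisfies_attributes(image, stroke_width):
    """True if the component passes both the size and the area attribute."""
    width, height = bounding_box_size(image)
    if width < stroke_width or height < stroke_width:
        return False  # size attribute violated
    sizes = region_sizes(image)
    # Area attribute: every connected region is at least stroke_width squared.
    return bool(sizes) and all(s >= stroke_width ** 2 for s in sizes)
```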
4) The discrimination process based on the glyph skeleton
The discrimination algorithm based on the glyph skeleton relies on an inherent feature of Chinese characters: the middle section of a stroke is relatively smooth. The middle section of the contour of each stroke is tested for smoothness, and if the contour changes abruptly in the middle section of a stroke, the font cutting result is considered wrong. The specific steps are as follows:
41) set in the component that a font cutting result obtains that stroke number is N, then on each component outline Point, obtain N number of value, respectively represent minimum distance of the profile point apart from N number of stroke skeleton.Circular is: for One profile point, traverses N number of stroke skeleton, and each stroke skeleton traverses all skeletal points and profile point and these pens is calculated The distance for drawing skeletal point, the minimum distance for the stroke skeleton for taking the minimum value of distance currently to traverse as profile point distance.
42) minimum distance of a profile point and all stroke skeletons is minimized, gets the stroke skeleton of minimum value It is exactly the stroke that profile point should belong to, the minimum value got is exactly the distance of profile point and stroke skeleton.
43) N number of profile point set is set, empty set is initialized as, i-th of set represents the profile of i-th of stroke.Traversal institute This profile point is added to i-th of set if profile point belongs to i-th of stroke by some profile points.Complete the building of set Later, the profile point for originating M% and end M% that the nearest stroke skeletal point of distance in each set is stroke is removed, is left Be exactly stroke middle section profile point, the value range of M is 0 to 50.
44) it asks the profile point of each set to the mode of affiliated stroke skeleton distance, and thinks that this distance is the section The average stroke width of stroke.If profile point is recognized to K times that affiliated stroke skeleton distance has been more than average stroke width It is mutation profile for the profile point, the value range of K is 0.8 to 3.When the quantity of mutation profile point has been more than preset threshold This component is then determined as the component of mistake cutting by value, and mutation profile point amount threshold is X times of average stroke width, X's Value range is 0.7 to 3.
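Steps 41) to 44) can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: each stroke skeleton is an ordered list of (x, y) points running from the start of the stroke to its end, distances are rounded to whole pixels before the mode is taken, and the count threshold is applied per stroke using that stroke's own mode distance. The function names are hypothetical; the default values M = 20, K = 1 and X = 0.8 lie within the stated ranges.

```python
import math
from collections import Counter


def nearest_on_skeleton(point, skeleton):
    """Return (distance, index) of the skeleton point closest to `point`."""
    return min((math.dist(point, s), i) for i, s in enumerate(skeleton))


def is_miscut_by_skeleton(contour, skeletons, m_percent=20, k=1.0, x=0.8):
    """Return True when some stroke shows too many mutation contour points."""
    middle = [[] for _ in skeletons]  # step 43: one distance list per stroke
    for point in contour:
        # Step 41: nearest distance from the contour point to each stroke skeleton.
        nearest = [nearest_on_skeleton(point, s) for s in skeletons]
        # Step 42: the closest skeleton is the stroke the point belongs to.
        stroke = min(range(len(skeletons)), key=lambda i: nearest[i][0])
        distance, index = nearest[stroke]
        # Step 43: keep the point only if its nearest skeleton point lies outside
        # the first and last M% of the stroke (integer arithmetic avoids rounding).
        last = max(len(skeletons[stroke]) - 1, 1)
        if m_percent * last <= 100 * index <= (100 - m_percent) * last:
            middle[stroke].append(distance)
    for distances in middle:
        if not distances:
            continue
        # Step 44: the mode of the (rounded) distances stands for the stroke width.
        width = Counter(round(d) for d in distances).most_common(1)[0][0]
        mutated = sum(1 for d in distances if d > k * width)
        if mutated > x * width:  # per-stroke threshold is an assumption
            return True
    return False
```

For a straight stroke whose contour runs parallel to the skeleton, no point exceeds the mode distance and the component passes; a bump in the middle section raises the mutation count and the component is rejected.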
5) Set the classification decision method. In this method, a font cutting result is judged wrong as soon as any one of the above algorithms judges it wrong.
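The decision rule amounts to a logical OR over the individual algorithms. A minimal sketch follows, assuming each algorithm is wrapped as a function that returns True when it judges the cutting result wrong; the name `judge_cutting_result` and the `(name, function)` pairing are hypothetical. Stopping at the first failure saves work and does not change the verdict.

```python
def judge_cutting_result(glyph, checks):
    """Return (is_correct, name_of_failed_check) for one font cutting result.

    `checks` is an ordered list of (name, flags_error) pairs, where
    flags_error(glyph) returns True if that algorithm judges the result wrong.
    """
    for name, flags_error in checks:
        if flags_error(glyph):
            # Any single algorithm reporting an error makes the result wrong.
            return False, name
    return True, None
```

Returning the name of the failed check is a convenience of this sketch; it makes it easy to see which algorithm rejected a given cutting result.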
Fig. 1 shows examples of font cutting results, where (a) is a correct font cutting result obtained by the font cutting technique and (b) is a wrong font cutting result obtained by the font cutting technique. The following is an embodiment that judges the correctness of font cutting results using the method provided by the present invention.
1. Using the algorithm based on font reconstruction, count the number of pixels in each connected region of the absent region matrix and the extraneous region matrix. If the total number of pixels of any connected region exceeds the square of one stroke width, the font is judged to be a wrong cutting result and the algorithm terminates. Otherwise, proceed to 2. This specifically comprises the following steps:
a. According to the size of the original font, generate a difference value matrix of the same size as the original font, with all elements of the difference value matrix initialized to 0.
b. Traverse each component obtained by cutting. According to its position in the original image, map the component image to a region of the difference value matrix of the same size as the component image, then accumulate the component image into that region pixel by pixel. After all components have been accumulated, the final difference value matrix is obtained.
c. According to the size of the original font, generate an absent region matrix and an extraneous region matrix, each of the same size as the original font, with all elements of both matrices initialized to 0. Traverse each pixel of the original font together with the pixel at the corresponding position of the difference value matrix. There are two cases: in the first, the original font pixel value is 0; in the second, the original font pixel value is 1. In the first case, if the corresponding pixel value of the difference value matrix is 1, set the value at the corresponding position of the extraneous region matrix to 1; otherwise set it to 0. In the second case, if the corresponding pixel value of the difference value matrix is 0, set the value at the corresponding position of the absent region matrix to 1; otherwise set it to 0.
d. Perform connected region detection on the absent region matrix and the extraneous region matrix separately, obtaining the pixel count of every connected region of the two matrices. According to the set connected region pixel threshold, if the total number of pixels of any connected region exceeds the square of one stroke width, the font is judged to be a wrong cutting result; otherwise it is judged to be a correct cutting result.
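Steps a to d can be sketched as follows. This is a minimal sketch under stated assumptions: images are binary lists of rows, each component is passed as a hypothetical `(top, left, image)` triple giving its position in the original image, regions are 4-connected, and a difference value greater than 1 (overlapping components) is treated the same as 1.

```python
def region_sizes(matrix):
    """Pixel count of every 4-connected region of 1-entries (connectivity is assumed)."""
    height, width = len(matrix), len(matrix[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for y in range(height):
        for x in range(width):
            if matrix[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            stack, size = [(y, x)], 0
            while stack:
                cy, cx = stack.pop()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and matrix[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes


def reconstruction_is_wrong(original, components, stroke_width):
    """Return True when re-splicing the components deviates too much from the original."""
    height, width = len(original), len(original[0])
    # Steps a and b: accumulate every component into the difference value matrix.
    diff = [[0] * width for _ in range(height)]
    for top, left, image in components:
        for dy, row in enumerate(image):
            for dx, value in enumerate(row):
                diff[top + dy][left + dx] += value
    # Step c: mark extraneous pixels (ink not in the original) and absent pixels.
    absent = [[0] * width for _ in range(height)]
    extraneous = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if original[y][x] == 0 and diff[y][x] > 0:
                extraneous[y][x] = 1
            elif original[y][x] == 1 and diff[y][x] == 0:
                absent[y][x] = 1
    # Step d: any connected region larger than stroke_width**2 marks a wrong cut.
    threshold = stroke_width ** 2
    return any(size > threshold
               for matrix in (absent, extraneous)
               for size in region_sizes(matrix))
```

Thresholding on connected regions rather than on the raw count of differing pixels tolerates scattered single-pixel differences while still catching a lost or stray stroke fragment.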
2. Using the discrimination algorithm based on component classification, classify each component with the trained component classifier. If the predicted class of any component does not match the class to which it should belong, the font is judged to be a wrong cutting result and the algorithm terminates. Otherwise, proceed to 3.
3. Using the discrimination algorithm based on font attributes, if any component fails to satisfy any one of the font attributes, the font is judged to be a wrong cutting result and the algorithm terminates. Otherwise, proceed to 4.
4. Use the discrimination algorithm based on the stroke skeleton. The specific embodiment is as follows:
a. As shown in Fig. 3, the number of strokes in the component image is 5, so for each point on the component contour 5 values are obtained, each representing the nearest distance from the contour point to one stroke skeleton. The calculation is as follows: for one contour point, traverse the 5 stroke skeletons; for each stroke skeleton, traverse all of its skeleton points, compute the distance from the contour point to each skeleton point, and take the minimum of these distances as the nearest distance from the contour point to the stroke skeleton currently being traversed.
b. Take the minimum of the nearest distances from a contour point to all stroke skeletons. The stroke skeleton that yields the minimum is the stroke to which the contour point belongs, and the minimum is the distance between the contour point and that stroke skeleton. Create 5 contour point sets, each initialized to the empty set; the i-th set represents the contour of the i-th stroke. Traverse all contour points and add each contour point to the i-th set if it belongs to the i-th stroke. After the sets are built, remove from each set the contour points whose nearest stroke skeleton point lies in the first 20% or the last 20% of the stroke; the remaining points are the contour points of the middle section of the stroke.
c. For each set, compute the mode of the distances from its contour points to the stroke skeleton they belong to, and take this distance as the average stroke width of that stroke section. If the distance from a contour point to its stroke skeleton exceeds 1 times the average stroke width, the contour point is considered a mutation contour point, such as the abnormal contour portion in Fig. 3. When the number of mutation contour points exceeds 0.8 times the stroke width, the component is judged to be a miscut component.
It should be noted that the embodiments are disclosed to help further understand the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what is disclosed in the embodiments, and the scope of protection claimed by the present invention is defined by the claims.

Claims (10)

1. A method for discriminating the correctness of Chinese character font cutting results, comprising, in order, a discrimination process based on font reconstruction, a discrimination process based on component classification, a discrimination process based on font attributes, and a discrimination process based on the glyph skeleton; specifically comprising:
1) the discrimination process based on font reconstruction:
each Chinese character font is cut to obtain components, which are spliced together again at their original positions; the resulting spliced font is compared with the original font pixel by pixel, and the difference pixel values are counted; a difference pixel value threshold is set, and the font is judged to be a wrong cutting result or a correct cutting result according to the set difference pixel value threshold;
2) the discrimination process based on component classification:
first, a component data set consisting of correct components is constructed, and a component classifier is trained on the component data set; then the trained component classifier is used to classify the font cutting result to be discriminated, yielding the class of the font cutting result to be discriminated as a correct component or an incorrect component;
3) the discrimination process based on font attributes:
the font attributes corresponding to a correct component are set, and when a component does not satisfy the corresponding font attributes, the component is determined to be the result of a font miscut;
4) the discrimination process based on the glyph skeleton:
the contour of the middle section of each stroke in the Chinese character font is tested for smoothness, and when the contour changes abruptly in the middle section of a stroke, the font cutting result is determined to be wrong.
2. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that be based on font described The differentiation process of reconstruction, the differentiation process based on part classification, the differentiation process based on font attribute and based on glyph skeleton During differentiation, when the judgement result that any differentiation process is differentiated for font cutting result to be discriminated is mistake When cutting result, the font cutting result to be discriminated is determined as font miscut.
3. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described to be based on font weight During the differentiation built, the difference pixel value threshold value is set as square of stroke width.
4. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described to be based on font weight The differentiation process built specifically comprises the following steps:
11) according to the size of former font, a difference value matrix equal with former word size, the institute of the difference value matrix are generated There is element to be initialized as 0;
12) traversal cutting obtain each component as a result, according to the position in original image, which is corresponded into difference One and the equal sized region of image of component of different value matrix, then the corresponding region of image of component and difference value matrix is carried out Pixel scale adds up;The cumulative of all components is completed, difference value matrix is obtained;
13) an absent region matrix and an extraneous region matrix equal with former word size are generated respectively;Two matrixes All elements be initialized as 0, while traversing each pixel of former font and the corresponding position pixel of the difference value matrix; Former font pixel value is set in two kinds of situation;The first situation is that former font pixel value is 0, another situation is that former font pixel Value is 1;The extraneous region matrix is corresponded into position when the difference value matrix respective pixel value is 1 for the first situation The value set is set as 1, is otherwise set as 0;For second situation, when the difference value matrix respective pixel is 0, by the missing area The value of domain matrix corresponding position is set as 1, is otherwise set as 0;
14) connected region detection is carried out to the absent region matrix and extraneous region matrix respectively, it is all obtains two matrixes Connected region number of pixels;Connected region number of pixels threshold value is set, when total pixel in any one connected region When number is more than corresponding connected region number of pixels threshold value, which is determined as wrong cutting result, otherwise differentiates and is positive True cutting result.
5. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described based on component point During the differentiation of class, the training of part classification device includes the following steps:
21) image preprocessing performs the following operations:
Image of component is zoomed in and out by the way of non-rigid scaling, each image of component is normalized to a square Image, square side length are denoted as L;
22) Local Subgraphs picture is selected, the extraction of local feature is carried out to local subgraph, obtains multiple local feature vectors, it will The number of local feature vectors is denoted as num_lf;
23) dictionary constructs:
In obtained local feature vectors, stochastical sampling obtains plurality of local feature as total characteristic set;Part The number value range of feature should be greater than 10000;Maximum can be set to whole local feature vectors number num_lf;Using K Means clustering algorithm obtains num_k cluster centre, as sparse dictionary;The value range of num_k is 256 special to all parts The number num_lf of sign;
24) rarefaction representation:
According to sparse dictionary obtained in the previous step, one all local feature of component is compiled using sparse coding algorithm Code;Then all local features are combined using maximum value pond algorithm, obtain the sparse table that a dimension is num_k Show that feature, the quantity of dimension and the quantity of cluster centre are equal;
25) classifier training is trained the rarefaction representation feature using linear SVM algorithm, obtains component point Class device.
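Two of the simpler operations in these steps, the non-rigid scaling of step 21) and the max pooling of step 24), can be illustrated as follows. This is only a sketch: the claim does not specify an interpolation scheme, so nearest-neighbour sampling is an assumption, and the sparse codes passed to `max_pool` are taken as already computed by whatever sparse coding algorithm is used. Both function names are hypothetical.

```python
def resize_nonrigid(image, side):
    """Scale an image to side x side, stretching each axis independently.

    Nearest-neighbour sampling is an assumption; the source only requires that
    the scaling be non-rigid, so the aspect ratio is not preserved.
    """
    height, width = len(image), len(image[0])
    return [[image[y * height // side][x * width // side] for x in range(side)]
            for y in range(side)]


def max_pool(codes):
    """Combine the codes of all local features of one component.

    Each code has num_k entries (one per dictionary atom); the pooled feature
    keeps, for every atom, the largest response over all local features.
    """
    return [max(column) for column in zip(*codes)]
```

Because pooling takes a maximum per dictionary atom, the pooled feature has num_k dimensions regardless of how many local features the component produced, which is what allows a single linear classifier to be trained on components of any size.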
6. the method for discrimination of Chinese character pattern cutting result correctness as claimed in claim 5, characterized in that the step 22) office Portion's feature extraction, specifically performs the following operations:
22a. carries out uniform grid cutting to local subgraph, obtains multiple regions, is set as n*n;It sets a trap the side of portion's subgraph A length of L_sub, the side length for obtaining each region are L_sub divided by n;Local Subgraphs picture is evenly dividing by the n*n region, mutually Without intersection between phase, it is combined the Local Subgraphs picture constituted before cutting just;It is carried out in each region using Sobel operator Convolution obtains the result of amplitude and phase;
Uniform phase is divided into n*n section by 22b., and each interval statistics obtain Local Subgraphs phase as in and fall in the section The summation of the range value of pixel;Obtain the local feature of n*n dimension;
22c. splices the local feature of the respective dimensions in the region n*n, obtains the local feature of multidimensional.
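Operations 22a to 22c describe a gradient orientation histogram computed per grid cell. The sketch below is one possible reading, not the claimed implementation. It assumes a square grey-level sub-image whose side is divisible by n, computes the Sobel responses on the whole sub-image with zero padding rather than cell by cell, applies the kernels as cross-correlation, and builds one n*n-bin histogram per cell before concatenating them. The name `local_feature` is hypothetical.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel(image, y, x):
    """Return (amplitude, phase in [0, 2*pi]) at pixel (y, x), zero-padded borders."""
    gx = gy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < len(image) and 0 <= xx < len(image[0]):
                gx += SOBEL_X[dy + 1][dx + 1] * image[yy][xx]
                gy += SOBEL_Y[dy + 1][dx + 1] * image[yy][xx]
    return math.hypot(gx, gy), math.atan2(gy, gx) % (2 * math.pi)


def local_feature(sub_image, n):
    """Concatenate one amplitude-weighted phase histogram per grid cell."""
    cell = len(sub_image) // n  # 22a: side length of each region, L_sub / n
    bins = n * n                # 22b: number of phase intervals
    feature = []
    for cy in range(n):
        for cx in range(n):
            histogram = [0.0] * bins
            for y in range(cy * cell, (cy + 1) * cell):
                for x in range(cx * cell, (cx + 1) * cell):
                    amplitude, phase = sobel(sub_image, y, x)
                    # Clamp so a phase that rounds up to 2*pi stays in the last bin.
                    index = min(int(phase / (2 * math.pi) * bins), bins - 1)
                    histogram[index] += amplitude
            feature.extend(histogram)  # 22c: concatenate the per-region features
    return feature
```

Under this reading the feature of one local sub-image has n*n regions times n*n bins, that is n to the fourth power entries.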
7. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described based on component point During the differentiation of class, the classification is specifically: being obtained by calculating the font cutting result to be discriminated using classifier Component categories and whether the component categories that should belong to identical differentiates;When the classification results are different from the classification that component should belong to When, which is determined as wrong cutting result, and when the classification results are identical as the classification that component should belong to, which is determined as Correct cutting result.
8. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described to be based on font category During the differentiation of property, the font attribute includes part dimension attribute and component area attribute.
9. the method for discrimination of Chinese character pattern cutting result correctness as described in claim 1, characterized in that described to be based on font bone Specific step is as follows for the differentiation process of frame:
41) stroke number is set in the component that a font cutting result obtains as N, for the point on each component outline, N number of value is obtained, minimum distance of the profile point apart from N number of stroke skeleton is respectively represented;
42) minimum value in minimum distance of the profile point apart from all stroke skeletons is taken, the stroke bone of minimum value will be got The stroke that frame should belong to as profile point, using the minimum value got as profile point and the stroke skeleton for getting minimum value away from From;
43) it constructs profile point set: setting N number of profile point set, be initialized as empty set, i-th of set represents i-th of stroke Profile;All profile points are traversed, when profile point belongs to i-th of stroke, this profile point is added to i-th of set;It is complete After the building of set, by stroke starting point and end end by the profile point in corresponding set remove a segment length, go Except the ratio that the length of stroke accounts for stroke total length is M%, remaining profile point is exactly the profile point in stroke middle section;
44) profile point of each set is acquired to the middle number of affiliated stroke skeleton distance, as being averaged for corresponding stroke skeleton Stroke width;When profile point has been more than K times of average stroke width to affiliated stroke skeleton distance, determine that the profile point is It is mutated profile;When the quantity for being mutated profile point has been more than preset mutation profile point amount threshold, the component is sentenced It Wei not the wrong component cut.
10. the method for discrimination of Chinese character pattern cutting result correctness as claimed in claim 9, characterized in that the step 41) wheel The circular of minimum distance of the exterior feature point apart from N number of stroke skeleton is: for a profile point, N number of stroke skeleton is traversed, Each stroke skeleton traverses the distance that profile point He these stroke skeletal points is calculated in all skeletal points, takes the minimum of distance The minimum distance for the stroke skeleton that value is currently traversed as profile point distance;The value range of the step 43) M is 0 to 50; The value range of the step 44) K is 0.8 to 3;The mutation profile point amount threshold is X times of average stroke width, X's Value range is 0.7 to 3.
CN201610847230.0A 2016-09-23 2016-09-23 The method of discrimination of Chinese character pattern cutting result correctness Active CN106503706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610847230.0A CN106503706B (en) 2016-09-23 2016-09-23 The method of discrimination of Chinese character pattern cutting result correctness

Publications (2)

Publication Number Publication Date
CN106503706A CN106503706A (en) 2017-03-15
CN106503706B true CN106503706B (en) 2019-06-07

Family

ID=58291008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610847230.0A Active CN106503706B (en) 2016-09-23 2016-09-23 The method of discrimination of Chinese character pattern cutting result correctness

Country Status (1)

Country Link
CN (1) CN106503706B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092917B (en) * 2017-03-24 2020-06-02 北京大学 Chinese character stroke automatic extraction method based on manifold learning
CN108154167B (en) * 2017-12-04 2021-08-20 昆明理工大学 Chinese character font similarity calculation method
CN110210476B (en) * 2019-05-24 2021-04-09 北大方正集团有限公司 Character component clustering method, device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819683A (en) * 2009-10-26 2010-09-01 杨光祥 Method for reconstructing Chinese character font
CN102968764A (en) * 2012-10-26 2013-03-13 北京航空航天大学 Chinese character image inpainting method based on strokes
JP2013214188A (en) * 2012-04-02 2013-10-17 Sharp Corp Character recognition processing device, character recognition processing method, character recognition processing program, and computer readable recording medium
CN104182748A (en) * 2014-08-15 2014-12-03 电子科技大学 A method for extracting automatically character strokes based on splitting and matching
CN104992161A (en) * 2015-07-17 2015-10-21 北京航空航天大学 Chinese character part dividing and structure determination method based on part identification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chinese Character Recognition Based on Character Reconstruction; Yun Li et al.; 《2009 International Conference on Communications, Circuits and Systems》; 20090918; pp. 460-463
A Chinese character stroke classification method based on graphic recognition (基于图形识别的汉字笔画分类方法); Zhao Qing et al.; 《计算机技术与发展》 (Computer Technology and Development); 20091031; pp. 14-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant