CN109299726A

CN109299726A - Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding

Info

Publication number: CN109299726A
Application number: CN201810860010.0A
Authority: CN
Inventors: 龙华; 祁俊辉; 邵玉斌; 彭艺
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2019-02-01

Abstract

The invention relates to a Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding, and belongs to the technical field of Chinese information processing. The invention uses features of Chinese characters such as structure, outline, strokes and writing order to build a Chinese character feature vector database and a Chinese character stroke-order code database. For any two Chinese characters, their feature vectors and stroke-order code strings are retrieved; a glyph similarity based on the feature vectors is calculated by a difference algorithm, and a glyph similarity based on the stroke-order codes is calculated by the Jaro-Winkler Distance algorithm. The two similarities reflect the degree of similarity of the characters from different aspects, and they are fused, drawing on the advantages of both algorithms, to obtain the final similarity. Compared with the prior art, the invention mainly addresses the poor accuracy and poor flexibility of existing methods and improves the accuracy of computer-based Chinese character glyph similarity calculation.

Description

Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding

Technical field

The invention relates to a Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding, and belongs to the technical field of Chinese information processing.

Background art

Text is the main tool by which people exchange information, but many Chinese characters are similar in shape and are therefore misrecognized or misidentified. Correctly distinguishing these easily confused similar-looking characters is of great significance for Chinese teaching, Chinese editing and typesetting, machine recognition of Chinese, Chinese broadcasting and similar applications.

At present, algorithms for Chinese character glyph similarity fall mainly into two classes. The first obtains basic information about a character, such as glyph structure, stroke count and stroke order, converts these data into a mathematical expression according to some coding rule, and then applies a specific algorithm to that expression to obtain the glyph similarity. The second uses image processing techniques to extract character features and compares the features that differ. Both classes have drawbacks: the first requires several coefficients to be set to balance the final output, and the second gives poor similarity results for some compound characters.

Summary of the invention

The technical problem to be solved by the invention is to address the limitations and deficiencies of the prior art by providing a Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding, which addresses the poor accuracy and poor flexibility of the prior art and aims to improve the accuracy of computer-based Chinese character glyph similarity calculation.

The technical solution of the invention is a Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding, with the following specific steps:

Step0.1: Extract the image corresponding to each Chinese character from a TTC font file; the image size is l × w (in pixels), N pixels in total. Taking the character image as the input source, generate the character matrix I_l×w corresponding to the character, whose element values are the gray values of the pixels. Define ξ as the gray-value binarization threshold and binarize the matrix as shown in formula (1); then generate the feature vector {x₁,x₂,…,x_N} of the character from the matrix I_l×w in left-to-right, top-to-bottom order. Store all characters and their generated feature vectors in a database to build the Chinese character feature vector database;

Step0.2: According to the five-stroke writing-order rule for Chinese characters, encode the horizontal, vertical, left-falling, right-falling and turning strokes as the letters a, b, c, d and e, and generate the stroke-order code string x₁x₂…x_z of the character, where z is the stroke count of the character and x_i is its i-th stroke, with x_i∈{a,b,c,d,e}, i∈[1,z]. Store all characters and their generated stroke-order code strings in a database to build the Chinese character stroke-order code database;

Step1: Let X and Y be the two Chinese characters whose glyph similarity is to be calculated. Retrieve from the feature vector database the feature vectors X:{x₁,x₂,…,x_N} and Y:{y₁,y₂,…,y_N} of the two characters, and retrieve from the stroke-order code database their stroke-order code strings str_x and str_y;

Step2: Taking the feature vectors X:{x₁,x₂,…,x_N} and Y:{y₁,y₂,…,y_N} as input, obtain the feature-vector-based glyph similarity Sim₁(X,Y) between the characters X and Y by the difference algorithm;

Step2.1: Define z_i=x_i-y_i, i∈[1,N], and generate the difference feature vector {z₁,z₂,…,z_N} of the characters X and Y;

Step2.2: Obtain the feature-vector-based glyph similarity Sim₁(X,Y) between the characters X and Y by the difference calculation formula (2);

Step3: Taking the stroke-order code strings str_x and str_y as input, obtain the stroke-order-code-based glyph similarity Sim₂(X,Y) between the characters X and Y by the Jaro-Winkler Distance algorithm;

Step3.1: Obtain the lengths len_x and len_y of the stroke-order code strings str_x and str_y, and generate the detection matrix I(X,Y) of size len_x×len_y;

Step3.2: Calculate the matching window value MW according to formula (3);

Step3.3: From the detection matrix and the matching window value MW, calculate the number of matching characters m and the number of matching-character transpositions n according to the relevant rules, and calculate the Jaro Distance between the stroke-order code strings str_x and str_y according to formula (4);

Step3.4: Obtain the longest common substring str_xy of the stroke-order code strings str_x and str_y and its length len_xy, then calculate the Jaro-Winkler Distance between str_x and str_y according to formula (5); this value is the stroke-order-code-based glyph similarity Sim₂(X,Y) between the characters X and Y;

Here b_t is the threshold that decides whether the further calculation is needed, and p is the scaling factor;

Step4: Let the weights of the similarities calculated in Step2 and Step3 be α and β respectively, where α and β satisfy α+β=1. From the feature-vector-based glyph similarity Sim₁(X,Y) with weight α and the stroke-order-code-based glyph similarity Sim₂(X,Y) with weight β, calculate the final glyph similarity Sim(X,Y) between the characters X and Y by the similarity fusion algorithm, i.e. formula (6);

Sim (X, Y)=Sim₁(X,Y)·α+Sim₂(X,Y)·β (6)

Further, in Step0.1, the character image size l × w is determined by the font size of the characters extracted from the font file; the element values I(i,j) of the character matrix I_l×w and the gray-value binarization threshold ξ satisfy formula (7).

0≤I(i,j),ξ≤255,i∈[1,l],j∈[1,w] (7)
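The binarization and flattening of Step0.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: formula (1) is not reproduced in this text, so the comparison direction (a pixel becomes 1 when its gray value is at least ξ) is an assumption, and the function name `binarize_and_flatten` is hypothetical.

```python
def binarize_and_flatten(gray, xi):
    """Binarize an l x w grayscale matrix and flatten it into a feature vector.

    ASSUMPTION: a pixel maps to 1 when its gray value is >= xi, else 0.
    Formula (1) is not available in the text, so this direction is a guess.
    """
    # Enforce the range constraint of formula (7) on the threshold and pixels.
    if not 0 <= xi <= 255:
        raise ValueError("threshold xi must lie in [0, 255]")
    for row in gray:
        for value in row:
            if not 0 <= value <= 255:
                raise ValueError("gray values must lie in [0, 255]")
    # Left to right within each row, rows taken from top to bottom.
    return [1 if value >= xi else 0 for row in gray for value in row]


# A 2 x 2 toy image with threshold xi = 1 yields N = 4 binary features.
print(binarize_and_flatten([[0, 12], [255, 0]], 1))
```

For a 64 × 64 image the same function would return a vector of N = 4096 entries.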

Further, the lengths len_x and len_y of the stroke-order code strings str_x and str_y in Step3.1 should satisfy formula (8).

len_x,len_y∈N⁺ (8)
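The stroke-order coding of Step0.2, together with the positive-length requirement of formula (8), might look like the sketch below. The English stroke names used as dictionary keys and the function name `encode_strokes` are illustrative assumptions; only the letters a to e and their order come from the text.

```python
# Hypothetical English labels for the five stroke classes, in the order
# the text assigns them to the letters a, b, c, d, e.
STROKE_CODES = {
    "horizontal": "a",
    "vertical": "b",
    "left-falling": "c",
    "right-falling": "d",
    "turning": "e",
}


def encode_strokes(strokes):
    """Turn a list of stroke classes, in writing order, into a code string."""
    if not strokes:
        # Formula (8): the code string length must be a positive integer.
        raise ValueError("a character must have at least one stroke")
    return "".join(STROKE_CODES[s] for s in strokes)


# Five strokes written in order give a five-letter code string.
print(encode_strokes(
    ["left-falling", "horizontal", "horizontal", "horizontal", "turning"]
))
```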

Further, for the calculation of the number of matching characters m in Step3.3: if the distance between the positions of identical characters in the stroke-order code strings str_x and str_y is less than the matching window value MW, the characters are regarded as matching. Note that during matching, characters that have already been matched must be excluded, and once a match is found the current search must stop and matching proceeds to the next character. For the number of matching-character transpositions n, check whether the matched characters appear in the same order in str_x and str_y; if not, n is half the number of out-of-order characters. In addition, the number of matching characters m and the number of transpositions n should satisfy formula (9).

Further, the threshold b_t for further calculation in Step3.4 is usually set to 0.7 and may be adjusted slightly according to actual test results, mainly to improve detection accuracy. The scaling factor p is usually set to 0.1 and may likewise be adjusted slightly according to actual test results, mainly to prevent the final result from exceeding 1. However, because this method adds the reciprocal of the greatest length among the code strings str_x and str_y to improve the calculation formula, the value of the scaling factor p has little influence on the final result.

Further, the feature-vector-based glyph similarity Sim₁(X,Y) obtained in Step2, the stroke-order-code-based glyph similarity Sim₂(X,Y) obtained in Step3 and the final glyph similarity Sim(X,Y) obtained in Step4 should satisfy formula (10); that is, each of Sim₁(X,Y), Sim₂(X,Y) and Sim(X,Y) reflects the degree of similarity between two Chinese characters as a value in [0,1], and a larger value indicates a higher degree of similarity.

0≤Sim₁(X,Y),Sim₂(X,Y),Sim(X,Y)≤1 (10)

The beneficial effects of the invention are that it addresses the poor accuracy and poor flexibility of the prior art and improves the accuracy of computer-based Chinese character glyph similarity calculation.

Brief description of the drawings

Fig. 1 is a flow diagram of the invention;

Fig. 2 is a flow diagram of how the invention builds the databases;

Fig. 3 is a schematic diagram of the Microsoft YaHei Chinese character images generated by the invention.

Specific embodiments

The invention is further described below with reference to the drawings and specific embodiments.

Embodiment 1: As shown in Figs. 1-3, a Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding uses features of Chinese characters such as structure, outline, strokes and writing order to build a Chinese character feature vector database and a Chinese character stroke-order code database. For any two Chinese characters, their feature vectors and stroke-order code strings are retrieved; a glyph similarity based on the feature vectors is calculated by the difference algorithm, and a glyph similarity based on the stroke-order codes is calculated by the Jaro-Winkler Distance algorithm. The two similarities reflect the degree of similarity of the characters from different aspects, and they are fused, drawing on the advantages of both algorithms, to obtain the final similarity.

The algorithm specifically includes the following steps:

Specifically: using the Microsoft YaHei TTC font as the input source, the extracted character images are 64 × 64 pixels, N=4096 pixels in total, and the gray-value binarization threshold is taken as ξ=1;

Specifically: let character X be 钢 (steel) and character Y be 铟 (indium), and retrieve from the feature vector database the feature vectors corresponding to the two characters, namely

X={0,0,0,…,1,0,0,…,1,1,0,…,0,0,0}

Y={0,0,0,…,0,1,0,…,1,0,1,…,0,0,0}

In addition, retrieve from the stroke-order code database the stroke-order code strings corresponding to the two characters: str_x=caaaebecd, str_y=caaaebeacda;
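The difference algorithm of Step2 can be illustrated with the sketch below. Formula (2) is not reproduced in this text, so the normalization used here (one minus the mean absolute difference) is an assumption chosen only because it keeps the result in [0,1] for binary vectors; the function name `difference_similarity` is hypothetical.

```python
def difference_similarity(x, y):
    """Similarity of two equal-length binary feature vectors.

    ASSUMPTION: Sim1 = 1 - (sum of |z_i|) / N with z_i = x_i - y_i.
    The patent's formula (2) is not available, so this form is a guess.
    """
    if len(x) != len(y) or not x:
        raise ValueError("feature vectors must be non-empty and equally long")
    # Step2.1: difference feature vector z_i = x_i - y_i.
    z = [xi - yi for xi, yi in zip(x, y)]
    # Step2.2 (assumed form): fraction of positions where the vectors agree.
    return 1.0 - sum(abs(zi) for zi in z) / len(z)


# Two of the four pixels differ, so the assumed similarity is 0.5.
print(difference_similarity([0, 1, 1, 0], [0, 1, 0, 1]))
```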

Specifically:

Step3.2: Calculate the matching window value MW according to formula (3);

Specifically:

Dis_j=0.9394
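The value above can be reproduced with the textbook Jaro distance, sketched below. Formulas (3) and (4) are not reproduced in this text, so the window MW = ⌊max(len_x, len_y)/2⌋ − 1 and the inclusive window comparison follow the standard definition and are assumptions about the patent's exact formulas; the function name `jaro` is illustrative.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity of two strings (assumed to match formula (4))."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Assumed form of formula (3): the standard Jaro matching window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Skip characters already matched; stop at the first match found.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: half the matched characters that are out of order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    n = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - n) / m) / 3


print(round(jaro("caaaebecd", "caaaebeacda"), 4))  # 0.9394
```

For the two code strings of this embodiment the sketch finds m = 9 matching characters and n = 0 transpositions, giving (9/9 + 9/11 + 9/9)/3 ≈ 0.9394.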

Specifically: take b_t=0.7 and p=0.1; the longest common substring is then str_xy=caaaebe, with length len_xy=7;

Sim₂(X, Y)=Dis_jw=0.9779
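The longest common substring used in Step3.4 can be found with a standard dynamic-programming scan, sketched below. Formula (5) is not reproduced in this text; the second function is only one candidate form, Dis_j + (len_xy / max(len_x, len_y))·(1 − Dis_j), chosen because it comes close to the 0.9779 reported here. It is an assumption, it leaves out the scaling factor p, and how p enters the patent's formula cannot be recovered from this text. Both function names are hypothetical.

```python
def longest_common_substring(s1, s2):
    """Return the longest contiguous substring shared by s1 and s2."""
    best_len, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the common run ending at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s1[best_end - best_len:best_end]


def jaro_winkler_variant(dis_j, len_xy, len_x, len_y, b_t=0.7):
    """ASSUMED stand-in for formula (5); it does not use the factor p."""
    if dis_j < b_t:
        # Below the threshold b_t no further calculation is applied.
        return dis_j
    return dis_j + (len_xy / max(len_x, len_y)) * (1.0 - dis_j)


str_x, str_y = "caaaebecd", "caaaebeacda"
str_xy = longest_common_substring(str_x, str_y)
print(str_xy, len(str_xy))
print(jaro_winkler_variant(0.9394, len(str_xy), len(str_x), len(str_y)))
```

With Dis_j = 0.9394 and len_xy = 7 this candidate gives roughly 0.978, which agrees with the reported Sim₂ up to rounding in the fourth decimal place.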

Sim (X, Y)=Sim₁(X,Y)·α+Sim₂(X,Y)·β (6)

Taking the weights α=0.5 and β=0.5, the final similarity after fusion is as follows:

The results above show that the final glyph similarity calculated for the characters 钢 (steel) and 铟 (indium) is 0.9188. Compared with the similarity obtained using the feature vector alone (0.8596), the result is less coarse and more reasonable; compared with the similarity obtained using the stroke-order code alone (0.9779), it is less exaggerated and agrees better with human visual judgement.

In addition, the weights α and β corresponding to the similarities Sim₁(X,Y) and Sim₂(X,Y) should be given reasonable values according to the actual situation and adjusted as appropriate after repeated testing.

The embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.
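The fusion of formula (6) is a plain weighted sum, and a short sketch makes it easy to try different weights. The function name `fuse` and the choice to derive β from α (so that α+β=1 always holds) are illustrative assumptions; the input values are the ones reported in this embodiment.

```python
def fuse(sim1, sim2, alpha=0.5):
    """Formula (6): Sim = Sim1 * alpha + Sim2 * beta, with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return sim1 * alpha + sim2 * beta


# Equal weights applied to the two similarities of this embodiment.
print(fuse(0.8596, 0.9779, alpha=0.5))

# Shifting weight toward the feature-vector similarity lowers the result.
for alpha in (0.3, 0.5, 0.7):
    print(alpha, round(fuse(0.8596, 0.9779, alpha), 4))
```

With α=β=0.5 the sum is about 0.91875, which the text reports as 0.9188.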

Claims

1. A Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding, characterized by:

Step0.1: Extract the image corresponding to each Chinese character from a TTC font file; the image size is l × w, in pixels, N pixels in total. Taking the character image as the input source, generate the character matrix I_l×w corresponding to the character, whose element values are the gray values of the pixels. Define ξ as the gray-value binarization threshold and binarize the matrix as shown in formula (1); then generate the feature vector {x₁,x₂,…,x_N} of the character from the matrix I_l×w in left-to-right, top-to-bottom order. Store all characters and their generated feature vectors in a database to build the Chinese character feature vector database;

Step0.2: According to the five-stroke writing-order rule for Chinese characters, encode the horizontal, vertical, left-falling, right-falling and turning strokes as the letters a, b, c, d and e, and generate the stroke-order code string x₁x₂…x_z of the character, where z is the stroke count of the character and x_i is its i-th stroke, with x_i∈{a,b,c,d,e}, i∈[1,z]. Store all characters and their generated stroke-order code strings in a database to build the Chinese character stroke-order code database;

Step1: Let X and Y be the two Chinese characters whose glyph similarity is to be calculated. Retrieve from the feature vector database the feature vectors X:{x₁,x₂,…,x_N} and Y:{y₁,y₂,…,y_N} of the two characters, and retrieve from the stroke-order code database their stroke-order code strings str_x and str_y;

Step2: Taking the feature vectors X:{x₁,x₂,…,x_N} and Y:{y₁,y₂,…,y_N} as input, obtain the feature-vector-based glyph similarity Sim₁(X,Y) between the characters X and Y by the difference algorithm;

Step2.2: Obtain the feature-vector-based glyph similarity Sim₁(X,Y) between the characters X and Y by the difference calculation formula (2);

Step3: Taking the stroke-order code strings str_x and str_y as input, obtain the stroke-order-code-based glyph similarity Sim₂(X,Y) between the characters X and Y by the Jaro-Winkler Distance algorithm;

Step3.1: Obtain the lengths len_x and len_y of the stroke-order code strings str_x and str_y, and generate the detection matrix I(X,Y) of size len_x×len_y;

Step3.2: Calculate the matching window value MW according to formula (3);

Step3.3: From the detection matrix and the matching window value MW, calculate the number of matching characters m and the number of matching-character transpositions n according to the relevant rules, and calculate the Jaro Distance between the stroke-order code strings str_x and str_y according to formula (4);

Step3.4: Obtain the longest common substring str_xy of the stroke-order code strings str_x and str_y and its length len_xy, then calculate the Jaro-Winkler Distance between str_x and str_y according to formula (5); this value is the stroke-order-code-based glyph similarity Sim₂(X,Y) between the characters X and Y;

Step4: Let the weights of the similarities calculated in Step2 and Step3 be α and β respectively, where α and β satisfy α+β=1. From the feature-vector-based glyph similarity Sim₁(X,Y) with weight α and the stroke-order-code-based glyph similarity Sim₂(X,Y) with weight β, calculate the final glyph similarity Sim(X,Y) between the characters X and Y by the similarity fusion algorithm, i.e. formula (6);

Sim (X, Y)=Sim₁(X,Y)·α+Sim₂(X,Y)·β (6).

2. The Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding according to claim 1, characterized in that: in Step0.1, the character image size l × w is determined by the font size of the characters extracted from the font file; the element values I(i,j) of the character matrix I_l×w and the gray-value binarization threshold ξ satisfy formula (7);

0≤I(i,j),ξ≤255,i∈[1,l],j∈[1,w] (7).

3. The Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding according to claim 1, characterized in that: the lengths len_x and len_y of the stroke-order code strings str_x and str_y in Step3.1 should satisfy formula (8):

len_x,len_y∈N⁺ (8).

4. The Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding according to claim 1, characterized in that: for the calculation of the number of matching characters m in Step3.3, if the distance between the positions of identical characters in the stroke-order code strings str_x and str_y is less than the matching window value MW, the characters are regarded as matching, and the number of matching characters m and the number of matching-character transpositions n should satisfy formula (9):

5. The Chinese character glyph similarity algorithm based on feature vectors and stroke-order coding according to claim 1, characterized in that: the feature-vector-based glyph similarity Sim₁(X,Y) obtained in Step2, the stroke-order-code-based glyph similarity Sim₂(X,Y) obtained in Step3 and the final glyph similarity Sim(X,Y) obtained in Step4 should satisfy formula (10); that is, each of Sim₁(X,Y), Sim₂(X,Y) and Sim(X,Y) reflects the degree of similarity between two Chinese characters as a value in [0,1], and a larger value indicates a higher degree of similarity;

0≤Sim₁(X,Y),Sim₂(X,Y),Sim(X,Y)≤1 (10).