CN110717479A

CN110717479A - Tibetan computer font diversity expression method based on k-means clustering

Info

Publication number: CN110717479A
Application number: CN201911014677.XA
Authority: CN
Inventors: 车文刚; 苗晗
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-10-24
Filing date: 2019-10-24
Publication date: 2020-01-21

Abstract

The invention discloses a Tibetan computer font diversity expression method based on k-means clustering, and belongs to the field of Chinese information processing. The method comprises the following steps: 1) establishing a font model consisting of one basic font plus n discrete fonts; 2) preprocessing a Tibetan text image; 3) performing k-means clustering on the pixel matrices of the character images to form a basic font and n discrete fonts; 4) obtaining from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts; 5) determining a replacement algorithm; 6) obtaining a replacement font from the basic font, the discrete fonts and the replacement algorithm; and 7) replacing the font of the Microsoft Himalaya version of a scripture with the replacement font, thereby obtaining a Tibetan scripture with font diversity. By combining the k-means clustering algorithm with font diversity expression, the method yields Tibetan scripture text with font diversity and realizes the diversity expression of Tibetan computer fonts.

Description

Tibetan computer font diversity expression method based on k-means clustering

Technical Field

The invention relates to the field of Chinese information processing, in particular to a Tibetan computer font diversity expression method based on k-means clustering.

Background

Historical Tibetan documents have traditionally been preserved and circulated as woodblock carvings and rubbings, which carry distinctive handwritten and engraved fonts. The central government has organized the collection and protection of historical Tibetan literature on several occasions, but the state of its research and development is still not encouraging. Protection currently stops largely at the storage stage: documents are simply kept as scanned images or vectorized graphics, or effort goes into improving storage quality. This approach preserves the diversity of the original hand-engraved fonts as far as possible, but it cannot provide the basic functions of computer text information processing.

Considerable progress has been made in research on Tibetan computer information processing, and TTF font libraries have been created for Tibetan. A typical font is Microsoft Himalaya, which is highly readable, disseminates information well and supports the basic functions of Tibetan computer information processing. It does not, however, show the diversity of the engraved fonts in original historical Tibetan literature, nor does it convey the aesthetics of the engraved style or the engraving art of an ancient civilization.

In the generation of handwritten Chinese fonts, some studies build a personal handwritten Chinese font library from a small sample set, or generate fonts in a user's handwriting style by font style transfer. The generated characters carry the writer's style, but every instance of a given Chinese character has the same glyph. There is no font diversity, so neither the aesthetic quality of the writer's hand nor the variety of handwritten glyphs is reflected.

Given this state of research, a method that recovers as much as possible of the font diversity found in classical Tibetan literature has great practical significance. The font diversity expression method can also be extended to Chinese handwriting, changing the current situation in which every instance of a Chinese character has the same glyph when a user's handwriting font is generated.

Disclosure of Invention

The invention provides a Tibetan computer font diversity expression method based on k-means clustering, which addresses the problem that current computer processing of Tibetan cannot both realize the basic functions of computer text information processing and retain the engraved font diversity of the original edition.

The technical scheme of the invention is as follows: a Tibetan computer font diversity expression method based on k-means clustering comprises the following specific steps:

S1, establishing a font model of one basic font + n discrete fonts;

S2, preprocessing the Tibetan text image;

S3, performing k-means clustering on the pixel matrices of the character images to form a basic font and n discrete fonts;

S4, obtaining from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts;

S5, determining a replacement algorithm;

S6, obtaining a replacement font from the basic font, the discrete fonts and the replacement algorithm;

S7, replacing the font of the Microsoft Himalaya version of the scripture with the replacement font, thereby obtaining a Tibetan scripture with font diversity.

The specific steps of step S1 are as follows:

S1.1, establishing a font model of one basic font + n discrete fonts.

The specific steps of step S2 are as follows:

S2.1, binarizing the Tibetan text image;

S2.2, segmenting the Tibetan text image by using a projection method and a connected-domain method;

S2.3, identifying the characters in the character images;

S2.4, judging whether the character in an image is a combined character; if so, decomposing the combined character by using a graphic font structure decomposition method, and if not, skipping this step;

S2.5, classifying the character images that show the same character into one sample class Y, giving R sample classes $Y_1, Y_2, \ldots, Y_R$.

The specific steps of step S3 are as follows:

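S3.1, performing k-means clustering on the pixel matrices of the character images of one sample class $Y_1$;

S3.1.1, selecting k initial centers $C_1 = (C_{11}, C_{12}, \ldots, C_{1k})$ from the pixel matrices $P_{1t}$ of the character images in sample class $Y_1$, where $k = n + 1$;

S3.1.2, counting the number of samples $N_1$ in sample class $Y_1$;

S3.1.3.1, calculating the distances $d = (d_{11}, d_{12}, \ldots, d_{1k})$ between $P_{1t}$ and the centers $C_1$ according to the following formula:

Let the two pixel matrices $M_1$ and $M_2$ be, respectively, [matrix definitions not preserved in the source text].

The distance between them is given by formula (1) [formula not preserved in the source text].

S3.1.3.2, finding the minimum $d_{1g}$ among $d_{11}, d_{12}, \ldots, d_{1k}$, then finding the center $C_{1g}$ corresponding to $d_{1g}$, and assigning $P_{1t}$ to the cluster centered at $C_{1g}$;

S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until all pixel matrices have been assigned to clusters;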

S3.1.4, recalculating and updating the center of each cluster according to the following formula:

Let the pixel matrices in the cluster whose center is to be updated be $M_1, M_2, \ldots, M_i$, respectively [matrix definitions not preserved in the source text].

The updated cluster center is [formula not preserved in the source text].

S3.1.5, repeating S3.1.3.1, S3.1.3.2, S3.1.3.3 and S3.1.4 until the center of each cluster no longer changes;

S3.1.6, calculating, by formula (1), the distance D between each pixel matrix in a cluster and the cluster center, taking in each cluster the pixel matrix with the minimum D as the new center of that cluster, and thereby obtaining the new cluster centers $C_1' = (C_{11}', C_{12}', \ldots, C_{1k}')$;

S3.1.7, counting the numbers of pixel matrices $T_{11}, T_{12}, \ldots, T_{1k}$ in the k clusters, taking $T_{1s} = \max(T_{11}, T_{12}, \ldots, T_{1k})$, and finding the center $C_{1s}'$ corresponding to $T_{1s}$;

S3.1.8, taking the glyph corresponding to $C_{1s}'$ as the basic glyph of sample class $Y_1$; the glyphs corresponding to the centers of the remaining clusters are the discrete glyphs of sample class $Y_1$ and are randomly assigned to discrete glyph 0, discrete glyph 1, ..., discrete glyph (k-2);

S3.2, repeating S3.1 to perform k-means clustering on the pixel matrices of the character images of $Y_1, Y_2, \ldots, Y_R$, obtaining the basic glyph and discrete glyphs 0 to (k-2) of each of $Y_1, Y_2, \ldots, Y_R$;

S3.3, importing the basic glyphs of samples $Y_1, Y_2, \ldots, Y_R$ into the FontCreator software to obtain the basic font; importing discrete glyph 0 of samples $Y_1, Y_2, \ldots, Y_R$ into FontCreator to obtain discrete font 0; and so on, until discrete glyph (k-2) of samples $Y_1, Y_2, \ldots, Y_R$ is imported into FontCreator to obtain discrete font (k-2).
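Steps S3.1.1 to S3.1.8 can be sketched for a single sample class as follows. This is a minimal illustration, not the patented implementation: because formula (1) and the center-update formula are not preserved in this text, the sketch assumes the Euclidean distance between pixel matrices and the element-wise mean as the cluster center, which are the usual k-means choices. The function names `distance`, `mean_matrix` and `cluster_glyphs` are hypothetical.

```python
import math
import random

def distance(m1, m2):
    """ASSUMPTION: Euclidean distance between two equal-size pixel matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))

def mean_matrix(matrices):
    """ASSUMPTION: the cluster center is the element-wise mean of its members."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]

def cluster_glyphs(matrices, k, seed=0, max_iter=100):
    """Return (basic_index, discrete_indices) into `matrices` for one sample class."""
    rng = random.Random(seed)
    centers = rng.sample(matrices, k)                       # S3.1.1
    clusters = []
    for _ in range(max_iter):
        # S3.1.3: assign every pixel matrix to its nearest center
        clusters = [[] for _ in range(k)]
        for idx, m in enumerate(matrices):
            g = min(range(k), key=lambda c: distance(m, centers[c]))
            clusters[g].append(idx)
        # S3.1.4: update the centers (an empty cluster keeps its old center)
        new_centers = [mean_matrix([matrices[i] for i in members]) if members
                       else centers[c]
                       for c, members in enumerate(clusters)]
        if new_centers == centers:                          # S3.1.5
            break
        centers = new_centers
    # S3.1.6: replace each center by the closest real pixel matrix in its cluster
    medoids = [min(members, key=lambda i: distance(matrices[i], centers[c]))
               for c, members in enumerate(clusters) if members]
    sizes = [len(members) for members in clusters if members]
    # S3.1.7 and S3.1.8: the largest cluster gives the basic glyph
    s = max(range(len(sizes)), key=sizes.__getitem__)
    discrete = [m for c, m in enumerate(medoids) if c != s]
    rng.shuffle(discrete)                                   # random numbering 0..k-2
    return medoids[s], discrete
```

The function returns indices rather than matrices so that the chosen glyph images can be looked up and exported afterwards, for example for import into FontCreator.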

The specific steps of step S4 are as follows:

S4.1, calculating, for each sample class $Y_j$ ($1 \le j \le R$), the ratio of basic fonts to all fonts [formula not preserved in the source text];

S4.2, calculating the ratio of basic fonts to all fonts over all samples [formula not preserved in the source text];

S4.3, determining the occurrence frequency of each discrete font from the numbers of character pixel matrices $T_{j1}, T_{j2}, \ldots, T_{j(k-1)}$ in the clusters obtained by clustering; in general the number of pixel matrices differs from cluster to cluster, so the distribution model of the occurrence frequency of the discrete fonts is assumed to be the normal distribution $\varphi_{\mu,\sigma}(x)$.
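The ratio formulas of S4.1 and S4.2 are not preserved in this text, so the following is only one plausible reading. It assumes that the per-sample ratio is the size of the largest cluster divided by the number of samples in that class, and that the overall ratio pools these counts over all R sample classes. The function name `basic_font_ratios` is hypothetical.

```python
def basic_font_ratios(cluster_sizes_per_sample):
    """cluster_sizes_per_sample[j] lists the k cluster sizes of sample class Y_j.

    ASSUMPTION: the basic font of Y_j is its largest cluster, so its ratio is
    max(sizes) / sum(sizes); the overall ratio pools counts over all classes.
    """
    per_sample = [max(sizes) / sum(sizes) for sizes in cluster_sizes_per_sample]
    total_basic = sum(max(sizes) for sizes in cluster_sizes_per_sample)
    total_all = sum(sum(sizes) for sizes in cluster_sizes_per_sample)
    return per_sample, total_basic / total_all


per_sample, overall = basic_font_ratios([[8, 1, 1], [6, 2, 2]])
print(per_sample, overall)
```

Averaging the per-sample ratios instead of pooling the counts would be an equally defensible choice; the two agree when all sample classes have the same size.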

The specific steps of step S5 are as follows:

s5.1, the determined replacement algorithm is as follows:

S5.1.1, counting the total number q of characters in the text of a scripture;

S5.1.2, for (v = 1; v ≤ q; v++)

S5.1.2.1, randomly generating an integer $r_1$ in the range 0 to 999;

S5.1.2.2, if $r_1 < 1000c'$, using the basic font for the v-th character; otherwise, using a normally distributed random number generator together with a rounding function to generate a random integer $r_2$ in the range 0 to 9, and using discrete font $r_2$ for the v-th character.
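The loop of S5.1.2 can be sketched as below. The parameter `basic_ratio` plays the role of $c'$. The defaults `mu=4.5` and `sigma=1.5` are assumptions, chosen so that the interval $(\mu - 3\sigma, \mu + 3\sigma)$ coincides with the range 0 to 9 of $r_2$; the clamping of rare out-of-range draws is likewise an assumption, and the function name `choose_fonts` is hypothetical.

```python
import random

def choose_fonts(q, basic_ratio, mu=4.5, sigma=1.5, seed=None):
    """Return q entries: "basic" or a discrete font index in 0..9."""
    rng = random.Random(seed)
    choices = []
    for _ in range(q):                            # S5.1.2
        r1 = rng.randint(0, 999)                  # S5.1.2.1: integer in 0..999
        if r1 < 1000 * basic_ratio:               # S5.1.2.2: basic font branch
            choices.append("basic")
        else:
            r2 = round(rng.gauss(mu, sigma))      # normal draw, then rounding
            r2 = min(9, max(0, r2))               # ASSUMPTION: clamp to 0..9
            choices.append(r2)
    return choices


print(choose_fonts(q=20, basic_ratio=0.7, seed=1))
```

With this scheme the discrete fonts numbered near the middle of the range are drawn most often, which matches the assumption that their occurrence frequency follows a normal distribution.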

The specific steps of step S6 are as follows:

S6.1, obtaining the replacement font from the basic font, the discrete fonts and the replacement algorithm.

The specific steps of step S7 are as follows:

S7.1, replacing the font of the Microsoft Himalaya version of the scripture with the replacement font, thereby obtaining a Tibetan scripture with font diversity.

The invention has the beneficial effects that:

1. The Tibetan computer font diversity expression method based on k-means clustering applies a clustering algorithm to font diversity expression, so that fonts representing the engraving style of an engraver can be obtained effectively and the aesthetics of the Tibetan engraver's style can be shown.

2. The Tibetan computer font diversity expression method based on k-means clustering effectively addresses the problem that current computer processing of Tibetan cannot both realize the basic functions of computer text information processing and retain the engraved font diversity of the original edition.

3. The Tibetan computer font diversity expression method based on k-means clustering effectively addresses the problem that every instance of the same Tibetan character has the same glyph when a handwritten or engraved style is generated.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

FIG. 2 is a diagram of a font model according to an embodiment of the present invention.

FIG. 3 is a flow chart of preprocessing Tibetan text images according to an embodiment of the present invention.

FIG. 4 is a process diagram of the graphic font structure decomposition method according to an embodiment of the present invention.

FIG. 5 is a flow chart of the k-means clustering algorithm for sample $Y_1$ according to an embodiment of the present invention.

FIG. 6 is a flow chart of the replacement algorithm according to an embodiment of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

FIG. 1 is a flow chart of the method of the present invention:

S1, establishing a font model of one basic font + n discrete fonts;

S2, preprocessing the Tibetan text image;

S3, performing k-means clustering on the pixel matrices of the character images to form basic and discrete fonts;

S4, obtaining from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts;

S5, determining a replacement algorithm;

As a further aspect of the present invention, the specific steps of step S1 are as follows:

S1.1, establishing the font model of one basic font + n discrete fonts shown in FIG. 2.

In the text of a Tibetan engraved block, the engraving style of a given engraver is fixed in most cases during actual engraving. When the engraver is affected by internal or external factors, such as mood or the condition of the hand muscles, the engraved characters deviate from that style. Accordingly, the characters engraved by one engraver are divided into two classes, as shown in FIG. 2: characters within the engraving style, namely basic-font characters, and characters outside the engraving style, namely discrete-font characters. Because one engraver can produce many different discrete fonts, the font of the Tibetan characters actually engraved can be regarded approximately as the sum of one basic font and several discrete fonts.
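One way to hold the "basic font + n discrete fonts" model in memory is a small record per character. The class and field names below are hypothetical and only illustrate the structure; the glyph payloads are left as opaque objects (for instance pixel matrices or image file names).

```python
from dataclasses import dataclass, field

@dataclass
class GlyphModel:
    """Font model for one character: one basic glyph plus n discrete glyphs."""
    character: str
    basic: object                                   # glyph within the engraving style
    discrete: list = field(default_factory=list)    # glyphs outside the style

    @property
    def n(self):
        """Number of discrete glyphs."""
        return len(self.discrete)

    def all_glyphs(self):
        """All k = n + 1 glyph variants, basic glyph first."""
        return [self.basic, *self.discrete]


model = GlyphModel("ཀ", "basic.png", ["discrete0.png", "discrete1.png"])
print(model.n, len(model.all_glyphs()))
```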

As a further aspect of the present invention, the specific steps of step S2 are as follows:

S2.1, binarizing the Tibetan text image;

S2.3, identifying the characters in the character images;

As shown in FIG. 3, the engraved manuscript is first scanned to obtain a Tibetan text image. To simplify subsequent operations, the image is binarized: in the gray-scale pixel matrix of the image, pixels greater than a threshold T are set to white (255) and pixels less than T are set to black (0). Because the Tibetan text image is red and white, the threshold T is obtained automatically.
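The binarization step can be sketched as follows. The text does not say how the threshold T is obtained automatically; this sketch assumes Otsu's method, a common automatic choice, and treats pixels exactly equal to T as black. The gray-scale image is a list of rows of integers in 0..255, and both function names are hypothetical.

```python
def otsu_threshold(gray):
    """ASSUMPTION: pick T by Otsu's method (maximum between-class variance)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0       # number of pixels with value <= t
    sum0 = 0     # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(gray, threshold=None):
    """Set pixels above the threshold to white (255) and the rest to black (0)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[255 if value > threshold else 0 for value in row] for row in gray]


print(binarize([[10, 12, 200], [11, 210, 205]]))
```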

Secondly, the Tibetan text image is roughly cut into lines by the line projection method: the horizontal projection of the text image pixel matrix is computed, and the lowest points of the resulting curve are taken as cutting points for splitting the image into lines. Because line segmentation alone cannot completely separate adjacent lines of Tibetan text, the Tibetan characters are then segmented with a connected-domain algorithm, which finds each independent closed region in the image.
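The rough line cut can be illustrated with a horizontal projection over a binarized image (0 for ink, 255 for background). As a simplifying assumption, this sketch cuts wherever the projection drops to a fixed valley level (zero ink by default) instead of searching for local minima of the curve, and it does not cover the subsequent connected-domain step. The function names are hypothetical.

```python
def horizontal_projection(binary):
    """Number of black (0) pixels in each row of a binarized image."""
    return [sum(1 for value in row if value == 0) for row in binary]

def rough_line_cuts(binary, valley=0):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    ASSUMPTION: a row belongs to a text line when its projection exceeds
    `valley`; rows at or below that level are treated as cutting points.
    """
    profile = horizontal_projection(binary)
    lines = []
    start = None
    for row, count in enumerate(profile):
        if count > valley and start is None:
            start = row                       # a text line begins
        elif count <= valley and start is not None:
            lines.append((start, row))        # a text line ends
            start = None
    if start is not None:
        lines.append((start, len(profile)))   # line runs to the image bottom
    return lines


W, B = 255, 0
page = [[W, W, W, W],
        [W, B, B, W],
        [B, B, W, W],
        [W, W, W, W],
        [W, B, W, B],
        [W, W, W, W]]
print(rough_line_cuts(page))
```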

Thirdly, each segmented character image is checked to determine whether it is a combined-character image. If so, the combined character is decomposed by the graphic font structure decomposition method; if not, no such operation is needed. The process of the graphic font structure decomposition method is shown in FIG. 4.

Finally, the character images that show the same character are classified into one sample class.

As a further aspect of the present invention, the specific steps of step S3 are as follows:

S3.1, performing k-means clustering on the pixel matrices of the character images of sample class $Y_1$; the process is shown in FIG. 5;

S3.1.2, counting the number of samples $N_1$ in sample class $Y_1$;

Let the two pixel matrices $M_1$ and $M_2$ be, respectively, [matrix definitions not preserved in the source text].

The distance between them is given by formula (1) [formula not preserved in the source text].

S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until all pixel matrices have been assigned to clusters;

The updated cluster center is [formula not preserved in the source text].

S3.1.6, calculating, by formula (1), the distance D between each pixel matrix in a cluster and the cluster center, taking in each cluster the pixel matrix with the minimum D as the new center of that cluster, and thereby obtaining the new cluster centers $C_1' = (C_{11}', C_{12}', \ldots, C_{1k}')$;

As a further aspect of the present invention, the specific steps of step S4 are as follows:

S4.2, calculating the ratio of basic fonts to all fonts over all samples [formula not preserved in the source text];

S4.3, determining the occurrence frequency of each discrete font from the numbers of character pixel matrices $T_{j1}, T_{j2}, \ldots, T_{j(k-1)}$ in the clusters obtained by clustering; in general the number of pixel matrices differs from cluster to cluster, so the distribution model of the occurrence frequency of the discrete fonts is assumed to be the normal distribution $\varphi_{\mu,\sigma}(x)$.

For a normal distribution, if $X \sim N(\mu, \sigma^2)$, then in particular:

$P(\mu - \sigma < X \le \mu + \sigma) \approx 0.6827$

$P(\mu - 2\sigma < X \le \mu + 2\sigma) \approx 0.9545$

$P(\mu - 3\sigma < X \le \mu + 3\sigma) \approx 0.9973$

A normally distributed population therefore takes almost all of its values within the interval $(\mu - 3\sigma, \mu + 3\sigma)$. The probability of a value outside this interval is only about 0.0027, an event generally regarded as practically impossible in a single trial. In practical applications, a random variable X obeying $N(\mu, \sigma^2)$ is therefore usually treated as taking values only within $(\mu - 3\sigma, \mu + 3\sigma)$; this is known as the 3σ principle.
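The three interval probabilities can be checked numerically, since for any normal distribution $P(\mu - k\sigma < X \le \mu + k\sigma) = \operatorname{erf}(k/\sqrt{2})$. The short check below uses only the Python standard library; values derived from four-digit printed tables of the normal distribution can differ from these in the last digit. The function name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """P(mu - k*sigma < X <= mu + k*sigma) for X ~ N(mu, sigma^2)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```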

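According to the 3σ principle of the normal distribution, the mean μ and the standard deviation σ of the normal distribution should satisfy [condition not preserved in the source text].

Taking [values not preserved in the source text],

the normal distribution is then [expression not preserved in the source text].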

As a further aspect of the present invention, the specific steps of step S5 are as follows:

S5.1, the determined replacement algorithm is shown in FIG. 6:

S5.1.1, counting the total number q of characters in the text of a scripture;

S5.1.2, for (v = 1; v ≤ q; v++)

S5.1.2.1, randomly generating an integer $r_1$ in the range 0 to 999;

As a further aspect of the present invention, the specific steps of step S6 are as follows:

As a further aspect of the present invention, the specific steps of step S7 are as follows:

While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A Tibetan computer font diversity expression method based on k-means clustering, characterized by comprising the following specific steps:

S1, establishing a font model of one basic font + n discrete fonts;

S2, preprocessing the Tibetan text image;

S5, determining a replacement algorithm;

2. The method as claimed in claim 1, wherein the step S1 comprises the following steps:

S1.1, establishing a font model of one basic font + n discrete fonts.

3. The method as claimed in claim 1, wherein the step S2 comprises the following steps:

S2.1, binarizing the Tibetan text image;

S2.3, identifying the characters in the character images;

4. The method as claimed in claim 1, wherein the step S3 comprises the following steps:

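S3.1.2, counting the number of samples $N_1$ in sample class $Y_1$;

Let the two pixel matrices $M_1$ and $M_2$ be, respectively, [matrix definitions not preserved in the source text].

The distance between them is given by formula (1) [formula not preserved in the source text].

S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until all pixel matrices have been assigned to clusters;

The updated cluster center is [formula not preserved in the source text].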

5. The method as claimed in claim 1, wherein the step S4 comprises the following steps:

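S4.1, calculating, for each sample class $Y_j$ ($1 \le j \le R$), the ratio of basic fonts to all fonts [formula not preserved in the source text];

S4.2, calculating the ratio of basic fonts to all fonts over all samples [formula not preserved in the source text];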

6. The method as claimed in claim 1, wherein the step S5 comprises the following steps:

S5.1, the determined replacement algorithm is as follows:

S5.1.1, counting the total number q of characters in the text of a scripture;

S5.1.2, for (v = 1; v ≤ q; v++)

S5.1.2.1, randomly generating an integer $r_1$ in the range 0 to 999;

7. The method as claimed in claim 1, wherein the step S6 comprises the following steps:

8. The method as claimed in claim 1, wherein the step S7 comprises the following steps: