CN110717479A - Tibetan computer font diversity expression method based on k-means clustering - Google Patents

Tibetan computer font diversity expression method based on k-means clustering

Info

Publication number
CN110717479A
Authority
CN
China
Prior art keywords
font
discrete
cluster
tibetan
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911014677.XA
Other languages
Chinese (zh)
Inventor
车文刚
苗晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201911014677.XA
Publication of CN110717479A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20 Drawing from basic elements, e.g. lines or circles
    • G06T11/203 Drawing of straight lines or curves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/293 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of characters other than Kanji, Hiragana or Katakana

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

The invention discloses a Tibetan computer font diversity expression method based on k-means clustering, belonging to the field of Chinese information processing. The method comprises the following steps: 1) establishing a font model consisting of one basic font plus n discrete fonts; 2) preprocessing a Tibetan text image; 3) performing k-means clustering on the pixel matrices of the character images to form a basic font and n discrete fonts; 4) deriving from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts; 5) determining a replacement algorithm; 6) obtaining a replacement font from the basic font, the discrete fonts, and the replacement algorithm; and 7) replacing the font of a scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity. By combining the k-means clustering algorithm with font diversity expression, the method yields Tibetan manuscripts with font diversity and realizes diverse expression of Tibetan computer fonts.

Description

Tibetan computer font diversity expression method based on k-means clustering
Technical Field
The invention relates to the field of Chinese information processing, in particular to a Tibetan computer font diversity expression method based on k-means clustering.
Background
Historical Tibetan documents have traditionally been preserved and circulated as woodblock carvings and rubbings, which carry distinctive handwritten and engraved letterforms. The central government has organized the collection and protection of historical Tibetan literature several times, but the state of research on and use of this literature is still not encouraging. Protection currently stops largely at the storage stage: documents are kept as scanned images or vectorized graphics, or are processed only to improve storage quality. This approach preserves the diversity of the original hand-engraved letterforms as far as possible, but it cannot provide the basic functions of computer text information processing.
Considerable progress has been made in Tibetan computer information processing, and TTF font libraries have been created for Tibetan. A typical font is Microsoft Himalaya, which is well suited to information dissemination, is highly readable, and supports the basic functions of Tibetan computer information processing. However, it shows neither the diversity of the letterforms engraved in the original historical Tibetan documents nor the aesthetic quality of the engraved style, and so does not reflect the engraving art of that ancient civilization.
In research on generating handwritten Chinese fonts, some studies build a personal handwritten Chinese font library from a small sample set, or generate fonts in a user's handwriting style by font style transfer. The generated characters carry the writer's style, but every instance of a given Chinese character has the same glyph. The result has no font diversity and therefore reflects neither the aesthetic quality of the writer's hand nor the diversity of handwritten letterforms.
Given this state of research, a method that reproduces the diverse letterforms of classical Tibetan literature as faithfully as possible is of great practical significance. The font diversity expression method can also be extended to handwritten Chinese characters, changing the current situation in which every instance of a Chinese character is identical in a generated user handwriting font.
Disclosure of Invention
The invention provides a Tibetan computer font diversity expression method based on k-means clustering. It addresses the problem that current computer processing of Tibetan cannot both provide the basic functions of computer text information processing and preserve the diversity of the engraved letterforms of the original edition.
The technical scheme of the invention is as follows: a Tibetan computer font diversity expression method based on k-means clustering, comprising the following specific steps:
S1, establishing a font model consisting of one basic font plus n discrete fonts;
S2, preprocessing the Tibetan text image;
S3, performing k-means clustering on the pixel matrices of the character images to form a basic font and n discrete fonts;
S4, deriving from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts;
S5, determining a replacement algorithm;
S6, obtaining a replacement font from the basic font, the discrete fonts, and the replacement algorithm;
S7, replacing the font of a scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
The specific steps of step S1 are as follows:
S1.1, establishing a font model consisting of one basic font plus n discrete fonts.
The specific steps of step S2 are as follows:
S2.1, binarizing the Tibetan text image;
S2.2, segmenting the Tibetan text image using the projection method and the connected-domain method;
S2.3, recognizing the character in each character image;
S2.4, determining whether the character in the image is a combined character; if so, decomposing it using the graphic font structure decomposition method; if not, skipping this step;
S2.5, grouping the character images that contain the same character into one class of sample, giving R samples Y_1, Y_2, ..., Y_R.
The specific steps of step S3 are as follows:
S3.1, performing k-means clustering on the pixel matrices of the character images of one sample Y_1;
S3.1.1, selecting k initial centers C_1 = (C_{11}, C_{12}, ..., C_{1k}) from the pixel matrices P_{1t} of the character images of sample Y_1, where k = n + 1;
S3.1.2, counting the number N_1 of character images in sample Y_1;
S3.1.3.1, calculating the distances d = (d_{11}, d_{12}, ..., d_{1k}) between P_{1t} and the centers C_1 according to the following formula (1):
Let M_1 and M_2 be two pixel matrices. The distance between them is
Figure BDA0002245311920000031
S3.1.3.2, finding the minimum d_{1g} among d_{11}, d_{12}, ..., d_{1k}, then finding the center C_{1g} corresponding to d_{1g}, and assigning P_{1t} to the cluster centered on C_{1g};
S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until every pixel matrix has been assigned to a cluster;
S3.1.4, recalculating and updating the center of each cluster according to the following formula:
Let the pixel matrices M_1, M_2, ..., M_i in the cluster whose center is to be updated be, respectively,
Figure BDA0002245311920000032
from which the updated cluster center is computed;
S3.1.5, repeating S3.1.3.1, S3.1.3.2, S3.1.3.3, and S3.1.4 until the center of each cluster no longer changes;
S3.1.6, calculating the distance D between each pixel matrix in a cluster and the cluster center using formula (1), and taking the pixel matrix with the minimum D in each cluster as the new center of that cluster, which gives the new cluster centers C_1' = (C_{11}', C_{12}', ..., C_{1k}');
S3.1.7, counting the numbers T_{11}, T_{12}, ..., T_{1k} of pixel matrices in the k clusters, taking T_{1s} = max(T_{11}, T_{12}, ..., T_{1k}), and finding the center C_{1s}' corresponding to T_{1s};
S3.1.8, taking the glyph corresponding to C_{1s}' as the basic glyph of sample Y_1; the glyphs corresponding to the centers of the remaining clusters are the discrete glyphs of sample Y_1 and are randomly assigned to discrete glyph 0, discrete glyph 1, ..., discrete glyph (k-2);
S3.2, repeating S3.1 for Y_1, Y_2, ..., Y_R, that is, performing k-means clustering on the pixel matrices of their character images to obtain the basic glyph and discrete glyphs 0 to (k-2) of each of Y_1, Y_2, ..., Y_R;
S3.3, importing the basic glyphs of samples Y_1, Y_2, ..., Y_R into the FontCreator software to obtain the basic font; importing discrete glyph 0 of samples Y_1, Y_2, ..., Y_R into FontCreator to obtain discrete font 0; and so on, until importing discrete glyph (k-2) of samples Y_1, Y_2, ..., Y_R into FontCreator yields discrete font (k-2).
The specific steps of step S4 are as follows:
S4.1, calculating, for each sample Y_j (1 ≤ j ≤ R), the ratio of the basic glyph to all glyphs:
Figure BDA0002245311920000041
S4.2, calculating the ratio of the basic font among all fonts:
Figure BDA0002245311920000042
S4.3, the occurrence frequency of each discrete font is determined by the numbers T_{j1}, T_{j2}, ..., T_{j(k-1)} of character pixel matrices in the clusters obtained by clustering. In general the number of pixel matrices differs from cluster to cluster, so the distribution model for the occurrence frequency of the discrete fonts is assumed to be the normal distribution φ_{μ,σ}(x).
The specific steps of step S5 are as follows:
S5.1, the determined replacement algorithm is as follows:
S5.1.1, counting the total number q of characters in the scripture text;
S5.1.2, for (v = 1; v ≤ q; v++):
S5.1.2.1, randomly generating an integer r_1 in the range 0 to 999;
S5.1.2.2, if r_1 < 1000c' (c' being the basic-font ratio obtained in S4.2), using the basic font for the v-th character; otherwise, using a normal-distribution random number function together with a rounding function to generate a random integer r_2 in the range 0 to 9, and using discrete font r_2 for the v-th character.
The specific steps of step S6 are as follows:
S6.1, obtaining the replacement font from the basic font, the discrete fonts, and the replacement algorithm.
The specific steps of step S7 are as follows:
S7.1, replacing the font of the scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
The invention has the following beneficial effects:
1. The method applies a clustering algorithm to font diversity expression; it can effectively obtain fonts that represent an engraver's carving style and conveys the aesthetic quality of the carving styles of Tibetan engravers.
2. The method effectively addresses the problem that current computer processing of Tibetan cannot both provide the basic functions of computer text information processing and preserve the diversity of the engraved letterforms of the original edition.
3. The method effectively addresses the problem that every instance of the same Tibetan character has an identical glyph when a handwritten or engraved typeface is generated.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
FIG. 2 is a diagram of a font model according to an embodiment of the present invention.
FIG. 3 is a flow chart of preprocessing Tibetan text images according to an embodiment of the present invention.
FIG. 4 is a process diagram of the graphic font structure decomposition method according to an embodiment of the present invention.
FIG. 5 is a flow chart of the k-means clustering algorithm for sample Y_1 according to an embodiment of the present invention.
FIG. 6 is a flow chart of the replacement algorithm according to an embodiment of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
FIG. 1 shows the flow chart of the method of the present invention:
S1, establishing a font model consisting of one basic font plus n discrete fonts;
S2, preprocessing the Tibetan text image;
S3, performing k-means clustering on the pixel matrices of the character images to form the basic and discrete fonts;
S4, deriving from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts;
S5, determining a replacement algorithm;
S6, obtaining a replacement font from the basic font, the discrete fonts, and the replacement algorithm;
S7, replacing the font of a scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
As a further aspect of the present invention, the specific steps of step S1 are as follows:
S1.1, establishing the font model of one basic font plus n discrete fonts shown in FIG. 2.
In Tibetan woodblock texts, the carving style of a given engraver is stable most of the time, but internal or external factors such as mood or the state of the hand muscles cause some carved characters to deviate from that style. Accordingly, the characters carved by one engraver are divided into two categories, as shown in FIG. 2: characters within the carving style, called basic-font characters, and characters outside the carving style, called discrete-font characters. Because one engraver can produce many different discrete forms, the letterforms of actually carved Tibetan characters can be regarded approximately as the sum of one basic font and several discrete fonts.
As a further aspect of the present invention, the specific steps of step S2 are as follows:
S2.1, binarizing the Tibetan text image;
S2.2, segmenting the Tibetan text image using the projection method and the connected-domain method;
S2.3, recognizing the character in each character image;
S2.4, determining whether the character in the image is a combined character; if so, decomposing it using the graphic font structure decomposition method; if not, skipping this step;
S2.5, grouping the character images that contain the same character into one class of sample, giving R samples Y_1, Y_2, ..., Y_R.
As shown in FIG. 3, the engraved manuscript is first scanned to obtain a Tibetan text image. To simplify subsequent operations, the image is binarized: in the grayscale pixel matrix of the image, pixels above a threshold T are set to white (255) and pixels below T are set to black (0). Because the Tibetan text image is red and white, the threshold T is obtained automatically.
Second, the Tibetan text image is roughly cut into lines using the line projection method: the horizontal projection of the text image's pixel matrix is computed, and the lowest points of the resulting curve are used as cut points for splitting the image into lines. Because line segmentation alone cannot completely separate adjacent upper and lower lines of Tibetan text, after the rough line cut the Tibetan characters are segmented with a connected-domain algorithm, which finds each independent closed region in the image.
Third, it is determined whether each segmented character image is a combined-character image; if so, the combined character is decomposed using the graphic font structure decomposition method, and if not, no decomposition is needed. The process of the graphic font structure decomposition method is shown in FIG. 4.
Finally, character images that contain the same character are grouped into one class of sample.
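As an illustration of the first two preprocessing operations, the following Python sketch binarizes a grayscale page and cuts it into text lines by horizontal projection. It is a minimal sketch under stated assumptions, not the patented implementation: the text does not say how the threshold T is obtained automatically, so Otsu's method is assumed here; pixels equal to T are treated as black; and rows with no ink, rather than local minima of the projection curve, serve as cut points. The names `otsu_threshold`, `binarize`, `split_lines`, and `min_ink` are hypothetical.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Pick a threshold by Otsu's method (an assumed choice for 'automatic' T)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                # number of pixels with value <= t
    w1 = hist.sum() - w0                # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)     # intensity mass of the lower class
    valid = (w0 > 0) & (w1 > 0)
    mu0 = sum0[valid] / w0[valid]
    mu1 = (sum0[-1] - sum0[valid]) / w1[valid]
    between = np.zeros(256)
    between[valid] = w0[valid] * w1[valid] * (mu0 - mu1) ** 2
    return int(np.argmax(between))


def binarize(gray: np.ndarray, threshold=None) -> np.ndarray:
    """Set pixels above the threshold to white (255) and the rest to black (0)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def split_lines(binary: np.ndarray, min_ink: int = 0) -> list:
    """Return (start_row, end_row) ranges whose horizontal projection exceeds min_ink."""
    ink_per_row = (binary == 0).sum(axis=1)   # horizontal projection of black pixels
    lines, start = [], None
    for y, has_text in enumerate(ink_per_row > min_ink):
        if has_text and start is None:
            start = y                          # a text line begins
        elif not has_text and start is not None:
            lines.append((start, y))           # a gap row ends the line
            start = None
    if start is not None:
        lines.append((start, len(ink_per_row)))
    return lines
```

Connected-domain segmentation and combined-character decomposition would then run on each returned row range; they are not sketched here.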
As a further aspect of the present invention, the specific steps of step S3 are as follows:
S3.1, performing k-means clustering on the pixel matrices of the character images of sample Y_1; the process is shown in FIG. 5;
S3.1.1, selecting k initial centers C_1 = (C_{11}, C_{12}, ..., C_{1k}) from the pixel matrices P_{1t} of the character images of sample Y_1, where k = n + 1;
S3.1.2, counting the number N_1 of character images in sample Y_1;
S3.1.3.1, calculating the distances d = (d_{11}, d_{12}, ..., d_{1k}) between P_{1t} and the centers C_1 according to the following formula (1):
Let the two pixel matrices M_1 and M_2 be, respectively,
Figure BDA0002245311920000071
The distance between them is
Figure BDA0002245311920000072
S3.1.3.2, finding the minimum d_{1g} among d_{11}, d_{12}, ..., d_{1k}, then finding the center C_{1g} corresponding to d_{1g}, and assigning P_{1t} to the cluster centered on C_{1g};
S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until every pixel matrix has been assigned to a cluster;
S3.1.4, recalculating and updating the center of each cluster according to the following formula:
Let the pixel matrices M_1, M_2, ..., M_i in the cluster whose center is to be updated be, respectively,
Figure BDA0002245311920000073
from which the updated cluster center is computed;
S3.1.5, repeating S3.1.3.1, S3.1.3.2, S3.1.3.3, and S3.1.4 until the center of each cluster no longer changes;
S3.1.6, calculating the distance D between each pixel matrix in a cluster and the cluster center using formula (1), and taking the pixel matrix with the minimum D in each cluster as the new center of that cluster, which gives the new cluster centers C_1' = (C_{11}', C_{12}', ..., C_{1k}');
S3.1.7, counting the numbers T_{11}, T_{12}, ..., T_{1k} of pixel matrices in the k clusters, taking T_{1s} = max(T_{11}, T_{12}, ..., T_{1k}), and finding the center C_{1s}' corresponding to T_{1s};
S3.1.8, taking the glyph corresponding to C_{1s}' as the basic glyph of sample Y_1; the glyphs corresponding to the centers of the remaining clusters are the discrete glyphs of sample Y_1 and are randomly assigned to discrete glyph 0, discrete glyph 1, ..., discrete glyph (k-2);
S3.2, repeating S3.1 for Y_1, Y_2, ..., Y_R, that is, performing k-means clustering on the pixel matrices of their character images to obtain the basic glyph and discrete glyphs 0 to (k-2) of each of Y_1, Y_2, ..., Y_R;
S3.3, importing the basic glyphs of samples Y_1, Y_2, ..., Y_R into the FontCreator software to obtain the basic font; importing discrete glyph 0 of samples Y_1, Y_2, ..., Y_R into FontCreator to obtain discrete font 0; and so on, until importing discrete glyph (k-2) of samples Y_1, Y_2, ..., Y_R into FontCreator yields discrete font (k-2).
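Steps S3.1.1 to S3.1.8 for a single sample can be sketched in Python roughly as follows. This is an illustrative sketch, not the patent's implementation. Formula (1) survives only as an image, so the Euclidean distance between flattened pixel matrices is assumed; the initial centers are drawn at random from the sample; empty clusters keep their previous center; and discrete glyphs are numbered in cluster order rather than assigned randomly. The function name `cluster_glyphs` and its parameters are hypothetical.

```python
import numpy as np


def cluster_glyphs(images, k, seed=0, max_iter=100):
    """Cluster same-size images of one character.

    Returns (base, discrete, sizes): the index of the image representing the
    largest cluster, the representative indices of the other non-empty
    clusters, and the number of images per cluster (T_11 ... T_1k).
    """
    x = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]        # S3.1.1

    def distances(c):
        # Assumed metric: Euclidean distance between flattened pixel matrices.
        return np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)

    for _ in range(max_iter):
        labels = distances(centers).argmin(axis=1)                # S3.1.3
        updated = np.array([
            x[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
            for g in range(k)
        ])                                                        # S3.1.4
        if np.allclose(updated, centers):                         # S3.1.5
            break
        centers = updated

    dist = distances(centers)
    labels = dist.argmin(axis=1)
    sizes = np.bincount(labels, minlength=k)                      # S3.1.7
    # S3.1.6: the member closest to each center represents its cluster.
    reps = {
        g: int(np.flatnonzero(labels == g)[dist[labels == g, g].argmin()])
        for g in range(k) if sizes[g] > 0
    }
    base_cluster = int(sizes.argmax())
    discrete = [reps[g] for g in reps if g != base_cluster]       # S3.1.8
    return reps[base_cluster], discrete, sizes
```

Running this once per sample Y_1, ..., Y_R corresponds to step S3.2. For one sample, `sizes.max() / sizes.sum()` gives the share of images in the largest cluster, which is one possible reading of the basic-glyph ratio used in step S4.1.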
As a further aspect of the present invention, the specific steps of step S4 are as follows:
S4.1, calculating, for each sample Y_j (1 ≤ j ≤ R), the ratio of the basic glyph to all glyphs:
Figure BDA0002245311920000081
S4.2, calculating the ratio of the basic font among all fonts:
Figure BDA0002245311920000082
S4.3, the occurrence frequency of each discrete font is determined by the numbers T_{j1}, T_{j2}, ..., T_{j(k-1)} of character pixel matrices in the clusters obtained by clustering. In general the number of pixel matrices differs from cluster to cluster, so the distribution model for the occurrence frequency of the discrete fonts is assumed to be the normal distribution φ_{μ,σ}(x).
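The ratios of S4.1 and S4.2 are given only as formula images, so the following Python sketch shows one plausible reading rather than the patent's exact definition. It assumes that the per-sample ratio is the size of the largest cluster divided by the number of images in the sample, and that the overall ratio pools all samples; an average of the per-sample ratios would be an equally plausible alternative. The function name `base_ratios` is hypothetical.

```python
def base_ratios(cluster_sizes):
    """cluster_sizes holds one list of k cluster sizes per sample Y_j.

    Returns (per_sample, overall), where each per-sample ratio is the largest
    cluster size over the sample size (assumed reading of S4.1) and overall
    pools every sample (assumed reading of S4.2).
    """
    per_sample = [max(sizes) / sum(sizes) for sizes in cluster_sizes]
    overall = (sum(max(sizes) for sizes in cluster_sizes)
               / sum(sum(sizes) for sizes in cluster_sizes))
    return per_sample, overall
```

For example, two samples with cluster sizes `[6, 2, 2]` and `[9, 1, 0]` have 15 of 20 images in their largest clusters, so the pooled ratio is 0.75.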
For a normal distribution, if X ~ N(μ, σ^2), then in particular:
P(μ-σ<X≤μ+σ)≈0.6827
P(μ-2σ<X≤μ+2σ)≈0.9545
P(μ-3σ<X≤μ+3σ)≈0.9973
A normally distributed variable almost always takes values within the interval (μ-3σ, μ+3σ). The probability of a value outside this interval is only about 0.0027, an event generally considered almost impossible in a single trial. In practical applications, a random variable X obeying N(μ, σ^2) is therefore treated as taking values only in (μ-3σ, μ+3σ); this is known as the 3σ principle.
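The three probabilities can be checked numerically. For X ~ N(μ, σ^2), the probability of falling within m standard deviations of the mean is erf(m/√2), independent of μ and σ. The short Python check below uses only the standard library; the helper name `prob_within` is hypothetical. Textbook tables sometimes list slightly different last digits because of intermediate rounding.

```python
import math


def prob_within(m: float) -> float:
    """P(mu - m*sigma < X <= mu + m*sigma) for a normally distributed X."""
    return math.erf(m / math.sqrt(2.0))


for m in (1, 2, 3):
    print(m, round(prob_within(m), 4))   # prints 0.6827, 0.9545, 0.9973
```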
According to the 3σ principle of the normal distribution, the mean μ and standard deviation σ of the normal distribution should satisfy
Figure BDA0002245311920000083
Taking
Figure BDA0002245311920000084
the normal distribution is
Figure BDA0002245311920000085
As a further aspect of the present invention, the specific steps of step S5 are as follows:
S5.1, the determined replacement algorithm is shown in FIG. 6:
S5.1.1, counting the total number q of characters in the scripture text;
S5.1.2, for (v = 1; v ≤ q; v++):
S5.1.2.1, randomly generating an integer r_1 in the range 0 to 999;
S5.1.2.2, if r_1 < 1000c' (c' being the basic-font ratio obtained in S4.2), using the basic font for the v-th character; otherwise, using a normal-distribution random number function together with a rounding function to generate a random integer r_2 in the range 0 to 9, and using discrete font r_2 for the v-th character.
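The replacement loop of S5.1 can be sketched in Python as follows. This is a hedged illustration only. The patent's values for μ and σ appear in formula images that are not reproduced in the text, so `mu=4.5` and `sigma=1.5` are placeholder assumptions chosen so that (μ-3σ, μ+3σ) spans 0 to 9; clamping the rare out-of-range draw is likewise an assumption. The function name `choose_glyphs` and its parameters are hypothetical.

```python
import random


def choose_glyphs(q, base_ratio, n_discrete=10, mu=4.5, sigma=1.5, seed=None):
    """Return one font choice per character: "base" or a discrete font index."""
    rng = random.Random(seed)
    choices = []
    for _ in range(q):                               # S5.1.2: loop over q characters
        r1 = rng.randint(0, 999)                     # S5.1.2.1: integer in 0..999
        if r1 < 1000 * base_ratio:                   # S5.1.2.2: basic font
            choices.append("base")
        else:
            r2 = round(rng.gauss(mu, sigma))         # normal draw, then rounding
            r2 = min(max(r2, 0), n_discrete - 1)     # assumed clamp into 0..n-1
            choices.append(r2)
    return choices
```

A call such as `choose_glyphs(q=1000, base_ratio=0.75, seed=42)` would mark roughly three quarters of the characters as basic and spread the rest over the discrete fonts, with the middle indices chosen most often.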
As a further aspect of the present invention, the specific steps of step S6 are as follows:
S6.1, obtaining the replacement font from the basic font, the discrete fonts, and the replacement algorithm.
As a further aspect of the present invention, the specific steps of step S7 are as follows:
S7.1, replacing the font of the scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (8)

1. A Tibetan computer font diversity expression method based on k-means clustering, characterized by comprising the following specific steps:
S1, establishing a font model consisting of one basic font plus n discrete fonts;
S2, preprocessing the Tibetan text image;
S3, performing k-means clustering on the pixel matrices of the character images to form a basic font and n discrete fonts;
S4, deriving from the data the proportion of the basic font and a distribution model for the occurrence frequency of the discrete fonts;
S5, determining a replacement algorithm;
S6, obtaining a replacement font from the basic font, the discrete fonts, and the replacement algorithm;
S7, replacing the font of a scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
2. The method as claimed in claim 1, wherein the step S1 comprises the following steps:
S1.1, establishing a font model consisting of one basic font plus n discrete fonts.
3. The method as claimed in claim 1, wherein the step S2 comprises the following steps:
S2.1, binarizing the Tibetan text image;
S2.2, segmenting the Tibetan text image using the projection method and the connected-domain method;
S2.3, recognizing the character in each character image;
S2.4, determining whether the character in the image is a combined character; if so, decomposing it using the graphic font structure decomposition method; if not, skipping this step;
S2.5, grouping the character images that contain the same character into one class of sample, giving R samples Y_1, Y_2, ..., Y_R.
4. The method as claimed in claim 1, wherein the step S3 comprises the following steps:
S3.1, performing k-means clustering on the pixel matrices of the character images of one sample Y_1;
S3.1.1, selecting k initial centers C_1 = (C_{11}, C_{12}, ..., C_{1k}) from the pixel matrices P_{1t} of the character images of sample Y_1, where k = n + 1;
S3.1.2, counting the number N_1 of character images in sample Y_1;
S3.1.3.1, calculating the distances d = (d_{11}, d_{12}, ..., d_{1k}) between P_{1t} and the centers C_1 according to the following formula (1):
Let the two pixel matrices M_1 and M_2 be, respectively,
Figure FDA0002245311910000021
The distance between them is given by formula (1);
S3.1.3.2, finding the minimum d_{1g} among d_{11}, d_{12}, ..., d_{1k}, then finding the center C_{1g} corresponding to d_{1g}, and assigning P_{1t} to the cluster centered on C_{1g};
S3.1.3.3, repeating S3.1.3.1 and S3.1.3.2 until every pixel matrix has been assigned to a cluster;
S3.1.4, recalculating and updating the center of each cluster according to the following formula:
Let the pixel matrices M_1, M_2, ..., M_i in the cluster whose center is to be updated be, respectively,
Figure FDA0002245311910000023
The updated cluster center is
Figure FDA0002245311910000024
S3.1.5, repeating S3.1.3.1, S3.1.3.2, S3.1.3.3, and S3.1.4 until the center of each cluster no longer changes;
S3.1.6, calculating the distance D between each pixel matrix in a cluster and the cluster center using formula (1), and taking the pixel matrix with the minimum D in each cluster as the new center of that cluster, which gives the new cluster centers C_1' = (C_{11}', C_{12}', ..., C_{1k}');
S3.1.7, counting the numbers T_{11}, T_{12}, ..., T_{1k} of pixel matrices in the k clusters, taking T_{1s} = max(T_{11}, T_{12}, ..., T_{1k}), and finding the center C_{1s}' corresponding to T_{1s};
S3.1.8, taking the glyph corresponding to C_{1s}' as the basic glyph of sample Y_1; the glyphs corresponding to the centers of the remaining clusters are the discrete glyphs of sample Y_1 and are randomly assigned to discrete glyph 0, discrete glyph 1, ..., discrete glyph (k-2);
S3.2, repeating S3.1 for Y_1, Y_2, ..., Y_R, that is, performing k-means clustering on the pixel matrices of their character images to obtain the basic glyph and discrete glyphs 0 to (k-2) of each of Y_1, Y_2, ..., Y_R;
S3.3, importing the basic glyphs of samples Y_1, Y_2, ..., Y_R into the FontCreator software to obtain the basic font; importing discrete glyph 0 of samples Y_1, Y_2, ..., Y_R into FontCreator to obtain discrete font 0; and so on, until importing discrete glyph (k-2) of samples Y_1, Y_2, ..., Y_R into FontCreator yields discrete font (k-2).
5. The method as claimed in claim 1, wherein the step S4 comprises the following steps:
S4.1, calculating, for each sample Y_j (1 ≤ j ≤ R), the ratio of the basic glyph to all glyphs:
Figure FDA0002245311910000031
S4.2, calculating the ratio of the basic font among all fonts:
Figure FDA0002245311910000032
S4.3, the occurrence frequency of each discrete font is determined by the numbers T_{j1}, T_{j2}, ..., T_{j(k-1)} of character pixel matrices in the clusters obtained by clustering. In general the number of pixel matrices differs from cluster to cluster, so the distribution model for the occurrence frequency of the discrete fonts is assumed to be the normal distribution φ_{μ,σ}(x).
6. The method as claimed in claim 1, wherein the step S5 comprises the following steps:
S5.1, the determined replacement algorithm is as follows:
S5.1.1, counting the total number q of characters in the scripture text;
S5.1.2, for (v = 1; v ≤ q; v++):
S5.1.2.1, randomly generating an integer r_1 in the range 0 to 999;
S5.1.2.2, if r_1 < 1000c', using the basic font for the v-th character; otherwise, using a normal-distribution random number function together with a rounding function to generate a random integer r_2 in the range 0 to 9, and using discrete font r_2 for the v-th character.
7. The method as claimed in claim 1, wherein the step S6 comprises the following steps:
S6.1, obtaining the replacement font from the basic font, the discrete fonts, and the replacement algorithm.
8. The method as claimed in claim 1, wherein the step S7 comprises the following steps:
S7.1, replacing the font of the scripture typeset in Microsoft Himalaya with the replacement font, thereby obtaining a Tibetan scripture with font diversity.
CN201911014677.XA 2019-10-24 2019-10-24 Tibetan computer font diversity expression method based on k-means clustering Pending CN110717479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911014677.XA CN110717479A (en) 2019-10-24 2019-10-24 Tibetan computer font diversity expression method based on k-means clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911014677.XA CN110717479A (en) 2019-10-24 2019-10-24 Tibetan computer font diversity expression method based on k-means clustering

Publications (1)

Publication Number Publication Date
CN110717479A true CN110717479A (en) 2020-01-21

Family

ID=69213234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911014677.XA Pending CN110717479A (en) 2019-10-24 2019-10-24 Tibetan computer font diversity expression method based on k-means clustering

Country Status (1)

Country Link
CN (1) CN110717479A (en)

Similar Documents

Publication Publication Date Title
Miton et al. Graphic complexity in writing systems
EP0434930B1 (en) Editing text in an image
CN1497438B (en) Device and method of font generation
JPH06348904A (en) System and method for recognition of handwritten character
JPH08305803A (en) Operating method of learning machine of character template set
JPS61502495A (en) Cryptographic analysis device
JPH0798765A (en) Direction-detecting method and image analyzer
KR20060051590A (en) Simplifying complex character to maintain legibility
US7046847B2 (en) Document processing method, system and medium
KR20090024127A (en) Combiner for improving handwriting recognition
CN115545009B (en) Data processing system for acquiring target text
US20140267302A1 (en) Method and apparatus for personalized handwriting avatar
JP7493937B2 (en) Method, program and system for identifying a sequence of headings in a document
Sharma et al. Primitive feature-based optical character recognition of the Devanagari script
US20210319246A1 (en) Online training data generation for optical character recognition
CN107092902B (en) Character string recognition method and system
CN1497525B (en) Font attribute setting device and font generation method
CN110717479A (en) Tibetan computer font diversity expression method based on k-means clustering
Sodhar et al. Romanized Sindhi rules for text communication
CN111274763A (en) Tibetan carving font diversity expression method based on normal distribution
Vuori et al. Influence of erroneous learning samples on adaptation in on-line handwriting recognition
Kim et al. Digitalizing scheme of handwritten Hanja historical documents
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
Suchenwirth et al. Optical recognition of Chinese characters
Rubinov et al. Classes and clusters in data analysis

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200121
