CN106250357A - Method for sorting Tibetan characters in dictionary order based on Tibetan character component identification technology - Google Patents
- Publication number
- CN106250357A (CN201610662188.5A)
- Authority
- CN
- China
- Prior art keywords
- character
- empty
- characters
- tibetan language
- vowel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
- G06F40/129—Handling non-Latin characters, e.g. kana-to-kanji conversion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification, and relates to the field of Tibetan information processing. The method consists of two steps: Tibetan character component identification and Tibetan character sorting. Component identification is the prerequisite for sorting: only after the components of a Tibetan character have been correctly identified can the characters be sorted. The beneficial effects of the method are: 1. Correct identification of Tibetan character structure. Because all modern Tibetan words fall into 48 structures, identifying the structure of a Tibetan character by the proposed method can reach an accuracy of 100%. 2. A dictionary-order sorting method for Tibetan characters. Once the components are identified, Tibetan characters can be sorted by the proposed method, and the result conforms to Tibetan dictionary order. The method can be widely applied to sorting Tibetan data on computers, arranging Tibetan dictionaries, and similar tasks.
Description
Technical field
The present invention relates to the field of Tibetan information processing, and in particular to a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification.
Background art
Abbreviations and key term definitions
Tibetan word: the Tibetan character string that represents one Tibetan syllable is called a Tibetan word for short.
Length of a Tibetan word: the number of characters that make up the word. A Tibetan word consists of one to seven characters: at most seven, at least one.
Root letter (base letter): the indispensable component of a Tibetan word. With the root letter as the core, it combines with the prefix, superscript, suffix, subscript, vowel, second suffix (and second subscript) to form a Tibetan word.
Prefix: [letters not reproduced] These five letters can be placed before the root letter as a prefix to form a combination.
Superscript: [letters not reproduced] These three letters can be placed above the root letter to form a combination.
Subscript: [letters not reproduced] These three letters can be placed below the root letter to form a combination.
Suffix: [letters not reproduced] These nine letters can be placed after the root letter to form a combination.
Second suffix: [letters not reproduced] These can be placed after the suffix to form a combination.
Vowel: [letters not reproduced] These four letters serve as vowel signs and are placed above or below the root letter.
Second subscript: [letters not reproduced] These can be placed below the subscript.
Modern Tibetan word: Tibetan words are spelled from Tibetan graphic characters, which include modern Tibetan characters, Sanskrit-transliteration Tibetan characters, archaic Tibetan characters and special Tibetan symbols. A modern Tibetan word is a Tibetan word that conforms to the rules of modern Tibetan grammar; it excludes characters used to write the sounds of ancient Indian languages and archaic Tibetan characters that do not conform to modern grammar.
1. General structure of modern Tibetan words
The glyph structure of a Tibetan word is built around one consonant letter as the core; the remaining letters are attached before and after it or stacked above and below it to form a complete word. A modern Tibetan word consists at minimum of a single consonant letter, that is, the root letter alone, and at most of six consonant letters and one vowel sign. A vowel sign cannot be written on its own; it can only be attached above or below a consonant letter. The core letter is called the "root letter", and any of the 30 consonant letters can serve as one. The other letters are named after their position relative to the root letter: the letter before the root letter is the "prefix", the letter above it the "superscript", the letter below it the "subscript", the letter after it the "suffix", and the letter added after the suffix the "second suffix" (or "repeated suffix").
Tibetan words are spelled from 30 consonant letters and four vowel signs (vowels for short). Vertical stacking takes place above and below the root letter, whereas the prefix, suffix and second suffix are single, unstacked consonant letters. Modern Tibetan grammar places strict constraints on how characters form a word: a word may consist of one to seven characters, of which the root letter is indispensable, while the presence of a component at each other position varies from word to word. A Tibetan syllable consists of at most seven characters (as shown in Fig. 1), and a word generally has only one vowel sign. Of the four vowels, the second is stacked below the consonant block (the lower circle in Fig. 1 represents this vowel), and the first, third and fourth are stacked above the consonant block (the upper circle in Fig. 1 represents these vowels). Fig. 2 shows an example of a Tibetan word with seven components.
2. Sorting of words
Sorting means arranging words by priority according to some rule. It is an important prerequisite for building dictionaries, looking up entries and similar work. Chinese characters are currently sorted mainly in two ways: by pronunciation and by stroke count. English words are sorted by comparing, from left to right, the priority of the letters at the same position in two words; on a computer this amounts to directly comparing the code values of the word strings, which determines the order. Sorting techniques for Chinese and English are mature and have fairly unified and complete standards in practical use.
3. Dictionary-order sorting of Tibetan
Sorting Tibetan words is a basic research topic in Tibetan information processing. Tibetan spelling differs from both English and Chinese: it is a nonlinear combination of horizontal and vertical spelling. The pronunciation-based and stroke-based methods used for Chinese characters cannot be applied, nor can the English method of comparing codes directly from left to right be borrowed as is. Sorting Tibetan syllables is therefore somewhat difficult.
Tibetan dictionary order is a comparatively well-founded way of sorting Tibetan. It determines the order of Tibetan words by comparing the characters at each position of the words, and the order of Tibetan text is in turn determined by the order of its words. Dictionary order is itself an artificially defined order, but after long use it has become an accepted and customary way of sorting Tibetan.
Prior art 1: In 1999, Zhaxi Ciren put forward a preliminary approach to Tibetan sorting in the paper "Ordering rules of Tibetan and the realization of automatic computer sorting". He proposed that, when comparing the precedence of Tibetan syllables, the sort priority of the seven classes of components be compared one by one in the order root letter -> superscript -> prefix -> subscript -> vowel sign -> suffix -> second suffix.
That paper only outlines the basic sorting idea and does not address the detailed implementation, for example how to identify each component of a syllable or how to handle missing components. More importantly, although almost all Tibetan words consist of one to seven components, there are in fact eight kinds of component; this method does not consider Tibetan words that contain a "second subscript".
Prior art 2: In 2004, Jiang Di proposed a mathematical model and algorithm for sorting written Tibetan. A Tibetan syllable is divided into six component positions: root letter, prefix, superscript, subscript, suffix and vowel. The sorting idea is to assign a numerical value to each character of a Tibetan word; when sorting, every syllable is first converted into a sequence of numbers by consulting different tables, and these numerical sequences are then sorted.
Like prior art 1, this scheme lacks a description of how a Tibetan syllable is split into components. Converting syllables into numerical sequences makes the final numerical sort convenient, but the conversion requires constant table lookups, which increases the load on the program. Moreover, Unicode already defines a standard order for Tibetan characters, so these custom lookup tables are unnecessary.
Prior art 3: In 2009, Bianba Wangdui et al. proposed a Tibetan sorting scheme based on the ISO/IEC 10646 Tibetan coded character set standard. The Tibetan syllable is simplified to six components, and six rule functions are defined to identify the components of a Tibetan word. When sorting, priority is judged in the order "root letter - superscript - prefix - subscript - vowel - suffix".
The paper merges three kinds of component (suffix, second suffix and second subscript) into one, while also assuming during sorting that "the same component can occur only once in a syllable". These two rules contradict each other to some extent, so not all Tibetan words can be identified accurately, which affects the sorting result.
All of the above exist only as papers and documents; they are theoretical studies, and no implemented software has been seen.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by providing a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification. The method identifies Tibetan components with an accuracy of 100%, defines the sort priority of Tibetan characters on the basis of the identified components, and can sort Tibetan characters correctly into dictionary order.
The invention is realized by the following technical solution: a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification, comprising the following steps:
S1. Analyze the structure of modern Tibetan words according to Tibetan grammar, concluding that Tibetan has 48 basic structures.
S2. Handle special structures first: determine whether the character contains the special component; if it does, determine the structure of the special component from the number of characters in the structure and the presence or absence of a vowel.
S3. Treat vertically stacked fixed combinations as a single unit. According to the structure of Tibetan, "superscript + root", "root + subscript" and "superscript + root + subscript" are taken as fixed structures for identifying components. The syllable to be judged is looked up in these structures; if it is found, its structure can be determined accurately and quickly. Three tables are built for handling the fixed structures and identifying special characters.
S4. Some Tibetan words with three components, no vowel and no stacking are ambiguous; one more table is built so that the 14 ambiguous characters receive special handling.
S5. Determine the components from whether the Tibetan character has a vowel and where the vowel is, split the character into components, and place the identified components into eight slots in the order "prefix - superscript - root - subscript - second subscript - vowel - suffix - second suffix".
S6. Define the dictionary-order sorting model: the core level, i.e. the first layer, is the root letter; layers two to seven are the superscript, prefix, subscript, vowel, suffix and second suffix respectively.
S7. Define a TibetWord structure. Each syllable read and its identified components are stored in one structure (the storage is mainly used to hold the syllable and its components), and a sorting algorithm is chosen to perform the sort.
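Steps S6 and S7 can be illustrated with a minimal Python sketch. Everything here beyond the name TibetWord and the layer order of S6 is an assumption made for illustration: the field names, the `sort_key` helper, the use of an empty string for a missing component, and comparison by Unicode code point. The components are filled in by hand rather than by the identification procedure, and the second subscript is stored but left out of the key because S6 lists only seven layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TibetWord:
    """One syllable plus its eight component slots (field names are illustrative)."""
    syllable: str
    prefix: str = ""
    superscript: str = ""
    root: str = ""
    subscript: str = ""
    subscript2: str = ""
    vowel: str = ""
    suffix: str = ""
    suffix2: str = ""


def sort_key(w: TibetWord) -> tuple:
    # Layer order from S6: root, superscript, prefix, subscript,
    # vowel, suffix, second suffix. An empty slot ("") sorts first.
    return (w.root, w.superscript, w.prefix, w.subscript,
            w.vowel, w.suffix, w.suffix2)


KA, GA, DA, RA, I = "\u0f40", "\u0f42", "\u0f51", "\u0f62", "\u0f72"

words = [
    TibetWord(GA, root=GA),                                      # root only
    TibetWord(DA + KA + RA, prefix=DA, root=KA, suffix=RA),      # prefix + root + suffix
    TibetWord(KA + I, root=KA, vowel=I),                         # root + vowel
    TibetWord(KA, root=KA),                                      # root only
]

ordered = sorted(words, key=sort_key)
```

Because the key is an ordinary tuple, any comparison sort can be used, which matches the statement in S7 that the sorting algorithm is a free choice.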
Preferably, S5 is as follows: when the Tibetan word has 1 character, that character must be the root letter, represented as "empty, empty, char" (slots are written in the eight-slot order of S5, and trailing empty slots are omitted).
Preferably, S5 is as follows: when the Tibetan word has 2 characters, the components are identified as follows:
First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2". If it is not, look the word up in Table 1; if it is found, the structure is "superscript + root". If it is not in Table 1, look it up in Table 2 and judge the structure to be "root + subscript".
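The two-character case can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the helper `slots`, the function name `identify_two`, and the one-entry stand-ins for Table 1 and Table 2 are hypothetical (the real tables are not reproduced in the text), and returning `None` when neither table matches is a choice made here because the text does not say what happens in that case.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # Tibetan vowel signs i, u, e, o

SLOTS = ("prefix", "superscript", "root", "subscript",
         "subscript2", "vowel", "suffix", "suffix2")

# Hypothetical one-entry stand-ins for Table 1 and Table 2.
SHANG_JI = {"\u0f62\u0f90"}  # ra + subjoined ka (superscript + root)
JI_XIA = {"\u0f40\u0fb1"}    # ka + subjoined ya (root + subscript)


def slots(**filled):
    """Return the eight slots in order, with "" for every missing component."""
    return tuple(filled.get(name, "") for name in SLOTS)


def identify_two(s):
    c1, c2 = s
    if c2 in VOWELS:           # root + vowel
        return slots(root=c1, vowel=c2)
    if s in SHANG_JI:          # found in Table 1
        return slots(superscript=c1, root=c2)
    if s in JI_XIA:            # found in Table 2
        return slots(root=c1, subscript=c2)
    return None                # not specified in the text
```

For example, `identify_two("\u0f40\u0f72")` places the first character in the root slot and the second in the vowel slot, which corresponds to the pattern "empty, empty, char 1, empty, empty, char 2".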
Preferably, S5 is as follows: when the Tibetan word has 3 characters, the components are identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2, char 3"; if not, go to 2).
2) Look the word up in Table 3. If it is found, the structure is "superscript + root + subscript"; if it is not in Table 3, go to 3).
3) Look up the first 2 characters in Table 1. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty, char 1, char 2, empty, empty, char 3"; if it is not, the structure is "empty, char 1, char 2, empty, empty, empty, char 3". If the first 2 characters are not in Table 1, go to 4).
4) Look up the last 2 characters in Table 1. If they are found, the structure is "char 1, char 2, char 3"; if they are not in Table 1, go to 5).
5) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel, which gives "empty, empty, char 1, char 2, empty, char 3" (vowel) or "empty, empty, char 1, char 2, empty, empty, char 3" (not a vowel). If they are not in Table 2, go to 6).
6) Look up the last 2 characters in Table 2. If they are found, the structure is "char 1, empty, char 2, char 3"; if they are not in Table 2, go to 7).
7) Determine whether the 3rd character is a vowel. If it is, the structure is "char 1, empty, char 2, empty, empty, char 3"; if not, go to 8).
8) Determine the structure by checking whether the three characters are one of the 17 special character strings in Table 4.
Preferably, S5 is as follows: when the Tibetan word has 4 characters, the structure is identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2, char 3, char 4"; if not, go to 2).
2) Look up the first 3 characters in Table 3 to determine whether they form a "superscript + root + subscript" structure. If they do, determine whether the 4th character is a vowel: if it is, the structure is "empty, char 1, char 2, char 3, empty, char 4"; if it is not, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4". If the first 3 characters are not in Table 3, go to 3).
3) Look up the first 2 characters in Table 1 to determine whether they form a "superscript + root" structure. If they do, determine whether the 3rd character is a vowel: if it is, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4"; if it is not, the structure is "empty, char 1, char 2, empty, empty, empty, char 3, char 4". If the first 2 characters are not in Table 1, go to 4).
4) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4"; if it is not, the structure is "empty, empty, char 1, char 2, empty, empty, char 3, char 4". If the first 2 characters are not in Table 2, go to 5).
5) Look up the last 3 characters in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4"; if not, go to 6).
6) Look up the middle 2 characters in Table 2. If they are found, determine whether the 4th character is a vowel: if it is, the structure is "char 1, empty, char 2, char 3, empty, char 4"; if it is not, the structure is "char 1, empty, char 2, char 3, empty, empty, char 4". If the middle 2 characters are not in Table 2, go to 7).
7) Determine the structure by checking whether the 3rd character is a vowel.
Preferably, S5 is as follows: when the Tibetan word has 5 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5"; if not, go to 2).
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5". If the first 3 characters are not in Table 3, look up the 2nd and 3rd characters in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5"; if they are not in Table 1, the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5". If the 4th character is not a vowel, go to 3).
3) Determine whether the 3rd character is a vowel. If it is, look up the first 2 characters in Table 1; if they are found, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4, char 5". If they are not in Table 1, look them up in Table 2; if they are found, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4, char 5"; if they are not in Table 2, the structure is "char 1, empty, char 2, empty, empty, char 3, char 4, char 5". If the 3rd character is not a vowel, go to 4).
4) Look up the first 3 characters in Table 3. If they are found, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4, char 5"; otherwise go to 5).
5) Look up the middle 3 characters in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5"; otherwise go to 6).
6) Look up the 2nd and 3rd characters in Table 1. If they are found, the structure is "char 1, char 2, char 3, empty, empty, empty, char 4, char 5"; otherwise the structure is "char 1, empty, char 2, char 3, empty, empty, char 4, char 5".
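The five-character decision procedure translates fairly directly into a chain of checks. The Python sketch below follows steps 1) to 6) as described; the function and helper names are hypothetical, and the lookup tables are tiny illustrative stand-ins for Tables 1 to 3, whose real contents are not reproduced in the text.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # Tibetan vowel signs i, u, e, o
SLOTS = ("prefix", "superscript", "root", "subscript",
         "subscript2", "vowel", "suffix", "suffix2")

# Hypothetical stand-ins for Tables 1, 2 and 3 (one entry each).
SHANG_JI = {"\u0f66\u0f90"}            # sa + subjoined ka
JI_XIA = {"\u0f40\u0fb1"}              # ka + subjoined ya
SHANG_JI_XIA = {"\u0f66\u0f90\u0fb1"}  # sa + subjoined ka + subjoined ya


def slots(**filled):
    """Return the eight slots in order, with "" for every missing component."""
    return tuple(filled.get(name, "") for name in SLOTS)


def identify_five(s):
    c1, c2, c3, c4, c5 = s
    if c5 in VOWELS:                                  # step 1
        return slots(prefix=c1, superscript=c2, root=c3, subscript=c4, vowel=c5)
    if c4 in VOWELS:                                  # step 2
        if s[:3] in SHANG_JI_XIA:
            return slots(superscript=c1, root=c2, subscript=c3, vowel=c4, suffix=c5)
        if s[1:3] in SHANG_JI:
            return slots(prefix=c1, superscript=c2, root=c3, vowel=c4, suffix=c5)
        return slots(prefix=c1, root=c2, subscript=c3, vowel=c4, suffix=c5)
    if c3 in VOWELS:                                  # step 3
        if s[:2] in SHANG_JI:
            return slots(superscript=c1, root=c2, vowel=c3, suffix=c4, suffix2=c5)
        if s[:2] in JI_XIA:
            return slots(root=c1, subscript=c2, vowel=c3, suffix=c4, suffix2=c5)
        return slots(prefix=c1, root=c2, vowel=c3, suffix=c4, suffix2=c5)
    if s[:3] in SHANG_JI_XIA:                         # step 4
        return slots(superscript=c1, root=c2, subscript=c3, suffix=c4, suffix2=c5)
    if s[1:4] in SHANG_JI_XIA:                        # step 5
        return slots(prefix=c1, superscript=c2, root=c3, subscript=c4, suffix=c5)
    if s[1:3] in SHANG_JI:                            # step 6
        return slots(prefix=c1, superscript=c2, root=c3, suffix=c4, suffix2=c5)
    return slots(prefix=c1, root=c2, subscript=c3, suffix=c4, suffix2=c5)
```

Every branch returns a full eight-slot tuple, so the result can be passed on to a sort key without special-casing missing components.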
Preferably, S5 is as follows: when the Tibetan word has 6 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6"; otherwise go to 2).
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5, char 6". If they are not in Table 3, look up the 2nd and 3rd characters in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5, char 6"; otherwise the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5, char 6". If the 4th character is not a vowel, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5, char 6".
Preferably, S5 is as follows: when the Tibetan word has 7 characters, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6, char 7".
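The six- and seven-character cases are the shortest, since longer words leave fewer possible arrangements. A hedged Python sketch follows; the names `identify_six`, `identify_seven` and `slots` are hypothetical, and the lookup tables are one-entry stand-ins for Tables 1 and 3 rather than the real data.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # Tibetan vowel signs i, u, e, o
SLOTS = ("prefix", "superscript", "root", "subscript",
         "subscript2", "vowel", "suffix", "suffix2")

# Hypothetical stand-ins for Table 1 and Table 3 (one entry each).
SHANG_JI = {"\u0f66\u0f90"}            # sa + subjoined ka
SHANG_JI_XIA = {"\u0f66\u0f90\u0fb1"}  # sa + subjoined ka + subjoined ya


def slots(**filled):
    """Return the eight slots in order, with "" for every missing component."""
    return tuple(filled.get(name, "") for name in SLOTS)


def identify_six(s):
    c1, c2, c3, c4, c5, c6 = s
    if c5 in VOWELS:                       # step 1
        return slots(prefix=c1, superscript=c2, root=c3, subscript=c4,
                     vowel=c5, suffix=c6)
    if c4 in VOWELS:                       # step 2
        if s[:3] in SHANG_JI_XIA:
            return slots(superscript=c1, root=c2, subscript=c3,
                         vowel=c4, suffix=c5, suffix2=c6)
        if s[1:3] in SHANG_JI:
            return slots(prefix=c1, superscript=c2, root=c3,
                         vowel=c4, suffix=c5, suffix2=c6)
        return slots(prefix=c1, root=c2, subscript=c3,
                     vowel=c4, suffix=c5, suffix2=c6)
    # No vowel in position 4 or 5: all six characters are consonants.
    return slots(prefix=c1, superscript=c2, root=c3, subscript=c4,
                 suffix=c5, suffix2=c6)


def identify_seven(s):
    c1, c2, c3, c4, c5, c6, c7 = s
    # Only one arrangement is possible; the second subscript stays empty.
    return slots(prefix=c1, superscript=c2, root=c3, subscript=c4,
                 vowel=c5, suffix=c6, suffix2=c7)
```

Note that the seven-character case needs no table lookup at all, because every slot except the second subscript must be occupied.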
Compared with the prior art, the invention has the following beneficial effects:
1. Correct identification of Tibetan character structure. Because all modern Tibetan words fall into 48 structures, identifying the structure of a Tibetan character by the method proposed in the invention can reach an accuracy of 100%.
2. A dictionary-order sorting method for Tibetan characters. Once the components are identified, Tibetan characters can be sorted by the proposed method, and the result conforms to Tibetan dictionary order. The method can be widely applied to sorting Tibetan data on computers, arranging Tibetan dictionaries, and similar tasks.
Brief description of the drawings
Fig. 1 is the structural pattern diagram;
Fig. 2 is an example of a Tibetan word with 7 components;
Fig. 3 is the identification flow for syllables with one component;
Fig. 4 is the identification flow for syllables with two components;
Fig. 5 is the identification flow for syllables with three components;
Fig. 6 is the identification flow for syllables with four components;
Fig. 7 is the identification flow for syllables with five components;
Fig. 8 is the identification flow for syllables with six components;
Fig. 9 is the hierarchy diagram of Tibetan dictionary order;
Fig. 10 is the flow chart of the main function.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawings.
The scheme of the invention consists of two steps: Tibetan character component identification and Tibetan character sorting. Component identification is the prerequisite for sorting: only after the components of a Tibetan character have been correctly identified can the characters be sorted.
I. Tibetan character component identification
The components of a Tibetan character are identified from the structure of modern Tibetan characters and the number of characters that make up the Tibetan character. The method is as follows:
1. The invention analyzes the structure of modern Tibetan words according to Tibetan grammar and concludes that Tibetan has 48 basic structures, as shown in Table 1.
Table 1: The 48 structures of modern Tibetan words
2. Handle "special structures" first
The implementation first determines whether the character contains the "special component". The "special component" is a "second subscript" beneath the "subscript". Although it conforms to modern Tibetan grammar, this structure mainly occurs only in two characters [not reproduced] and in the character sets that contain them. If the "special component" is present, its structure is then determined from the number of characters in the structure and the presence or absence of a vowel.
3. The invention treats vertically stacked fixed combinations as a single unit.
Study of Tibetan syllable structure shows that the grammar restricts the stackings "superscript + root", "root + subscript" and "superscript + root + subscript" very strictly; they are limited in number and follow no regular pattern. These three structures are therefore taken as fixed structures: the syllable to be judged is looked up in them, and if it is found its structure can be determined accurately and quickly. Four tables are built for handling the fixed structures and identifying special characters.
Table | Name | Description |
Table 1 | shang_ji[33] | Superscript + root |
Table 2 | ji_xia[36] | Root + subscript |
Table 3 | shang_ji_xia[15] | Superscript + root + subscript |
4. Special handling of the 14 "ambiguous" characters
Some three-component Tibetan words are "ambiguous": for example, [syllable not reproduced] can be identified either as "prefix + root + suffix" or as "root + suffix + second suffix". Syllables of this kind need special handling in the algorithm. Manual collation found 14 ambiguous special syllables in total, as shown in the table below; the algorithm treats all 14 as having the structure "root + suffix + second suffix".
5. Identify components using the special nature of vowels
The invention also makes full use of the four vowels, which are special components of Tibetan characters, and determines the components from whether a vowel is present and where it is. The character is split into components, and the identified components are placed into eight slots in the order "prefix - superscript - root - subscript - second subscript - vowel - suffix - second suffix" (an ordinary Tibetan character has seven components; only the "special component" described above has a "second subscript"). Missing components are filled with "empty", "0" or "NULL".
The detailed procedure is as follows:
The invention identifies the components of a Tibetan character from the structure of modern Tibetan characters and the number of characters in the Tibetan character. The flow chart is shown in Fig. 3.
After the "special component" has been handled, the invention divides Tibetan characters into seven cases by character count and processes each separately; the seven cases correspond to Tibetan characters made up of 1 to 7 components. The identification method for each of the seven cases is as follows:
1. If the Tibetan word has 1 character, that character must be the root letter, represented as "empty, empty, char" (slots are written in the eight-slot order given above, and trailing empty slots are omitted).
2. If the Tibetan word has 2 characters, the components are identified as follows:
First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2". If it is not, look the word up in Table 1; if it is found, the structure is "superscript + root". If it is not in Table 1, look it up in Table 2 and judge the structure to be "root + subscript". The flow chart is shown in Fig. 4.
3. If the Tibetan character has 3 component characters, the components are identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is written as "empty, empty, char 1, empty, empty, char 2, char 3"; if not, go to step 2).
2) Look the three characters up in Table 3. If they are found, the structure is judged to be "superscript + base + subscript"; if they are not in Table 3, go to step 3).
3) Look up Table 1 to determine whether the first 2 characters are in it. If they are, determine whether the 3rd character is a vowel: if it is, the structure is written as "empty, char 1, char 2, empty, empty, char 3"; if the 3rd character is not a vowel, it is written as "empty, char 1, char 2, empty, empty, empty, char 3". If the first 2 characters are not in Table 1, go to step 4).
4) Look the last 2 characters up in Table 1. If they are found, the structure is "char 1, char 2, char 3"; if the last 2 characters are not in Table 1, go to step 5).
5) Look the first 2 characters up in Table 2. If they are found, determine whether the 3rd character is a vowel, which gives "empty, empty, char 1, char 2, empty, char 3" (vowel) or "empty, empty, char 1, char 2, empty, empty, char 3" (not a vowel). If they are not in Table 2, go to step 6).
6) Look the last 2 characters up in Table 2. If they are found, the structure is "char 1, empty, char 2, char 3"; if they are not in Table 2, go to step 7).
7) Determine whether the 3rd character is a vowel. If it is, the structure is written as "char 1, empty, char 2, empty, empty, char 3"; if it is not a vowel, go to step 8).
8) Determine the structure by checking whether the three characters are one of the 17 special forms in Table 4.
The flowchart is shown in Fig. 5.
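The eight steps for three characters translate into a chain of early returns. The sketch below assumes Wylie-style transliteration and uses tiny hypothetical samples for Tables 1 to 3; `TABLE4` is left empty because its 17 special forms are not listed in this passage. All identifiers (`slots`, `identify_three`, and the table names) are illustrative, not taken from the patent.

```python
VOWELS = {"i", "u", "e", "o"}
TABLE1 = {("r", "k"), ("s", "g")}                 # assumed "superscript + base" samples
TABLE2 = {("k", "y"), ("g", "r")}                 # assumed "base + subscript" samples
TABLE3 = {("s", "k", "y"), ("s", "g", "r")}       # assumed "superscript + base + subscript" samples
TABLE4 = {}                                       # special forms -> slot tuple (contents unknown here)

SLOTS = ("prefix", "superscript", "base", "subscript",
         "second_subscript", "vowel", "suffix", "second_suffix")


def slots(**filled):
    """Build the 8-slot tuple; any slot not named stays None (empty)."""
    return tuple(filled.get(name) for name in SLOTS)


def identify_three(chars):
    """Follow steps 1) to 8) for a three-character syllable."""
    c1, c2, c3 = chars
    if c2 in VOWELS:                                   # step 1
        return slots(base=c1, vowel=c2, suffix=c3)
    if (c1, c2, c3) in TABLE3:                         # step 2
        return slots(superscript=c1, base=c2, subscript=c3)
    if (c1, c2) in TABLE1:                             # step 3
        if c3 in VOWELS:
            return slots(superscript=c1, base=c2, vowel=c3)
        return slots(superscript=c1, base=c2, suffix=c3)
    if (c2, c3) in TABLE1:                             # step 4
        return slots(prefix=c1, superscript=c2, base=c3)
    if (c1, c2) in TABLE2:                             # step 5
        if c3 in VOWELS:
            return slots(base=c1, subscript=c2, vowel=c3)
        return slots(base=c1, subscript=c2, suffix=c3)
    if (c2, c3) in TABLE2:                             # step 6
        return slots(prefix=c1, base=c2, subscript=c3)
    if c3 in VOWELS:                                   # step 7
        return slots(prefix=c1, base=c2, vowel=c3)
    return TABLE4.get((c1, c2, c3))                    # step 8


print(identify_three(("b", "o", "d")))
```

The order of the checks matters: the stacked-letter tables are consulted before the fallback vowel test in step 7, so a syllable that matches a fixed structure is never misread as prefix plus base.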
4. If the Tibetan character has 4 component characters, the structure is identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2, char 3, char 4"; if it is not a vowel, go to step 2).
2) Look up Table 3 to determine whether the first 3 characters form a "superscript + base + subscript" structure. If they do, determine whether the 4th character is a vowel: if it is, the structure is "empty, char 1, char 2, char 3, empty, char 4"; if it is not, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4". If the first 3 characters are not in Table 3, go to step 3).
3) Look up Table 1 to determine whether the first 2 characters form a "superscript + base" structure. If they do, determine whether the 3rd character is a vowel: if it is, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4"; if the 3rd character is not a vowel, the structure is "empty, char 1, char 2, empty, empty, empty, char 3, char 4". If the first 2 characters are not in Table 1, go to step 4).
4) Look the first 2 characters up in Table 2. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4"; if the 3rd character is not a vowel, the structure is "empty, empty, char 1, char 2, empty, empty, char 3, char 4". If the first 2 characters are not in Table 2, go to step 5).
5) Look the last 3 characters up in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4"; if not, go to step 6).
6) Look the middle 2 characters up in Table 2. If they are found, determine whether the 4th character is a vowel: if it is, the structure is "char 1, empty, char 2, char 3, empty, char 4"; if the 4th character is not a vowel, the structure is "char 1, empty, char 2, char 3, empty, empty, char 4". If the middle 2 characters are not in Table 2, go to step 7).
7) Determine the structure by checking whether the 3rd character is a vowel.
The detailed flowchart is shown in Fig. 6.
5. If the Tibetan character has 5 component characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5"; if the 5th character is not a vowel, go to step 2).
2) Determine whether the 4th character is a vowel. If it is, look the first 3 characters up in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5". If the first 3 characters are not in Table 3, look characters 2 and 3 up in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5"; if they are not in Table 1, the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5". If the 4th character is not a vowel, go to step 3).
3) Determine whether the 3rd character is a vowel. If it is, look the first 2 characters up in Table 1; if they are found, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4, char 5". If they are not in Table 1, look them up in Table 2; if they are found, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4, char 5"; if they are not in Table 2, the structure is "char 1, empty, char 2, empty, empty, char 3, char 4, char 5". If the 3rd character is not a vowel, go to step 4).
4) Look the first 3 characters up in Table 3. If they are found, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4, char 5"; otherwise go to step 5).
5) Look the middle 3 characters up in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5"; otherwise go to step 6).
6) Look characters 2 and 3 up in Table 1. If they are found, the structure is "char 1, char 2, char 3, empty, empty, empty, char 4, char 5"; otherwise the structure is "char 1, empty, char 2, char 3, empty, empty, char 4, char 5".
The detailed flowchart is shown in Fig. 7.
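For five characters the decision is driven mainly by where the vowel sits (5th, 4th or 3rd position, or absent), with table lookups used only to tell stacked letters from prefixes. The following sketch mirrors steps 1) to 6); the table contents are hypothetical samples in Wylie-style transliteration and every identifier is an illustrative assumption.

```python
VOWELS = {"i", "u", "e", "o"}
TABLE1 = {("r", "k"), ("s", "g")}                 # assumed "superscript + base" samples
TABLE2 = {("k", "y"), ("g", "r")}                 # assumed "base + subscript" samples
TABLE3 = {("s", "k", "y"), ("s", "g", "r")}       # assumed "superscript + base + subscript" samples

SLOTS = ("prefix", "superscript", "base", "subscript",
         "second_subscript", "vowel", "suffix", "second_suffix")


def slots(**filled):
    """Build the 8-slot tuple; any slot not named stays None (empty)."""
    return tuple(filled.get(name) for name in SLOTS)


def identify_five(chars):
    """Follow steps 1) to 6) for a five-character syllable."""
    c1, c2, c3, c4, c5 = chars
    if c5 in VOWELS:                                   # step 1
        return slots(prefix=c1, superscript=c2, base=c3, subscript=c4, vowel=c5)
    if c4 in VOWELS:                                   # step 2
        if (c1, c2, c3) in TABLE3:
            return slots(superscript=c1, base=c2, subscript=c3, vowel=c4, suffix=c5)
        if (c2, c3) in TABLE1:
            return slots(prefix=c1, superscript=c2, base=c3, vowel=c4, suffix=c5)
        return slots(prefix=c1, base=c2, subscript=c3, vowel=c4, suffix=c5)
    if c3 in VOWELS:                                   # step 3
        if (c1, c2) in TABLE1:
            return slots(superscript=c1, base=c2, vowel=c3, suffix=c4, second_suffix=c5)
        if (c1, c2) in TABLE2:
            return slots(base=c1, subscript=c2, vowel=c3, suffix=c4, second_suffix=c5)
        return slots(prefix=c1, base=c2, vowel=c3, suffix=c4, second_suffix=c5)
    if (c1, c2, c3) in TABLE3:                         # step 4
        return slots(superscript=c1, base=c2, subscript=c3, suffix=c4, second_suffix=c5)
    if (c2, c3, c4) in TABLE3:                         # step 5
        return slots(prefix=c1, superscript=c2, base=c3, subscript=c4, suffix=c5)
    if (c2, c3) in TABLE1:                             # step 6
        return slots(prefix=c1, superscript=c2, base=c3, suffix=c4, second_suffix=c5)
    return slots(prefix=c1, base=c2, subscript=c3, suffix=c4, second_suffix=c5)


print(identify_five(("b", "s", "g", "o", "m")))
```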
6. If the Tibetan character has 6 component characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6"; otherwise go to step 2).
2) Determine whether the 4th character is a vowel. If it is, look the first 3 characters up in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5, char 6". If they are not in Table 3, look characters 2 and 3 up in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5, char 6"; otherwise the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5, char 6". If the 4th character is not a vowel, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5, char 6".
The detailed flowchart is shown in Fig. 8.
7. If the Tibetan character has 7 component characters, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6, char 7".
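The six- and seven-character cases leave little room for ambiguity, which the short sketch below makes visible. As before, the table contents are hypothetical samples in Wylie-style transliteration and the function names are illustrative assumptions rather than the patent's own code.

```python
VOWELS = {"i", "u", "e", "o"}
TABLE1 = {("r", "k"), ("s", "g")}                 # assumed "superscript + base" samples
TABLE3 = {("s", "k", "y"), ("s", "g", "r")}       # assumed "superscript + base + subscript" samples

SLOTS = ("prefix", "superscript", "base", "subscript",
         "second_subscript", "vowel", "suffix", "second_suffix")


def slots(**filled):
    """Build the 8-slot tuple; any slot not named stays None (empty)."""
    return tuple(filled.get(name) for name in SLOTS)


def identify_six(chars):
    """Follow steps 1) and 2) for a six-character syllable."""
    c1, c2, c3, c4, c5, c6 = chars
    if c5 in VOWELS:                                   # step 1
        return slots(prefix=c1, superscript=c2, base=c3, subscript=c4,
                     vowel=c5, suffix=c6)
    if c4 in VOWELS:                                   # step 2
        if (c1, c2, c3) in TABLE3:
            return slots(superscript=c1, base=c2, subscript=c3,
                         vowel=c4, suffix=c5, second_suffix=c6)
        if (c2, c3) in TABLE1:
            return slots(prefix=c1, superscript=c2, base=c3,
                         vowel=c4, suffix=c5, second_suffix=c6)
        return slots(prefix=c1, base=c2, subscript=c3,
                     vowel=c4, suffix=c5, second_suffix=c6)
    # 4th character is not a vowel: no vowel sign at all
    return slots(prefix=c1, superscript=c2, base=c3, subscript=c4,
                 suffix=c5, second_suffix=c6)


def identify_seven(chars):
    """Seven characters fill every slot except the second subscript."""
    c1, c2, c3, c4, c5, c6, c7 = chars
    return slots(prefix=c1, superscript=c2, base=c3, subscript=c4,
                 vowel=c5, suffix=c6, second_suffix=c7)


print(identify_seven(tuple("bsgrubs")))
```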
II. Sorting of Tibetan characters
1. The sorting model for the dictionary order of Tibetan characters determined by the invention
Investigation shows that, because Tibetan dictionary order is an artificially prescribed order, different dictionaries prescribe somewhat different orderings of characters. Analysis of the ordering used in dictionaries such as the Tibetan-Chinese Dictionary shows that the dictionary order of Tibetan characters is a layered cycle. As shown in Fig. 9, the layers are as follows.
The core layer, i.e., the first layer, is the base-letter layer; the base letter is the fundamental and indispensable component of every Tibetan character. Layers two through seven are, respectively, the superscript letter, prefix letter, subscript letter, vowel, suffix letter and second suffix letter. The characters on layers two through seven are not indispensable components of a Tibetan character: depending on the character, these components may be absent, and an absent component is represented by 0 in the figure.
The dictionary order of modern Tibetan characters takes the base letter as the core and combines it layer by layer with the characters of layers two through seven, each layer combining in turn with the characters of the layers outside it; within each layer, the consonant components follow Tibetan alphabetical order. For example, the first character in dictionary order is [glyph] combined with 0 on the other six layers; the second character is [glyph] combined with 0 on layers two through five and with [glyph] on layer six. A layer-seven second suffix must be attached after a suffix; a suffix standing alone can also be regarded as a suffix combined with a second suffix of 0. By analogy, the dictionary order of characters should be: (if any)
2. Implementation of the Tibetan character sorting determined by the invention
(1) How Tibetan characters are stored in the computer
Tibetan sorting compares the structure of Tibetan syllable characters, so after a Tibetan character is read from the text its components must first be identified. Both the components and the Tibetan syllable character are used as elements, so a TibetWord structure is defined, and the syllable that was read and the components identified from it are stored together in one structure. The storage is used mainly to hold the syllable and its components. The structure array is defined as follows:
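The original definition is not reproduced in this text. As a hedged illustration only, a record holding the syllable together with its eight component slots might look like the following Python dataclass; the field names `syllable` and `components` and the length check are assumptions, and only the name `TibetWord` comes from the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TibetWord:
    """One syllable plus its identified components (hypothetical layout)."""
    syllable: str                                # the syllable as read from the text
    components: Tuple[Optional[str], ...]        # 8 slots, None where a component is absent

    def __post_init__(self):
        # prefix, superscript, base, subscript, second subscript,
        # vowel, suffix, second suffix
        if len(self.components) != 8:
            raise ValueError("components must have exactly 8 slots")


words = [
    TibetWord("bod", (None, None, "b", None, None, "o", "d", None)),
    TibetWord("ka", (None, None, "k", None, None, None, None, None)),
]
print(words[0].components[2])  # base letter of the first entry
```

A list of such records plays the role of the structure array: sorting rearranges the records by their `components`, and the `syllable` field travels with them.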
(2) Sorting method for Tibetan characters
A sorting method is selected to perform the sort. Within the sort, the comparison of two Tibetan characters proceeds as follows:
1. Compare the base letters, i.e., position 3. If the base letters differ, the comparison ends and the result of comparing the base letters is returned; otherwise go to step 2.
2. Compare the superscript letters, i.e., position 2. If they differ, the comparison ends and the result of comparing the superscript letters is returned; otherwise go to step 3.
3. Compare the prefix letters, i.e., position 1. If they differ, the comparison ends and the result of comparing the prefix letters is returned; otherwise go to step 4.
4. Compare the subscript letters, i.e., position 4. If they differ, the comparison ends and the result of comparing the subscript letters is returned; otherwise go to step 5.
5. Compare the second subscript letters, i.e., position 5. If they differ, the comparison ends and the result of comparing the second subscript letters is returned; otherwise go to step 6.
6. Compare the vowels, i.e., position 6. If they differ, the comparison ends and the result of comparing the vowels is returned; otherwise go to step 7.
7. Compare the suffix letters, i.e., position 7. If they differ, the comparison ends and the result of comparing the suffix letters is returned; otherwise go to step 8.
8. Compare the second suffix letters, i.e., position 8, return the result of comparing the second suffix letters, and end the comparison.
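The eight comparison steps amount to comparing the slots in the fixed order 3, 2, 1, 4, 5, 6, 7, 8 and stopping at the first difference. The sketch below assumes that components are transliterated Wylie-style, that consonants rank in the traditional Tibetan alphabet order, and that an empty slot ranks before any letter (consistent with absent components being written as 0). The names `RANK`, `COMPARE_ORDER`, `slots` and `compare` are introduced here for illustration.

```python
from functools import cmp_to_key

# Traditional order of the 30 Tibetan consonants, transliterated (assumption of this sketch).
ALPHABET = ["k", "kh", "g", "ng", "c", "ch", "j", "ny", "t", "th", "d", "n",
            "p", "ph", "b", "m", "ts", "tsh", "dz", "w", "zh", "z", "'", "y",
            "r", "l", "sh", "s", "h", "a"]
VOWEL_SIGNS = ["i", "u", "e", "o"]

RANK = {None: 0}                                   # an empty slot sorts first
RANK.update({c: i + 1 for i, c in enumerate(ALPHABET + VOWEL_SIGNS)})

# Zero-based slot indexes in comparison order: base, superscript, prefix,
# subscript, second subscript, vowel, suffix, second suffix.
COMPARE_ORDER = (2, 1, 0, 3, 4, 5, 6, 7)

SLOTS = ("prefix", "superscript", "base", "subscript",
         "second_subscript", "vowel", "suffix", "second_suffix")


def slots(**filled):
    """Build the 8-slot tuple; any slot not named stays None (empty)."""
    return tuple(filled.get(name) for name in SLOTS)


def compare(a, b):
    """Return -1, 0 or 1 after comparing two 8-slot tuples step by step."""
    for i in COMPARE_ORDER:
        ra, rb = RANK[a[i]], RANK[b[i]]
        if ra != rb:
            return -1 if ra < rb else 1
    return 0


ka, dka = slots(base="k"), slots(prefix="d", base="k")
rka, kha = slots(superscript="r", base="k"), slots(base="kh")
ordered = sorted([kha, rka, dka, ka], key=cmp_to_key(compare))
print(ordered == [ka, dka, rka, kha])
```

Because the base letter is compared first, every syllable built on one base letter is grouped together before the next base letter begins, which is the layered behaviour the model describes.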
(3) Application in a specific computer sorting method
The flowchart of the function differs with the sorting algorithm selected; the flowchart for sorting Tibetan characters with merge sort is shown in Fig. 10.
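Since Fig. 10 is not reproduced here, the following is a generic top-down merge sort that accepts any two-argument comparison function; it is a standard textbook formulation, not a transcription of the patent's flowchart. The comparator `by_rank` and the integer tuples standing in for ranked component slots are illustrative assumptions.

```python
def merge_sort(items, cmp):
    """Return a new sorted list; cmp(a, b) returns a negative, zero or positive number."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], cmp)
    right = merge_sort(items[mid:], cmp)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if cmp(left[i], right[j]) <= 0:      # "<=" keeps equal items in input order (stable)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def by_rank(a, b):
    """Compare two tuples of integer ranks lexicographically."""
    return (a > b) - (a < b)


# Each tuple stands for ranks already arranged in comparison order
# (base, superscript, prefix, ...), with 0 meaning an empty slot.
print(merge_sort([(2, 0, 0), (1, 25, 0), (1, 0, 11), (1, 0, 0)], by_rank))
```

Any comparison-based sort would give the same final order; merge sort additionally guarantees O(n log n) comparisons and keeps equal entries in their original order.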
Claims (8)
1. the method for a Tibetan language character dictionary based on Tibetan language character component identification technology sequence, it is characterised in that include with
The structure word structure of Modern Tibetan word is analyzed by lower step: S1. according to the Tibetan language syntax, show that Tibetan language has 48 kinds of basic structures;S2.
Priority treatment special construction, first determines whether whether to contain in this character special component syllable, if there being special component, according still further to this
Character number in structure and judge the structure of this special component with or without vowel;S3. the combination block of longitudinally fixed for Tibetan language superposition
As a disposed of in its entirety, according to the structure of Tibetan language, " upper word adding+base word ", " base word+down word adding ", " upper word adding+base word+under
Add word " as fixing structure recognition Tibetan language character component, current syllable to be judged is searched in these structures, if
This structure finds the structure that just can judge this syllable very well, soon, then sets up 3 tables, be used for processing fixed structure and knowledge
Other spcial character;S4. there are some to have ambiguity the Tibetan word without vowel, three components not having superposition, resettle 1 table
Ambiguous 14 characters are carried out special handling;S5. judge component from Tibetan language character with or without the position of vowel and vowel, enter
Row component splits, by the component of Tibetan language character that identifies according to " down word adding-the unit of pre-script-upper word adding-base word-down word adding-again
The back word adding of sound-back word adding-again " eight parts place;S6. the order models of Tibetan language character words canonical ordering, most crucial level are determined
I.e. ground floor is base word layer, and is upper word adding, pre-script, down word adding, vowel, back word adding and more respectively from the second layer to layer 7
Back word adding;S7. one TibetWord structure of definition, is stored in the component of the syllable read and identification in one structure, deposits
Storage space is mainly used to deposit syllable and component, selects a kind of sort method to be ranked up.
2. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 1, characterized in that S5 is specifically as follows: when the Tibetan character consists of 1 character, that character must be the base letter, and the structure is written as "empty, empty, char".
3. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 1 or 2, characterized in that S5 is specifically as follows: when the Tibetan character consists of 2 characters, the components are identified as follows:
First determine whether the 2nd character is a vowel. If it is, the structure is written as "empty, empty, char 1, empty, empty, char 2". If it is not a vowel, look the pair up in Table 1; if it is found there, the structure is judged to be "superscript + base". If it is not in Table 1, look it up in Table 2 and judge the structure to be "base + subscript".
4. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 3, characterized in that S5 is specifically as follows: when there are 3 component characters, the components are identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is written as "empty, empty, char 1, empty, empty, char 2, char 3"; if not, go to step 2).
2) Look the three characters up in Table 3. If they are found, the structure is judged to be "superscript + base + subscript"; if they are not in Table 3, go to step 3).
3) Look up Table 1 to determine whether the first 2 characters are in it. If they are, determine whether the 3rd character is a vowel: if it is, the structure is written as "empty, char 1, char 2, empty, empty, char 3"; if the 3rd character is not a vowel, it is written as "empty, char 1, char 2, empty, empty, empty, char 3". If the first 2 characters are not in Table 1, go to step 4).
4) Look the last 2 characters up in Table 1. If they are found, the structure is "char 1, char 2, char 3"; if the last 2 characters are not in Table 1, go to step 5).
5) Look the first 2 characters up in Table 2. If they are found, determine whether the 3rd character is a vowel, which gives "empty, empty, char 1, char 2, empty, char 3" (vowel) or "empty, empty, char 1, char 2, empty, empty, char 3" (not a vowel). If they are not in Table 2, go to step 6).
6) Look the last 2 characters up in Table 2. If they are found, the structure is "char 1, empty, char 2, char 3"; if they are not in Table 2, go to step 7).
7) Determine whether the 3rd character is a vowel. If it is, the structure is written as "char 1, empty, char 2, empty, empty, char 3"; if it is not a vowel, go to step 8).
8) Determine the structure by checking whether the three characters are one of the 17 special forms in Table 4.
5. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 4, characterized in that S5 is specifically as follows: when there are 4 component characters, the structure is identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty, empty, char 1, empty, empty, char 2, char 3, char 4"; if it is not a vowel, go to step 2).
2) Look up Table 3 to determine whether the first 3 characters form a "superscript + base + subscript" structure. If they do, determine whether the 4th character is a vowel: if it is, the structure is "empty, char 1, char 2, char 3, empty, char 4"; if it is not, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4". If the first 3 characters are not in Table 3, go to step 3).
3) Look up Table 1 to determine whether the first 2 characters form a "superscript + base" structure. If they do, determine whether the 3rd character is a vowel: if it is, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4"; if the 3rd character is not a vowel, the structure is "empty, char 1, char 2, empty, empty, empty, char 3, char 4". If the first 2 characters are not in Table 1, go to step 4).
4) Look the first 2 characters up in Table 2. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4"; if the 3rd character is not a vowel, the structure is "empty, empty, char 1, char 2, empty, empty, char 3, char 4". If the first 2 characters are not in Table 2, go to step 5).
5) Look the last 3 characters up in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4"; if not, go to step 6).
6) Look the middle 2 characters up in Table 2. If they are found, determine whether the 4th character is a vowel: if it is, the structure is "char 1, empty, char 2, char 3, empty, char 4"; if the 4th character is not a vowel, the structure is "char 1, empty, char 2, char 3, empty, empty, char 4". If the middle 2 characters are not in Table 2, go to step 7).
7) Determine the structure by checking whether the 3rd character is a vowel.
6. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 5, characterized in that S5 is specifically as follows: when there are 5 component characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5"; if the 5th character is not a vowel, go to step 2).
2) Determine whether the 4th character is a vowel. If it is, look the first 3 characters up in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5". If the first 3 characters are not in Table 3, look characters 2 and 3 up in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5"; if they are not in Table 1, the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5". If the 4th character is not a vowel, go to step 3).
3) Determine whether the 3rd character is a vowel. If it is, look the first 2 characters up in Table 1; if they are found, the structure is "empty, char 1, char 2, empty, empty, char 3, char 4, char 5". If they are not in Table 1, look them up in Table 2; if they are found, the structure is "empty, empty, char 1, char 2, empty, char 3, char 4, char 5"; if they are not in Table 2, the structure is "char 1, empty, char 2, empty, empty, char 3, char 4, char 5". If the 3rd character is not a vowel, go to step 4).
4) Look the first 3 characters up in Table 3. If they are found, the structure is "empty, char 1, char 2, char 3, empty, empty, char 4, char 5"; otherwise go to step 5).
5) Look the middle 3 characters up in Table 3. If they are found, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5"; otherwise go to step 6).
6) Look characters 2 and 3 up in Table 1. If they are found, the structure is "char 1, char 2, char 3, empty, empty, empty, char 4, char 5"; otherwise the structure is "char 1, empty, char 2, char 3, empty, empty, char 4, char 5".
7. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 6, characterized in that S5 is specifically as follows: when there are 6 component characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6"; otherwise go to step 2).
2) Determine whether the 4th character is a vowel. If it is, look the first 3 characters up in Table 3; if they are found, the structure is "empty, char 1, char 2, char 3, empty, char 4, char 5, char 6". If they are not in Table 3, look characters 2 and 3 up in Table 1; if they are found, the structure is "char 1, char 2, char 3, empty, empty, char 4, char 5, char 6"; otherwise the structure is "char 1, empty, char 2, char 3, empty, char 4, char 5, char 6". If the 4th character is not a vowel, the structure is "char 1, char 2, char 3, char 4, empty, empty, char 5, char 6".
8. The method for dictionary sorting of Tibetan characters based on Tibetan character component identification according to claim 7, characterized in that S5 is specifically as follows: when there are 7 component characters, the structure is "char 1, char 2, char 3, char 4, empty, char 5, char 6, char 7".
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610662188.5A CN106250357A (en) | 2016-08-12 | 2016-08-12 | The method of Tibetan language character dictionary based on Tibetan language character component identification technology sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106250357A true CN106250357A (en) | 2016-12-21 |
Family
ID=57591745
Family Applications (1)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20161221 |