CN106250357A - Method for dictionary-order sorting of Tibetan characters based on Tibetan character component identification - Google Patents

Info

Publication number
CN106250357A
Authority
CN
China
Prior art keywords
character
empty
characters
tibetan language
vowel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610662188.5A
Other languages
Chinese (zh)
Inventor
高定国
格桑多吉
普次仁
高红梅
李苗苗
巴桑卓玛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201610662188.5A priority Critical patent/CN106250357A/en
Publication of CN106250357A publication Critical patent/CN106250357A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification, and relates to the field of Tibetan information processing. The method consists of two steps: Tibetan character component identification and Tibetan character sorting. Component identification is the prerequisite for sorting: only after the components of a Tibetan character have been correctly identified can the characters be sorted. The beneficial effects are: 1. Correct identification of Tibetan character structure. Because all modern Tibetan words fall within 48 structures, identifying the structure of a Tibetan character with the proposed method can reach 100% accuracy. 2. A dictionary-order sorting method for Tibetan characters. Once the components have been identified, Tibetan characters can be sorted with the proposed method, and the result conforms to Tibetan dictionary order. The method can be widely applied to sorting Tibetan data on computers, arranging Tibetan dictionaries, and so on.

Description

Method for dictionary-order sorting of Tibetan characters based on Tibetan character component identification
Technical field
The present invention relates to the field of Tibetan information processing, and specifically to a method for sorting Tibetan characters in dictionary order based on Tibetan character component identification.
Background art
Abbreviations and key term definitions
Tibetan word: a Tibetan character string that represents one Tibetan syllable is called a Tibetan word.
Length of a Tibetan word: the number of characters that make up the Tibetan word. A Tibetan word consists of one to seven characters: at most seven and at least one.
Base letter: the indispensable component of a Tibetan word. With the base letter as the core, a Tibetan word is formed by combining it with a prefix letter, superscript letter, suffix letter, subscript letter, vowel, second suffix letter, and (second subscript letter).
Prefix letter: five letters (glyphs not reproduced in this text) that can be placed before the base letter to form a combination.
Superscript letter: three letters (glyphs not reproduced in this text) that can be placed above the base letter to form a combination.
Subscript letter: three letters (glyphs not reproduced in this text) that can be placed below the base letter to form a combination.
Suffix letter: nine letters (glyphs not reproduced in this text) that can be placed after the base letter to form a combination.
Second suffix letter: letters (glyphs not reproduced in this text) that can be placed after the suffix letter to form a combination.
Vowel: four letters (glyphs not reproduced in this text) that serve as vowel signs and are placed above or below the base letter.
Second subscript letter: a letter (glyph not reproduced in this text) that can be placed below the subscript letter.
Modern Tibetan word: Tibetan words are spelled from Tibetan graphic characters, which include modern Tibetan characters, Sanskrit-transliteration Tibetan characters, archaic Tibetan characters, and special Tibetan symbols. A modern Tibetan word is a Tibetan word that conforms to the rules of modern Tibetan grammar; it excludes Tibetan characters used to write the sounds of ancient Indian languages and archaic Tibetan characters that do not conform to modern grammar.
1. General structure of modern Tibetan words
Every Tibetan word is built around one consonant as its core; the remaining letters are attached before and after it or stacked above and below it to form a complete word. A modern Tibetan word has at least one consonant, that is, a base letter alone, and at most six consonants and one vowel sign. A vowel cannot be written on its own; it can only be added above or below a consonant. Any of the 30 consonants can serve as the core letter, which is called the "base letter"; the remaining letters are named after the position they occupy relative to the base letter. The letter before the base letter is the "prefix letter", the letter above it is the "superscript letter", the letter below it is the "subscript letter", the letter after it is the "suffix letter", and the letter added after the suffix letter is the "second suffix letter".
Tibetan words are spelled from 30 consonants and four vowel signs (vowels for short). Vertical stacking occurs above and below the base letter; the prefix, suffix, and second suffix letters are single, unstacked consonants. Modern Tibetan grammar places very strict constraints on how characters form a Tibetan word. A Tibetan word may have one to seven characters; the base letter is indispensable, and whether the other positions are filled varies from word to word. A Tibetan syllable consists of at most seven characters (as shown in Fig. 1), and a Tibetan word generally has only one vowel sign. Of the four vowels, the second is stacked below the consonant (block) (the lower circle in Fig. 1 represents this vowel), while the first, third, and fourth are stacked above the consonant (block) (the upper circle in Fig. 1 represents these vowels). Fig. 2 shows an example of a Tibetan word with 7 components.
2. Sorting of words
Sorting means arranging words by priority according to certain rules. It is an important prerequisite for building dictionaries, lookup, and similar work. Chinese characters are currently sorted mainly in two ways, by pronunciation and by stroke. English words are sorted by comparing, from left to right, the letters at the same position in two words; on a computer this amounts to directly comparing the code values of the word strings, which determines the order of the words. Sorting techniques for Chinese and English are mature, with relatively unified and complete standards in practical use.
3. Dictionary-order sorting of Tibetan
Sorting Tibetan words is a basic research topic in Tibetan information processing. Tibetan spelling differs from English and Chinese: it is a nonlinear combination of horizontal and vertical spelling. Neither the pronunciation-based and stroke-based methods used for Chinese characters nor the English method of comparing codes directly from left to right can be applied as is, so sorting Tibetan syllables presents certain difficulties.
Tibetan dictionary order is a comparatively well-founded way to collate Tibetan. It determines the order of Tibetan words by comparing the character at each position of the words, and the order of Tibetan text is in turn determined by the order of its words. Dictionary order is itself a human convention, but through long use it has become the collation order that people accept and are accustomed to.
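The difficulty described above can be illustrated with a minimal Python sketch. It assumes Unicode-encoded Tibetan, where the stacked word rka is stored as the letter RA (U+0F62) followed by subjoined KA (U+0F90); the variable names are illustrative and not from the patent. Dictionary order files rka under the base letter ka, ahead of kha, whereas a plain code-value comparison looks at the leading RA first.

```python
# rka = RA (U+0F62) + subjoined KA (U+0F90); kha = KHA (U+0F41)
rka = "\u0f62\u0f90"
kha = "\u0f41"

# English-style sorting: compare code values from left to right.
by_code_value = sorted([rka, kha])

# Dictionary order: rka belongs to base letter ka, so it precedes kha.
dictionary_order = [rka, kha]

# The two orders disagree, because U+0F41 (kha) < U+0F62 (ra).
print(by_code_value == dictionary_order)
```

The comparison has to start from the base letter, which is why the components must be identified before sorting.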
Prior art 1: In 1999, Zhaxi Ciren gave a preliminary account of Tibetan sorting in the article "Sorting rules of Tibetan and their implementation in automatic computer sorting". He proposed that, when comparing the priority of Tibetan syllables, the seven classes of components be compared one by one in the order base letter -> superscript letter -> prefix letter -> subscript letter -> vowel sign -> suffix letter -> second suffix letter.
That article sets out only the basic idea and does not cover the implementation, such as how each component of a syllable is identified or how missing components are handled. More importantly, although almost all Tibetan words consist of one to seven components, there are in fact eight kinds of component; the method does not consider Tibetan words that contain a "second subscript letter".
Prior art 2: In 2004, Jiang Di proposed a mathematical model and algorithm for sorting written Tibetan. A Tibetan syllable is divided into six component positions: base letter, prefix letter, superscript letter, subscript letter, suffix letter, and vowel. The sorting idea is as follows: each concrete character of a Tibetan word is assigned a number; when sorting, each syllable is first converted into a numeric sequence by consulting different tables, and the numeric sequences are then sorted.
Like prior art 1, this scheme lacks a description of how a Tibetan syllable is split into components. Converting syllables to numeric sequences makes the final sort convenient, but the conversion requires constant table lookups, which increases the program's load. Moreover, Unicode already defines a standard order for Tibetan characters, so custom lookup tables are unnecessary.
Prior art 3: In 2009, Bianba Wangdui et al. proposed a Tibetan collation scheme based on the ISO/IEC 10646 Tibetan coded character set standard. The Tibetan syllable is reduced to 6 components, and six rule functions are defined to identify the components of a Tibetan word. When sorting, priority is judged in the order "base letter - superscript letter - prefix letter - subscript letter - vowel - suffix letter".
That article merges three components, the suffix letter, the second suffix letter, and the second subscript letter, into one, while assuming during sorting that "the same component can occur only once in a syllable". These two rules conflict to some degree, so not all Tibetan words are identified accurately, which affects the sorting result.
The above are only descriptions in papers and documents, studied at the theoretical level; no software implementing them has been seen.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a method for dictionary-order sorting of Tibetan characters based on Tibetan character component identification. The components are identified with 100% accuracy; on that basis, the priority for sorting Tibetan characters is defined, so that the method sorts Tibetan characters correctly and arranges them in dictionary order.
The invention is achieved through the following technical solution: a method for dictionary-order sorting of Tibetan characters based on Tibetan character component identification, comprising the following steps:
S1. Analyze the word-formation structure of modern Tibetan words according to Tibetan grammar, concluding that Tibetan has 48 basic structures;
S2. Handle special structures first: first judge whether the character contains a special component; if it does, judge the structure of the special component from the number of characters in the structure and whether there is a vowel;
S3. Treat each vertically stacked, fixed combination block in Tibetan as a single unit. According to Tibetan structure, "superscript letter + base letter", "base letter + subscript letter", and "superscript letter + base letter + subscript letter" are taken as fixed structures for identifying components. The syllable being judged is looked up among these structures; if it is found there, its structure can be determined well and quickly. Three tables are then set up for handling the fixed structures and identifying special characters;
S4. Some Tibetan words with three components, no vowel, and no stacking are ambiguous; one more table is set up to give the 14 ambiguous characters special handling;
S5. Judge the components from whether the Tibetan character has a vowel and where the vowel is, split the character into components, and place the identified components into eight parts in the order "prefix letter - superscript letter - base letter - subscript letter - second subscript letter - vowel - suffix letter - second suffix letter";
S6. Determine the dictionary-order model for Tibetan characters: the core level, the first layer, is the base-letter layer, and the second to seventh layers are the superscript letter, prefix letter, subscript letter, vowel, suffix letter, and second suffix letter respectively;
S7. Define a TibetWord structure; each syllable read and its identified components are stored in one structure, whose storage is mainly used to hold the syllable and its components; then select a sorting method to perform the sort.
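Steps S6 and S7 can be sketched in Python as a record plus a sort key. This is a minimal illustration, not the patent's implementation: only the name TibetWord and the layer order come from the text. The field names, the `rank` function, the choice to sort an empty slot first, and the folding of Unicode subjoined letters (U+0F90 to U+0FBC) onto their base forms are all assumptions. The second subscript letter is left out because S6 lists only seven layers.

```python
from dataclasses import dataclass

# Comparison layers in the order given in S6.
LAYERS = ("base", "superscript", "prefix", "subscript",
          "vowel", "suffix", "second_suffix")

@dataclass
class TibetWord:
    syllable: str            # the syllable as read
    base: str                # the base letter is always present
    superscript: str = ""    # "" marks an empty component
    prefix: str = ""
    subscript: str = ""
    vowel: str = ""
    suffix: str = ""
    second_suffix: str = ""

def rank(ch):
    # Assumption: empty sorts first; letters rank by code point, with
    # subjoined forms (U+0F90..U+0FBC) folded onto their base letters.
    if not ch:
        return 0
    cp = ord(ch)
    return cp - 0x50 if 0x0F90 <= cp <= 0x0FBC else cp

def sort_key(word):
    # Tuples compare layer by layer, base letter first.
    return tuple(rank(getattr(word, layer)) for layer in LAYERS)

words = [
    TibetWord("kha", base="\u0f41"),
    TibetWord("rka", base="\u0f90", superscript="\u0f62"),
    TibetWord("ka", base="\u0f40"),
    TibetWord("kya", base="\u0f40", subscript="\u0fb1"),
]
ordered = [w.syllable for w in sorted(words, key=sort_key)]
print(ordered)
```

Any stable comparison sort can consume this key, which matches S7's "select a sorting method"; the romanized syllable labels above are only for readability.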
Preferably, S5 is as follows: when the Tibetan word has 1 character, that character must be the base letter, and the structure is "base letter" with every other position empty.
Preferably, S5 is as follows: when the Tibetan word has 2 characters, the components are identified as follows:
First judge whether the 2nd character is a vowel. If it is, the structure is "base + vowel". If there is no vowel, look up Table 1; if the word is found there, the structure is "superscript + base". If it is not in Table 1, look up Table 2 and judge the structure as "base + subscript".
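The two-character case can be sketched as follows. The function name, the dictionary result format, and the tiny tables in the usage lines are illustrative assumptions; the patent's Tables 1 and 2 are not reproduced in this text. The final fallback to "base + suffix" is also an assumption, since the text does not spell out what happens when neither table matches.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # vowel signs i, u, e, o

def identify_two(s, table1, table2):
    """Return a component -> character mapping for a 2-character word."""
    if s[1] in VOWELS:
        return {"base": s[0], "vowel": s[1]}
    if s in table1:                       # superscript + base stack
        return {"superscript": s[0], "base": s[1]}
    if s in table2:                       # base + subscript stack
        return {"base": s[0], "subscript": s[1]}
    # Assumption: the remaining case is base + suffix.
    return {"base": s[0], "suffix": s[1]}

# Illustrative one-entry tables: rka and kya.
table1 = {"\u0f62\u0f90"}
table2 = {"\u0f40\u0fb1"}
print(identify_two("\u0f40\u0f72", table1, table2))
```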
Preferably, S5 is as follows: when the Tibetan word has 3 characters, the components are identified as follows:
1) First judge whether the 2nd character is a vowel. If yes, the structure is "base + vowel + suffix"; if no, go to 2);
2) Look up Table 3. If the word is in the table, the structure is "superscript + base + subscript"; if it is not in Table 3, go to 3);
3) Look up Table 1 to judge whether the first 2 characters are in it. If they are, judge whether the 3rd character is a vowel: if it is, the structure is "superscript + base + vowel"; if it is not, the structure is "superscript + base + suffix". If the first 2 characters are not in Table 1, go to 4);
4) Look up the last 2 characters in Table 1. If they are found, the structure is "prefix + superscript + base"; if they are not in Table 1, go to 5);
5) Look up the first 2 characters in Table 2. If they are found, judge whether the 3rd character is a vowel, which gives "base + subscript + vowel" or "base + subscript + suffix" respectively; if they are not in Table 2, go to 6);
6) Look up the last 2 characters in Table 2. If they are found, the structure is "prefix + base + subscript"; if they are not in Table 2, go to 7);
7) Judge whether the 3rd character is a vowel. If it is, the structure is "prefix + base + vowel"; if it is not, go to 8);
8) Determine the structure by judging whether the three characters are one of the 17 special characters in Table 4.
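The eight steps above can be sketched as one cascade of lookups. This is an illustration under stated assumptions, not the patent's code: the tables are passed in as sets because their contents are not reproduced here, and the handling of step 8 is a guess. The sketch assumes that a word found in Table 4 is "base + suffix + second suffix", that a word whose first letter is not one of the five prefix letters (ga, da, ba, ma, 'a) has the same structure, and that everything else is "prefix + base + suffix".

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}               # i, u, e, o
PREFIXES = {"\u0f42", "\u0f51", "\u0f56", "\u0f58", "\u0f60"}   # ga da ba ma 'a

def fill(names, s):
    """Pair component names with the characters of s, in order."""
    return dict(zip(names.split(), s))

def identify_three(s, t1, t2, t3, t4):
    if s[1] in VOWELS:                                    # step 1
        return fill("base vowel suffix", s)
    if s in t3:                                           # step 2
        return fill("superscript base subscript", s)
    if s[:2] in t1:                                       # step 3
        last = "vowel" if s[2] in VOWELS else "suffix"
        return fill("superscript base " + last, s)
    if s[1:] in t1:                                       # step 4
        return fill("prefix superscript base", s)
    if s[:2] in t2:                                       # step 5
        last = "vowel" if s[2] in VOWELS else "suffix"
        return fill("base subscript " + last, s)
    if s[1:] in t2:                                       # step 6
        return fill("prefix base subscript", s)
    if s[2] in VOWELS:                                    # step 7
        return fill("prefix base vowel", s)
    # Step 8 (assumed outcome, see the note above the code).
    if s in t4 or s[0] not in PREFIXES:
        return fill("base suffix second_suffix", s)
    return fill("prefix base suffix", s)

# Illustrative one-entry tables; the Table 4 entry is a stand-in.
T1, T2 = {"\u0f62\u0f90"}, {"\u0f40\u0fb1"}
T3, T4 = {"\u0f66\u0f90\u0fb1"}, {"\u0f51\u0f42\u0f66"}
print(identify_three("\u0f51\u0f42\u0f60", T1, T2, T3, T4))
```

Because characters always fill the positions from left to right, each structure reduces to a list of component names, which keeps the cascade short.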
Preferably, S5 is as follows: when the Tibetan word has 4 characters, the structure is identified as follows:
1) First judge whether the 2nd character is a vowel. If it is, the structure is "base + vowel + suffix + second suffix"; if it is not, go to 2);
2) Look up Table 3 to judge whether the first 3 characters form a "superscript + base + subscript" structure. If they do, judge whether the 4th character is a vowel: if it is, the structure is "superscript + base + subscript + vowel"; if it is not, the structure is "superscript + base + subscript + suffix". If the first 3 characters are not in Table 3, go to 3);
3) Look up Table 1 to judge whether the first 2 characters form a "superscript + base" structure. If they do, judge whether the 3rd character is a vowel: if it is, the structure is "superscript + base + vowel + suffix"; if it is not, the structure is "superscript + base + suffix + second suffix". If the first 2 characters are not in Table 1, go to 4);
4) Look up the first 2 characters in Table 2. If they are found, judge whether the 3rd character is a vowel: if it is, the structure is "base + subscript + vowel + suffix"; if it is not, the structure is "base + subscript + suffix + second suffix". If the first 2 characters are not in Table 2, go to 5);
5) Look up the last 3 characters in Table 3. If they are found, the structure is "prefix + superscript + base + subscript"; if not, go to 6);
6) Look up the middle 2 characters in Table 2. If they are found, judge whether the 4th character is a vowel: if it is, the structure is "prefix + base + subscript + vowel"; if it is not, the structure is "prefix + base + subscript + suffix". If the middle 2 characters are not in Table 2, go to 7);
7) Determine the structure by judging whether the 3rd character is a vowel.
Preferably, S5 is as follows: when the Tibetan word has 5 characters, the structure is identified as follows:
1) Judge whether the 5th character is a vowel. If it is, the structure is "prefix + superscript + base + subscript + vowel"; if it is not, go to 2);
2) Judge whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3: if they are found, the structure is "superscript + base + subscript + vowel + suffix". If the first 3 characters are not in Table 3, look up characters 2 and 3 in Table 1: if they are found, the structure is "prefix + superscript + base + vowel + suffix"; if not, the structure is "prefix + base + subscript + vowel + suffix". If the 4th character is not a vowel, go to 3);
3) Judge whether the 3rd character is a vowel. If it is, look up the first 2 characters in Table 1: if they are found, the structure is "superscript + base + vowel + suffix + second suffix". If they are not in Table 1, look up Table 2: if they are found, the structure is "base + subscript + vowel + suffix + second suffix"; if they are not in Table 2, the structure is "prefix + base + vowel + suffix + second suffix". If the 3rd character is not a vowel, go to 4);
4) Look up the first 3 characters in Table 3. If they are found, the structure is "superscript + base + subscript + suffix + second suffix"; otherwise go to 5);
5) Look up the middle 3 characters in Table 3. If they are found, the structure is "prefix + superscript + base + subscript + suffix"; otherwise go to 6);
6) Look up characters 2 and 3 in Table 1. If they are found, the structure is "prefix + superscript + base + suffix + second suffix"; otherwise the structure is "prefix + base + subscript + suffix + second suffix".
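The six steps for five-character words can be sketched in Python as below. The function name, the result format, and the one-entry tables in the usage lines are assumptions for illustration; the patent's full Tables 1 to 3 are not reproduced in this text.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # vowel signs i, u, e, o

def fill(names, s):
    """Pair component names with the characters of s, in order."""
    return dict(zip(names.split(), s))

def identify_five(s, t1, t2, t3):
    if s[4] in VOWELS:                                    # step 1
        return fill("prefix superscript base subscript vowel", s)
    if s[3] in VOWELS:                                    # step 2
        if s[:3] in t3:
            return fill("superscript base subscript vowel suffix", s)
        if s[1:3] in t1:
            return fill("prefix superscript base vowel suffix", s)
        return fill("prefix base subscript vowel suffix", s)
    if s[2] in VOWELS:                                    # step 3
        if s[:2] in t1:
            return fill("superscript base vowel suffix second_suffix", s)
        if s[:2] in t2:
            return fill("base subscript vowel suffix second_suffix", s)
        return fill("prefix base vowel suffix second_suffix", s)
    if s[:3] in t3:                                       # step 4
        return fill("superscript base subscript suffix second_suffix", s)
    if s[1:4] in t3:                                      # step 5
        return fill("prefix superscript base subscript suffix", s)
    if s[1:3] in t1:                                      # step 6
        return fill("prefix superscript base suffix second_suffix", s)
    return fill("prefix base subscript suffix second_suffix", s)

# Illustrative one-entry tables: sga, gra, sgra.
T1, T2, T3 = {"\u0f66\u0f92"}, {"\u0f42\u0fb2"}, {"\u0f66\u0f92\u0fb2"}
sgrub = "\u0f66\u0f92\u0fb2\u0f74\u0f56"
print(identify_five(sgrub, T1, T2, T3))
```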
Preferably, S5 is as follows: when the Tibetan word has 6 characters, the structure is identified as follows:
1) Judge whether the 5th character is a vowel. If it is, the structure is "prefix + superscript + base + subscript + vowel + suffix"; otherwise go to 2);
2) Judge whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3: if they are found, the structure is "superscript + base + subscript + vowel + suffix + second suffix". If they are not in Table 3, look up characters 2 and 3 in Table 1: if they are found, the structure is "prefix + superscript + base + vowel + suffix + second suffix"; otherwise the structure is "prefix + base + subscript + vowel + suffix + second suffix". If the 4th character is not a vowel, the structure is "prefix + superscript + base + subscript + suffix + second suffix".
Preferably, S5 is as follows: when the Tibetan word has 7 characters, the structure is "prefix + superscript + base + subscript + vowel + suffix + second suffix".
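The six- and seven-character cases are short enough to sketch together. As before, the function names, the result format, and the small tables are illustrative assumptions rather than the patent's code.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # vowel signs i, u, e, o

def fill(names, s):
    """Pair component names with the characters of s, in order."""
    return dict(zip(names.split(), s))

def identify_six(s, t1, t3):
    if s[4] in VOWELS:                                    # step 1
        return fill("prefix superscript base subscript vowel suffix", s)
    if s[3] in VOWELS:                                    # step 2
        if s[:3] in t3:
            return fill("superscript base subscript vowel suffix second_suffix", s)
        if s[1:3] in t1:
            return fill("prefix superscript base vowel suffix second_suffix", s)
        return fill("prefix base subscript vowel suffix second_suffix", s)
    # No vowel in position 4 or 5: every consonant position is filled.
    return fill("prefix superscript base subscript suffix second_suffix", s)

def identify_seven(s):
    # Seven characters leave only the second subscript position empty.
    return fill("prefix superscript base subscript vowel suffix second_suffix", s)

# Illustrative one-entry tables: sga and sgra.
T1, T3 = {"\u0f66\u0f92"}, {"\u0f66\u0f92\u0fb2"}
bsgrubs = "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66"
print(identify_seven(bsgrubs))
```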
Compared with the prior art, the invention has the following beneficial effects:
1. Correct identification of Tibetan character structure. Because all modern Tibetan words fall within 48 structures, identifying the structure of a Tibetan character with the proposed method can reach 100% accuracy. 2. A dictionary-order sorting method for Tibetan characters. Once the components have been identified, Tibetan characters can be sorted with the proposed method, and the result conforms to Tibetan dictionary order. The method can be widely applied to sorting Tibetan data on computers, arranging Tibetan dictionaries, and so on.
Brief description of the drawings
Fig. 1 is a diagram of the structural pattern;
Fig. 2 is an example of a Tibetan word with 7 components;
Fig. 3 is the identification flow for syllables with one component;
Fig. 4 is the identification flow for syllables with two components;
Fig. 5 is the identification flow for syllables with three components;
Fig. 6 is the identification flow for syllables with four components;
Fig. 7 is the identification flow for syllables with five components;
Fig. 8 is the identification flow for syllables with six components;
Fig. 9 is the hierarchy diagram of Tibetan word dictionary order;
Fig. 10 is the flow chart of the main function.
Detailed description of the invention
The invention is further described below with reference to the drawings.
The scheme of the invention consists of two steps: Tibetan character component identification and Tibetan character sorting. Component identification is the prerequisite for sorting: only after the components of a Tibetan character have been correctly identified can the characters be sorted.
I. Tibetan character component identification
The components of a Tibetan character are identified from the structure of modern Tibetan characters and the number of characters that make up the Tibetan character. The method is as follows:
1. The invention analyzes the word-formation structure of modern Tibetan words according to Tibetan grammar and concludes that Tibetan has 48 basic structures, as shown in Table 1.
Table 1: The 48 structures of modern Tibetan words
2. Handle "special structures" first
The implementation first judges whether the character contains a "special component". A "special component" is a "second subscript letter" beneath the "subscript letter". Although it conforms to modern Tibetan grammar, this structure mainly occurs only in 2 characters (glyphs not reproduced in this text) and in the character sets that contain these two characters. If a "special component" is present, its structure is then judged from the number of characters in the structure and whether there is a vowel.
3. The invention treats each vertically stacked, fixed combination block in Tibetan as a single unit.
Studying the structure of Tibetan syllables shows that grammar restricts the stacks "superscript letter + base letter", "base letter + subscript letter", and "superscript letter + base letter + subscript letter" very strictly; their number is limited and follows no regular pattern. These three structures are therefore taken as fixed structures, and the syllable being judged is looked up among them; if it is found, its structure can be determined well and quickly. Four tables are set up for handling the fixed structures and identifying special characters.
Table name                  Description
Table 1 shang_ji[33]        Superscript letter + base letter
Table 2 ji_xia[36]          Base letter + subscript letter
Table 3 shang_ji_xia[15]    Superscript letter + base letter + subscript letter
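A minimal Python sketch of how such lookup tables might be held in memory follows. The names shang_ji, ji_xia, and shang_ji_xia come from the table above; everything else is an assumption. In particular, the entries shown are a few well-known stacks (rka, lka, ska; kya, kra, kla; rkya, skya, sgra) chosen for illustration, not the patent's full lists of 33, 36, and 15 entries, and `stack_kind` is a hypothetical helper.

```python
# Illustrative entries only, encoded as Unicode letter + subjoined letter(s).
shang_ji = frozenset({"\u0f62\u0f90", "\u0f63\u0f90", "\u0f66\u0f90"})
ji_xia = frozenset({"\u0f40\u0fb1", "\u0f40\u0fb2", "\u0f40\u0fb3"})
shang_ji_xia = frozenset({"\u0f62\u0f90\u0fb1", "\u0f66\u0f90\u0fb1",
                          "\u0f66\u0f92\u0fb2"})

def stack_kind(chars):
    """Name the fixed structure that chars matches, or None."""
    if chars in shang_ji_xia:
        return "superscript+base+subscript"
    if chars in shang_ji:
        return "superscript+base"
    if chars in ji_xia:
        return "base+subscript"
    return None

print(stack_kind("\u0f66\u0f90\u0fb1"))
```

Sets give constant-time membership tests, which suits a design where every syllable is checked against the fixed structures before any other rule is applied.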
4. Special handling of the 14 "ambiguous" characters
Some Tibetan words with 3 components are "ambiguous": such a word can be identified either as "prefix letter + base letter + suffix letter" or as "base letter + suffix letter + second suffix letter". The algorithm needs special handling for syllables of this kind. Manual collation found 14 ambiguous special syllables in total, as shown in the table below; the algorithm treats all 14 syllables as "base letter + suffix letter + second suffix letter".
5. Use the special role of vowels to identify the components of a Tibetan character
The invention also makes full use of the 4 vowels, a special component of Tibetan characters, judging the components from whether a Tibetan character has a vowel and where the vowel is. The character is split into components, and the identified components are placed into eight parts in the order "prefix letter - superscript letter - base letter - subscript letter - second subscript letter - vowel - suffix letter - second suffix letter" (a Tibetan character generally has 7 components; only the "special component" described above adds a "second subscript letter"). Missing components are filled with "empty", "0", or "NULL".
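The eight-part placement can be sketched as a small padding step. The slot names and the `to_slots` helper are assumptions for illustration; `None` stands in for the "empty", "0", or "NULL" filler mentioned in the text.

```python
# The eight parts, in the order given in the text.
SLOTS = ("prefix", "superscript", "base", "subscript",
         "second_subscript", "vowel", "suffix", "second_suffix")

def to_slots(found):
    """Place identified components into eight parts, padding gaps with None."""
    return [found.get(name) for name in SLOTS]

# ki = KA (U+0F40) + vowel sign i (U+0F72): only base and vowel are present.
print(to_slots({"base": "\u0f40", "vowel": "\u0f72"}))
```

A fixed-length list keeps every word the same shape, so two words can be compared position by position regardless of how many components each one actually has.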
Specifically, the detailed procedure is as follows:
The invention identifies the components of a Tibetan character from the structure of modern Tibetan characters and the number of characters in the Tibetan character. The flow chart is shown in Fig. 3.
After setting aside "special components", the invention divides Tibetan characters by character count into 7 cases and handles each separately; the 7 cases correspond to the 1 to 7 components that make up a Tibetan character. The components are identified in the seven cases as follows:
1. If the Tibetan word has 1 character, that character must be the base letter, and the structure is "base letter" with every other position empty.
2. If the Tibetan word has 2 characters, the components are identified as follows:
First judge whether the 2nd character is a vowel. If it is, the structure is "base + vowel". If there is no vowel, look up Table 1; if the word is found there, the structure is "superscript + base". If it is not in Table 1, look up Table 2 and judge the structure as "base + subscript". The flow chart is shown in Fig. 4.
3. If the Tibetan character consists of 3 characters, the components are identified as follows:
1) First determine whether the 2nd character is a vowel. If yes, the structure is expressed as "empty empty char1 empty empty char2 char3"; if no, go to step 2);
2) Look up Table 3. If the character is in the table, the structure is judged to be "upper+base+lower"; if it is not in Table 3, go to step 3);
3) Look up Table 1 to determine whether the first 2 characters are in it. If they are, determine whether the 3rd character is a vowel: if it is, the structure is expressed as "empty char1 char2 empty empty char3"; if it is not, the structure is expressed as "empty char1 char2 empty empty empty char3". If the first 2 characters are not in Table 1, go to step 4);
4) Look up the last 2 characters in Table 1. If they are found, the structure is "char1 char2 char3"; if they are not in Table 1, go to step 5);
5) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel, which gives "empty empty char1 char2 empty char3" (vowel) or "empty empty char1 char2 empty empty char3" (no vowel); if they are not in Table 2, go to step 6);
6) Look up the last 2 characters in Table 2. If they are found, the structure is "char1 empty char2 char3"; if they are not in Table 2, go to step 7);
7) Determine whether the 3rd character is a vowel. If it is, the structure is expressed as "char1 empty char2 empty empty char3"; if it is not, go to step 8);
8) Determine the structure by checking whether the three characters form one of the 17 special characters listed in Table 4.
The flow chart is shown in Fig. 5.
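Steps 1) to 8) of the three-character case form a fixed cascade of tests, which the following Python sketch tries to mirror. It is a hedged illustration: the tables are passed in as parameters because their contents are not reproduced in this text, the vowel set is an assumption, and `layout` and `identify_three` are hypothetical names. An empty string stands for an "empty" position.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # assumed vowel signs i, u, e, o


def layout(prefix="", sup="", base="", sub="", sub2="",
           vowel="", suffix="", suffix2=""):
    """Eight-part layout; an empty string stands for an empty position."""
    return (prefix, sup, base, sub, sub2, vowel, suffix, suffix2)


def identify_three(w, table1, table2, table3, table4):
    """Apply steps 1) to 8) to a 3-character Tibetan character w.

    table1: superscript+base pairs, table2: base+subscript pairs,
    table3: superscript+base+subscript triples (sets of strings);
    table4: dict mapping each special character to its layout.
    """
    c1, c2, c3 = w
    if c2 in VOWELS:                                   # step 1
        return layout(base=c1, vowel=c2, suffix=c3)
    if w in table3:                                    # step 2
        return layout(sup=c1, base=c2, sub=c3)
    if w[:2] in table1:                                # step 3
        if c3 in VOWELS:
            return layout(sup=c1, base=c2, vowel=c3)
        return layout(sup=c1, base=c2, suffix=c3)
    if w[1:] in table1:                                # step 4
        return layout(prefix=c1, sup=c2, base=c3)
    if w[:2] in table2:                                # step 5
        if c3 in VOWELS:
            return layout(base=c1, sub=c2, vowel=c3)
        return layout(base=c1, sub=c2, suffix=c3)
    if w[1:] in table2:                                # step 6
        return layout(prefix=c1, base=c2, sub=c3)
    if c3 in VOWELS:                                   # step 7
        return layout(prefix=c1, base=c2, vowel=c3)
    return table4.get(w)                               # step 8; None if absent
```

The order of the tests matters: the vowel check and the stacked-structure tables are tried before the prefix interpretations, so the special-character table is only consulted for the residue that the earlier steps cannot decide.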
4. If the Tibetan character consists of 4 characters, the structure is identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty empty char1 empty empty char2 char3 char4"; if it is not, go to step 2);
2) Look up Table 3 to determine whether the first 3 characters have the structure "upper+base+lower". If they do, determine whether the 4th character is a vowel: if it is, the structure is "empty char1 char2 char3 empty char4"; if it is not, the structure is "empty char1 char2 char3 empty empty char4". If the first 3 characters are not in Table 3, go to step 3);
3) Look up Table 1 to determine whether the first 2 characters have the structure "upper+base". If they do, determine whether the 3rd character is a vowel: if it is, the structure is "empty char1 char2 empty empty char3 char4"; if it is not, the structure is "empty char1 char2 empty empty empty char3 char4". If the first 2 characters are not in Table 1, go to step 4);
4) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty empty char1 char2 empty char3 char4"; if it is not, the structure is "empty empty char1 char2 empty empty char3 char4". If the first 2 characters are not in Table 2, go to step 5);
5) Look up the last 3 characters in Table 3. If they are found, the structure is "char1 char2 char3 char4"; if not, go to step 6);
6) Look up the middle 2 characters in Table 2. If they are found, determine whether the 4th character is a vowel: if it is, the structure is "char1 empty char2 char3 empty char4"; if it is not, the structure is "char1 empty char2 char3 empty empty char4". If the middle 2 characters are not in Table 2, go to step 7);
7) Determine the structure by checking whether the 3rd character is a vowel.
The detailed flow chart is shown in Fig. 6.
5. If the Tibetan character consists of 5 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char1 char2 char3 char4 empty char5"; if it is not, go to step 2);
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty char1 char2 char3 empty char4 char5". If the first 3 characters are not in Table 3, look up characters 2 and 3 in Table 1; if they are found, the structure is "char1 char2 char3 empty empty char4 char5"; if they are not in Table 1, the structure is "char1 empty char2 char3 empty char4 char5". If the 4th character is not a vowel, go to step 3);
3) Determine whether the 3rd character is a vowel. If it is, look up the first 2 characters in Table 1; if they are found, the structure is "empty char1 char2 empty empty char3 char4 char5". If they are not in Table 1, look up Table 2; if they are found, the structure is "empty empty char1 char2 empty char3 char4 char5"; if they are not in Table 2, the structure is "char1 empty char2 empty empty char3 char4 char5". If the 3rd character is not a vowel, go to step 4);
4) Look up the first 3 characters in Table 3. If they are found, the structure is "empty char1 char2 char3 empty empty char4 char5"; otherwise go to step 5);
5) Look up the middle 3 characters in Table 3. If they are found, the structure is "char1 char2 char3 char4 empty empty char5"; otherwise go to step 6);
6) Look up characters 2 and 3 in Table 1. If they are found, the structure is "char1 char2 char3 empty empty empty char4 char5"; otherwise the structure is "char1 empty char2 char3 empty empty char4 char5".
The detailed flow chart is shown in Fig. 7.
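The five-character case is fully determined by the position of the vowel and three table lookups, so it can be written as one decision tree. The Python sketch below is illustrative only: the table contents are hypothetical parameters, the vowel set is assumed, and `identify_five` is an invented name. Each returned tuple lists the eight positions in the order prefix, superscript, base letter, subscript, second subscript, vowel, suffix, second suffix, with an empty string for an empty position.

```python
VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}  # assumed vowel signs i, u, e, o


def identify_five(w, table1, table2, table3):
    """Apply steps 1) to 6) to a 5-character Tibetan character w."""
    c1, c2, c3, c4, c5 = w
    if c5 in VOWELS:                                   # step 1
        return (c1, c2, c3, c4, "", c5, "", "")
    if c4 in VOWELS:                                   # step 2
        if w[:3] in table3:
            return ("", c1, c2, c3, "", c4, c5, "")
        if w[1:3] in table1:
            return (c1, c2, c3, "", "", c4, c5, "")
        return (c1, "", c2, c3, "", c4, c5, "")
    if c3 in VOWELS:                                   # step 3
        if w[:2] in table1:
            return ("", c1, c2, "", "", c3, c4, c5)
        if w[:2] in table2:
            return ("", "", c1, c2, "", c3, c4, c5)
        return (c1, "", c2, "", "", c3, c4, c5)
    if w[:3] in table3:                                # step 4
        return ("", c1, c2, c3, "", "", c4, c5)
    if w[1:4] in table3:                               # step 5
        return (c1, c2, c3, c4, "", "", c5, "")
    if w[1:3] in table1:                               # step 6
        return (c1, c2, c3, "", "", "", c4, c5)
    return (c1, "", c2, c3, "", "", c4, c5)
```

Unlike the three-character case, every path here ends in a definite structure, so the function never needs a special-character table.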
6. If the Tibetan character consists of 6 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char1 char2 char3 char4 empty char5 char6"; otherwise go to step 2);
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty char1 char2 char3 empty char4 char5 char6". If they are not in Table 3, look up characters 2 and 3 in Table 1; if they are found, the structure is "char1 char2 char3 empty empty char4 char5 char6"; otherwise the structure is "char1 empty char2 char3 empty char4 char5 char6". If the 4th character is not a vowel, the structure is "char1 char2 char3 char4 empty empty char5 char6".
The detailed flow chart is shown in Fig. 8.
7. If the Tibetan character consists of 7 characters, the structure is "char1 char2 char3 char4 empty char5 char6 char7".
Two. Sorting of Tibetan characters
1. The sorting model for the dictionary order of Tibetan characters determined by the present invention
Investigation shows that, because Tibetan dictionary order is an artificially prescribed ordering, different dictionaries prescribe somewhat different orderings of characters. An analysis of the ordering used in dictionaries such as the Tibetan-Chinese Dictionary shows that the dictionary order of Tibetan characters is a layered cycle. As shown in Fig. 9, the layers are as follows:
The innermost layer, i.e. the first layer, is the base-letter layer; the base letter is the basic and indispensable component of every Tibetan character. The second to seventh layers are, respectively, the superscript, prefix, subscript, vowel, suffix and second suffix. The characters on the second to seventh layers are not indispensable components of a Tibetan character: depending on the character, these components may be absent, and an absent component is represented by 0 in the figure.
The dictionary order of modern Tibetan characters takes the base letter as the core and combines it layer by layer with the characters of layers two to seven, each layer being combined in turn with the characters of the layers outside it; the consonant components are ordered according to Tibetan alphabetical order. For example, the first character in dictionary order is [glyph missing], combined with 0 on the other six layers; the second character is [glyph missing], combined with 0 on the second to fifth layers and with [glyph missing] on the sixth layer. The seventh-layer second suffix can only be added after a suffix, so a suffix standing alone can be regarded as the suffix combined with a second suffix of 0. By analogy, the characters of the dictionary should be ordered as follows (if any): [sequence missing]
2. Implementation of the Tibetan character sorting determined by the present invention
(1) Storage of Tibetan characters in the computer
Tibetan sorting compares the structures of Tibetan syllable characters, so after a Tibetan character is read from the text its components must first be identified. The components and the Tibetan syllable character together are treated as one element, so a TibetWord structure is defined, and the syllable read and the components identified are stored in one structure. The storage space is mainly used to hold the syllable and its components. The structure array is defined as follows:
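The structure definition itself is not reproduced in this text. As a rough idea of what such an element might hold, the following Python sketch stores the syllable together with its eight components. Only the name TibetWord comes from the text; every field name, the use of an empty string for an absent component, and the `components` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TibetWord:
    """One sorting element: the syllable as read plus its identified components."""
    syllable: str               # the Tibetan character as read from the text
    prefix: str = ""            # empty string = component absent
    superscript: str = ""
    base: str = ""
    subscript: str = ""
    second_subscript: str = ""
    vowel: str = ""
    suffix: str = ""
    second_suffix: str = ""

    def components(self):
        """Components in the eight-part order used for placement."""
        return (self.prefix, self.superscript, self.base, self.subscript,
                self.second_subscript, self.vowel, self.suffix,
                self.second_suffix)


# Example: base letter ka (U+0F40) with vowel sign i (U+0F72).
words = [TibetWord("\u0f40\u0f72", base="\u0f40", vowel="\u0f72")]
```

Keeping the original syllable next to the components means the sorted output can be written back without reassembling the character from its parts.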
(2) Sorting method for Tibetan characters
A sorting algorithm is selected to perform the sort. The comparison used when sorting Tibetan characters proceeds as follows:
1. Compare the base letters, i.e. position 3. If they are unequal, the comparison ends and the result of the base-letter comparison is returned; otherwise go to step 2.
2. Compare the superscripts, i.e. position 2. If they are unequal, the comparison ends and the result of the superscript comparison is returned; otherwise go to step 3.
3. Compare the prefixes, i.e. position 1. If they are unequal, the comparison ends and the result of the prefix comparison is returned; otherwise go to step 4.
4. Compare the subscripts, i.e. position 4. If they are unequal, the comparison ends and the result of the subscript comparison is returned; otherwise go to step 5.
5. Compare the second subscripts, i.e. position 5. If they are unequal, the comparison ends and the result of the second-subscript comparison is returned; otherwise go to step 6.
6. Compare the vowels, i.e. position 6. If they are unequal, the comparison ends and the result of the vowel comparison is returned; otherwise go to step 7.
7. Compare the suffixes, i.e. position 7. If they are unequal, the comparison ends and the result of the suffix comparison is returned; otherwise go to step 8.
8. Compare the second suffixes, i.e. position 8, return the result of the second-suffix comparison, and end the comparison.
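The eight comparison steps amount to a lexicographic comparison over the positions in the order 3, 2, 1, 4, 5, 6, 7, 8. The Python sketch below shows one way to express that. The ranking of individual components is an assumption, since the text only says that consonants follow Tibetan alphabetical order: here an empty position ranks first, other characters rank by Unicode code point, and subjoined consonants (U+0F90 to U+0FB9) are folded onto their full forms so that a stacked base letter compares equal to the same letter unstacked. The names `COMPARE_ORDER`, `rank` and `compare` are hypothetical.

```python
# Comparison order from steps 1 to 8, as 0-based indexes into the
# eight-part layout (prefix, superscript, base, subscript,
# second subscript, vowel, suffix, second suffix).
COMPARE_ORDER = (2, 1, 0, 3, 4, 5, 6, 7)


def rank(ch):
    """Assumed ranking: empty sorts first, then code point order,
    with subjoined consonants folded onto their full forms."""
    if not ch:
        return 0
    cp = ord(ch)
    if 0x0F90 <= cp <= 0x0FB9:
        cp -= 0x50
    return cp


def compare(a, b):
    """Return -1, 0 or 1 for two eight-part layouts a and b."""
    for i in COMPARE_ORDER:
        ra, rb = rank(a[i]), rank(b[i])
        if ra != rb:                 # the first unequal component decides
            return -1 if ra < rb else 1
    return 0
```

Because the loop returns at the first unequal component, later positions are only examined when all earlier ones tie, which is exactly the early-exit behaviour the steps describe.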
(3) Application in a concrete computer sorting method
Different sorting algorithms give different function flow charts. The flow chart for sorting Tibetan characters with merge sort is shown in Fig. 10.
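As an illustration of how merge sort could be combined with the component-wise comparison, the following Python sketch sorts eight-part layouts (prefix, superscript, base letter, subscript, second subscript, vowel, suffix, second suffix) using a key tuple built in the comparison order base letter, superscript, prefix, then the remaining positions. It is a simplified sketch rather than the flow chart of Fig. 10: `sort_key` and `merge_sort` are hypothetical names, empty positions are ranked as 0, and other components are ranked by raw Unicode code point, which is an assumption.

```python
def sort_key(layout):
    """Key tuple in comparison order: base, superscript, prefix, then the rest."""
    order = (2, 1, 0, 3, 4, 5, 6, 7)
    return tuple(ord(layout[i]) if layout[i] else 0 for i in order)


def merge_sort(items, key):
    """Return a new list sorted by key, using top-down merge sort."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):   # take from the left on ties (stable)
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])                # append whichever side remains
    merged.extend(right[j:])
    return merged


# Layouts for ka, ka+suffix ga, kha and ga, written as code points.
ka = ("", "", "\u0f40", "", "", "", "", "")
kag = ("", "", "\u0f40", "", "", "", "\u0f42", "")
kha = ("", "", "\u0f41", "", "", "", "", "")
ga = ("", "", "\u0f42", "", "", "", "", "")
ordered = merge_sort([ga, kag, kha, ka], sort_key)
```

Merge sort is stable, so characters whose components compare equal keep their input order, which can matter when the same character occurs more than once in the text being sorted.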

Claims (8)

1. the method for a Tibetan language character dictionary based on Tibetan language character component identification technology sequence, it is characterised in that include with The structure word structure of Modern Tibetan word is analyzed by lower step: S1. according to the Tibetan language syntax, show that Tibetan language has 48 kinds of basic structures;S2. Priority treatment special construction, first determines whether whether to contain in this character special component syllable, if there being special component, according still further to this Character number in structure and judge the structure of this special component with or without vowel;S3. the combination block of longitudinally fixed for Tibetan language superposition As a disposed of in its entirety, according to the structure of Tibetan language, " upper word adding+base word ", " base word+down word adding ", " upper word adding+base word+under Add word " as fixing structure recognition Tibetan language character component, current syllable to be judged is searched in these structures, if This structure finds the structure that just can judge this syllable very well, soon, then sets up 3 tables, be used for processing fixed structure and knowledge Other spcial character;S4. there are some to have ambiguity the Tibetan word without vowel, three components not having superposition, resettle 1 table Ambiguous 14 characters are carried out special handling;S5. judge component from Tibetan language character with or without the position of vowel and vowel, enter Row component splits, by the component of Tibetan language character that identifies according to " down word adding-the unit of pre-script-upper word adding-base word-down word adding-again The back word adding of sound-back word adding-again " eight parts place;S6. the order models of Tibetan language character words canonical ordering, most crucial level are determined I.e. 
ground floor is base word layer, and is upper word adding, pre-script, down word adding, vowel, back word adding and more respectively from the second layer to layer 7 Back word adding;S7. one TibetWord structure of definition, is stored in the component of the syllable read and identification in one structure, deposits Storage space is mainly used to deposit syllable and component, selects a kind of sort method to be ranked up.
2. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 1, characterized in that S5 is specifically as follows: when the Tibetan character consists of 1 character, that character must be the base letter, and the structure is expressed as "empty empty char1".
3. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 1 or 2, characterized in that S5 is specifically as follows: when the Tibetan character consists of 2 characters, the components are identified as follows:
First determine whether the 2nd character is a vowel. If it is, the structure is expressed as "empty empty char1 empty empty char2". If there is no vowel, look up Table 1; if the character is found there, the structure is judged to be "upper+base". If it is not in Table 1, look up Table 2, and the structure is judged to be "base+lower".
4. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 3, characterized in that S5 is specifically as follows: when the Tibetan character consists of 3 characters, the components are identified as follows:
1) First determine whether the 2nd character is a vowel. If yes, the structure is expressed as "empty empty char1 empty empty char2 char3"; if no, go to step 2);
2) Look up Table 3. If the character is in the table, the structure is judged to be "upper+base+lower"; if it is not in Table 3, go to step 3);
3) Look up Table 1 to determine whether the first 2 characters are in it. If they are, determine whether the 3rd character is a vowel: if it is, the structure is expressed as "empty char1 char2 empty empty char3"; if it is not, the structure is expressed as "empty char1 char2 empty empty empty char3". If the first 2 characters are not in Table 1, go to step 4);
4) Look up the last 2 characters in Table 1. If they are found, the structure is "char1 char2 char3"; if they are not in Table 1, go to step 5);
5) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel, which gives "empty empty char1 char2 empty char3" (vowel) or "empty empty char1 char2 empty empty char3" (no vowel); if they are not in Table 2, go to step 6);
6) Look up the last 2 characters in Table 2. If they are found, the structure is "char1 empty char2 char3"; if they are not in Table 2, go to step 7);
7) Determine whether the 3rd character is a vowel. If it is, the structure is expressed as "char1 empty char2 empty empty char3"; if it is not, go to step 8);
8) Determine the structure by checking whether the three characters form one of the 17 special characters listed in Table 4.
5. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 4, characterized in that S5 is specifically as follows: when the Tibetan character consists of 4 characters, the structure is identified as follows:
1) First determine whether the 2nd character is a vowel. If it is, the structure is "empty empty char1 empty empty char2 char3 char4"; if it is not, go to step 2);
2) Look up Table 3 to determine whether the first 3 characters have the structure "upper+base+lower". If they do, determine whether the 4th character is a vowel: if it is, the structure is "empty char1 char2 char3 empty char4"; if it is not, the structure is "empty char1 char2 char3 empty empty char4". If the first 3 characters are not in Table 3, go to step 3);
3) Look up Table 1 to determine whether the first 2 characters have the structure "upper+base". If they do, determine whether the 3rd character is a vowel: if it is, the structure is "empty char1 char2 empty empty char3 char4"; if it is not, the structure is "empty char1 char2 empty empty empty char3 char4". If the first 2 characters are not in Table 1, go to step 4);
4) Look up the first 2 characters in Table 2. If they are found, determine whether the 3rd character is a vowel: if it is, the structure is "empty empty char1 char2 empty char3 char4"; if it is not, the structure is "empty empty char1 char2 empty empty char3 char4". If the first 2 characters are not in Table 2, go to step 5);
5) Look up the last 3 characters in Table 3. If they are found, the structure is "char1 char2 char3 char4"; if not, go to step 6);
6) Look up the middle 2 characters in Table 2. If they are found, determine whether the 4th character is a vowel: if it is, the structure is "char1 empty char2 char3 empty char4"; if it is not, the structure is "char1 empty char2 char3 empty empty char4". If the middle 2 characters are not in Table 2, go to step 7);
7) Determine the structure by checking whether the 3rd character is a vowel.
6. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 5, characterized in that S5 is specifically as follows: when the Tibetan character consists of 5 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char1 char2 char3 char4 empty char5"; if it is not, go to step 2);
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty char1 char2 char3 empty char4 char5". If the first 3 characters are not in Table 3, look up characters 2 and 3 in Table 1; if they are found, the structure is "char1 char2 char3 empty empty char4 char5"; if they are not in Table 1, the structure is "char1 empty char2 char3 empty char4 char5". If the 4th character is not a vowel, go to step 3);
3) Determine whether the 3rd character is a vowel. If it is, look up the first 2 characters in Table 1; if they are found, the structure is "empty char1 char2 empty empty char3 char4 char5". If they are not in Table 1, look up Table 2; if they are found, the structure is "empty empty char1 char2 empty char3 char4 char5"; if they are not in Table 2, the structure is "char1 empty char2 empty empty char3 char4 char5". If the 3rd character is not a vowel, go to step 4);
4) Look up the first 3 characters in Table 3. If they are found, the structure is "empty char1 char2 char3 empty empty char4 char5"; otherwise go to step 5);
5) Look up the middle 3 characters in Table 3. If they are found, the structure is "char1 char2 char3 char4 empty empty char5"; otherwise go to step 6);
6) Look up characters 2 and 3 in Table 1. If they are found, the structure is "char1 char2 char3 empty empty empty char4 char5"; otherwise the structure is "char1 empty char2 char3 empty empty char4 char5".
7. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 6, characterized in that S5 is specifically as follows: when the Tibetan character consists of 6 characters, the structure is identified as follows:
1) Determine whether the 5th character is a vowel. If it is, the structure is "char1 char2 char3 char4 empty char5 char6"; otherwise go to step 2);
2) Determine whether the 4th character is a vowel. If it is, look up the first 3 characters in Table 3; if they are found, the structure is "empty char1 char2 char3 empty char4 char5 char6". If they are not in Table 3, look up characters 2 and 3 in Table 1; if they are found, the structure is "char1 char2 char3 empty empty char4 char5 char6"; otherwise the structure is "char1 empty char2 char3 empty char4 char5 char6". If the 4th character is not a vowel, the structure is "char1 char2 char3 char4 empty empty char5 char6".
8. The method for sorting Tibetan characters in dictionary order based on Tibetan character component identification according to claim 7, characterized in that S5 is specifically as follows: when the Tibetan character consists of 7 characters, the structure is "char1 char2 char3 char4 empty char5 char6 char7".
CN201610662188.5A 2016-08-12 2016-08-12 The method of Tibetan language character dictionary based on Tibetan language character component identification technology sequence Pending CN106250357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610662188.5A CN106250357A (en) 2016-08-12 2016-08-12 The method of Tibetan language character dictionary based on Tibetan language character component identification technology sequence

Publications (1)

Publication Number Publication Date
CN106250357A true CN106250357A (en) 2016-12-21

Family

ID=57591745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610662188.5A Pending CN106250357A (en) 2016-08-12 2016-08-12 The method of Tibetan language character dictionary based on Tibetan language character component identification technology sequence

Country Status (1)

Country Link
CN (1) CN106250357A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818673A (en) * 2021-01-28 2021-05-18 青海民族大学 Tibetan character component identification method based on segmentation and structure
CN112818640A (en) * 2021-01-28 2021-05-18 青海民族大学 Tibetan ordering method based on hash function
CN112818673B (en) * 2021-01-28 2023-08-22 青海民族大学 Tibetan character component recognition method based on segmentation and structure

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161221

RJ01 Rejection of invention patent application after publication