CN104408037A

CN104408037A - Tibetan text vector model representation method

Info

Publication number: CN104408037A
Application number: CN201410734163.2A
Authority: CN
Inventors: 才智杰; 才让卓玛
Original assignee: Individual
Current assignee: Individual
Priority date: 2014-12-05
Filing date: 2014-12-05
Publication date: 2015-03-11

Abstract

The invention relates to the technical field of Tibetan information processing and provides a vector model representation method for Tibetan text. The method comprises a vector model representation of Tibetan characters and a vector model representation of Tibetan character strings. A Tibetan character is represented by an element of the vector set T = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2}. A character string consists mainly of several Tibetan characters, each of which corresponds, under the character vector model, to a one-dimensional vector with seven components. Taking the transpose of each character's vector in turn as a column yields the vector model of the string, Γ = (tᵢⱼ). Representing Tibetan text with vector models converts it into a form that is easy to process, which simplifies computations and operations on the text and makes Tibetan information processing more efficient.

Description

Vector model representation method for Tibetan text

Technical field

The present invention relates to the technical field of Tibetan information processing, and in particular to a vector model representation method for Tibetan text.

Background art

The Tibetans are one of China's ancient ethnic groups, living mainly in five provinces and regions: Tibet, Qinghai, Gansu, Yunnan and Sichuan. They have a long history and a rich culture. The Tibetan script is the writing system of the Tibetan language. Since its creation in the 7th century AD it has undergone three large-scale standardizations and has been steadily refined; a Tibetan grammar was formulated, embodying the wisdom of many great thinkers of successive generations. The script has contributed to the enrichment and development of the Tibetan language, served as a bridge for cultural exchange between Tibetans and other ethnic groups, helped build a cultural line of defence in the country's border regions, and produced a vast body of documents and literature. Users of the Tibetan script are found in Tibet, Qinghai, Sichuan, Gansu and Yunnan in China, and in countries and regions such as Pakistan, India, Nepal and Bhutan. The script is widely used, with more than 7 million users.

Tibetan information processing is an interdisciplinary field combining Tibetan linguistics and computer technology, and an important part of Tibetan language research. In the information age it is of great significance to China's political, economic and cultural development. Driven by the Party's policies on ethnic minorities and minority languages, Tibetan information processing technology has developed rapidly, with gratifying results in character encoding, fonts, input methods, word segmentation and corpus construction, laying a solid foundation for deeper work. Comparatively speaking, Tibetan information processing started late. Early work concentrated on character encoding and fonts, and was largely completed some years ago once the national and international standards were established. In recent years research has gradually moved into the text analysis stage: lexical, syntactic, semantic and pragmatic analysis. For a machine to analyze and understand Tibetan text, a representation of the text must be chosen. A sound representation facilitates computations and operations on Tibetan text, and with the arrival of the big-data era the representation technique has become particularly important. Research on representing Tibetan text, however, is at present essentially blank or at an initial stage.

Tibetan text is made up of Tibetan characters, Tibetan numerals and Tibetan punctuation marks. Tibetan is an alphabetic script with 30 consonants and 4 vowels. A character is built from a root letter (base consonant), prefix, superscript, subscript, suffix, second suffix and vowel: the prefix, root letter, suffix and second suffix are written horizontally, while the superscript, root letter, subscript and vowel may be stacked vertically at the position of the root letter. Tibetan numerals express time, order, age, seniority and the like; the most common are " " and so on. Punctuation marks separate characters or indicate pauses, tone, and the nature and role of words. They mainly comprise the syllable point (tsheg) " ", the single shad " ", the double shad " " and the quadruple shad " ". The syllable point separates the syllables of a text and is placed between two syllables; the single, double and quadruple shad are placed at the end of a phrase or sentence and mark the end of a phrase, a sentence or a section. Tibetan characters are therefore the main constituent of Tibetan text.

The present invention establishes a vector model representation method for Tibetan text, comprising a vector model representation of Tibetan characters and a vector model representation of Tibetan character strings. Representing Tibetan text with vector models converts it into a form that is easy to process, simplifies computations and operations on the text, and makes Tibetan information processing more efficient.

Summary of the invention

Technical problem: to facilitate computations and operations on Tibetan text, and to provide a simple and efficient representation for machine analysis and understanding of Tibetan text (lexical, syntactic, semantic and pragmatic analysis), the present invention establishes a vector model representation method for Tibetan text. The first problem to be solved is to establish a vector model representation of Tibetan characters; the second is to establish a vector model representation of Tibetan character strings.

In the vector model representation of Tibetan characters, a character is represented by an element of the vector set T = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2}. In the vector model representation of Tibetan character strings, a string consists of several Tibetan characters; under the character model each character corresponds to a one-dimensional vector with 7 components, and taking the transpose of each character's vector in turn as a column yields the two-dimensional vector model of the string, Γ = (tᵢⱼ). Representing Tibetan text with vector models converts it into a form that is easy to process, simplifies computations and operations on the text, and makes Tibetan information processing more efficient.

Technical solution: the vector model representation method for Tibetan text of the present invention comprises:

1. Vector model representation of Tibetan characters

Tibetan is an alphabetic script built mainly on consonants, with 30 consonants and 4 vowels. A character is composed of a root letter (base consonant), prefix, superscript, subscript, suffix, second suffix and vowel. It is spelled both horizontally and vertically: the prefix, root letter, suffix and second suffix are written horizontally, while the superscript, root letter, subscript and vowel may be stacked vertically at the position of the root letter, as shown in Fig. 1. All 30 consonants can serve as the root letter. Of the 30 consonants, 10 can serve as suffixes; 5 of those suffixes can also serve as prefixes, and 2 of them can serve as second suffixes. Of the 30 consonants, 3 can serve as superscripts and 4 can serve as subscripts.

A complete Tibetan character is made up of 1 to 7 components. Following the traditional spelling order, the 7 components (prefix, superscript, root letter, subscript, vowel, suffix and second suffix) are denoted t₁, t₂, t₃, t₄, t₅, t₆ and t₇ respectively, so that a Tibetan character can be expressed in the form shown in Fig. 2.

The prefix of a Tibetan character may be absent or be one of " ", so t₁ can take 6 different values: 0 indicates no prefix, and 1-5 denote the 5 prefixes " " in order. Likewise, t₂ can take 4 different values: 0 indicates no superscript, and 1-3 denote the 3 superscripts " " in order. t₃ can take 30 different values, with 1-30 denoting the 30 root letters (base consonants) in order; the root letter must be present. t₄ can take 5 different values: 0 indicates no subscript, and 1-4 denote the 4 subscripts " " in order. t₅ can take 5 different values: 0 indicates no vowel, and 1-4 denote the 4 vowels " " in order. t₆ can take 11 different values: 0 indicates no suffix, and 1-10 denote the 10 suffixes " " in order. t₇ can take 3 different values: 0 indicates no second suffix, and 1-2 denote the 2 second suffixes " " respectively. A Tibetan character can therefore be represented by an element of the vector model of Tibetan characters (VMTC), T = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2}, as shown in Fig. 3.
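The bounds of the set T can be checked mechanically. The following is a minimal sketch, not part of the patent: the names `VMTC_BOUNDS`, `is_valid_vmtc` and `VMTC_SET_SIZE` are hypothetical, and only the numeric ranges come from the definition of T. Note that T only bounds the encoding space; the text does not claim that every element of T corresponds to a well-formed Tibetan character.

```python
from math import prod

# Inclusive (low, high) bounds for t1..t7, taken from the set T defined above.
# Slot order: prefix, superscript, root letter, subscript, vowel, suffix,
# second suffix. The root letter slot starts at 1 because it is mandatory.
VMTC_BOUNDS = [(0, 5), (0, 3), (1, 30), (0, 4), (0, 4), (0, 10), (0, 2)]


def is_valid_vmtc(vector):
    """Return True if `vector` has 7 integer components inside the bounds of T."""
    if len(vector) != len(VMTC_BOUNDS):
        return False
    return all(
        isinstance(t, int) and low <= t <= high
        for t, (low, high) in zip(vector, VMTC_BOUNDS)
    )


# Number of elements in T: the product of the value counts 6, 4, 30, 5, 5, 11, 3.
VMTC_SET_SIZE = prod(high - low + 1 for low, high in VMTC_BOUNDS)

print(is_valid_vmtc((3, 3, 3, 2, 1, 1, 2)))  # a vector inside the bounds
print(is_valid_vmtc((0, 0, 0, 0, 0, 0, 0)))  # rejected: no root letter
print(VMTC_SET_SIZE)
```

The product of the per-slot value counts stated in the text (6 × 4 × 30 × 5 × 5 × 11 × 3) gives 594,000 vectors in T.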

2. Vector model representation of Tibetan character strings

A Tibetan character string is made up of Tibetan characters, Tibetan numerals and Tibetan punctuation marks. Numerals express time, order, age and seniority and are used for counting, and they are few in number; punctuation marks serve only to separate the characters, phrases or sentences of a string. Tibetan characters are therefore the main constituent of a Tibetan character string. In the VMTC representation each Tibetan character corresponds to a one-dimensional vector with 7 components; taking the vector of each character in the string as a column yields the vector model of the Tibetan string (Vector Model of Tibetan String, VMTS), Γ = (tᵢⱼ), as shown in Fig. 5.
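Written out for a string of n characters, the construction above corresponds to the following 7 × n matrix. This is a sketch of the notation only: the index convention (i for the component, j for the position of the character in the string) is inferred from the Γ₇ₓ₈ example in Embodiment two rather than stated explicitly in the text, and the symbol T_j for the vector of the j-th character is borrowed from that example.

```latex
T_j = \langle t_{1j}, t_{2j}, \dots, t_{7j} \rangle, \qquad j = 1, \dots, n

\Gamma = (t_{ij})_{7 \times n} =
\begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1n} \\
t_{21} & t_{22} & \cdots & t_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
t_{71} & t_{72} & \cdots & t_{7n}
\end{pmatrix}
= \begin{pmatrix} T_1^{\mathsf{T}} & T_2^{\mathsf{T}} & \cdots & T_n^{\mathsf{T}} \end{pmatrix}
```

Row i therefore collects the same component (for example, all root letters in row 3) across the whole string, and column j is the vector of the j-th character.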

Brief description of the drawings

Fig. 1 is a structural diagram of a Tibetan character.

Fig. 2 is a formal representation of a Tibetan character.

Fig. 3 is a flowchart of the vector model representation of a Tibetan character.

Fig. 4 is an example of the vector model representation of a Tibetan character.

Fig. 5 is a flowchart of the vector model representation of a Tibetan character string.

Fig. 6 is a schematic diagram of Tibetan character component decomposition.

Embodiment

The present invention can be applied to the representation of Tibetan text, for example in lexical analysis and syntactic analysis. The Tibetan text comprises Tibetan characters and Tibetan character strings. Embodiments of the invention are described below with reference to the drawings.

Embodiment one

With reference to Fig. 3, the vector model representation of a Tibetan character in this embodiment comprises decomposing the character into components, determining the component values, and building the character's vector model. The procedure is: first decompose the Tibetan character into its components, then determine the value tᵢ (i = 1, 2, 3, 4, 5, 6, 7) of each component, and finally build the vector model <t₁, t₂, t₃, t₄, t₅, t₆, t₇> of the character from the values tᵢ.

With reference to Fig. 4, the Tibetan character " " is decomposed into the prefix " ", superscript " ", root letter " ", subscript " ", vowel " ", suffix " " and second suffix " ". The prefix is the 3rd of the prefixes, so t₁ = 3; the superscript is the 3rd of the superscripts, so t₂ = 3; the root letter is the 3rd of the root letters, so t₃ = 3; the subscript is the 2nd of the subscripts, so t₄ = 2; the vowel is the 1st of the vowels, so t₅ = 1; the suffix is the 1st of the suffixes, so t₆ = 1; the second suffix is the 2nd of the second suffixes, so t₇ = 2. The vector model of the character " " is therefore <3,3,3,2,1,1,2>.
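The position-lookup step of this example can be sketched as follows. The component lists are assumptions: the source's own lists did not survive extraction, so the sketch uses Wylie transliteration names in the conventional grammatical order, and the names `position` and `encode` are hypothetical. It also assumes the character has already been decomposed into components.

```python
# Hypothetical component inventories in Wylie transliteration, listed in the
# conventional grammatical order. The ordering used by the source is not
# visible in the text, so treat these lists as an assumption.
PREFIXES = ["ga", "da", "ba", "ma", "'a"]
SUPERSCRIPTS = ["ra", "la", "sa"]
ROOTS = [
    "ka", "kha", "ga", "nga", "ca", "cha", "ja", "nya", "ta", "tha",
    "da", "na", "pa", "pha", "ba", "ma", "tsa", "tsha", "dza", "wa",
    "zha", "za", "'a", "ya", "ra", "la", "sha", "sa", "ha", "a",
]
SUBSCRIPTS = ["ya", "ra", "la", "wa"]
VOWELS = ["i", "u", "e", "o"]
SUFFIXES = ["ga", "nga", "da", "na", "ba", "ma", "'a", "ra", "la", "sa"]
SECOND_SUFFIXES = ["da", "sa"]


def position(component, inventory):
    """1-based position of `component` in `inventory`, or 0 if it is absent."""
    return 0 if component is None else inventory.index(component) + 1


def encode(root, prefix=None, superscript=None, subscript=None,
           vowel=None, suffix=None, second_suffix=None):
    """Build <t1, ..., t7> from the components of one decomposed character."""
    return (
        position(prefix, PREFIXES),
        position(superscript, SUPERSCRIPTS),
        position(root, ROOTS),
        position(subscript, SUBSCRIPTS),
        position(vowel, VOWELS),
        position(suffix, SUFFIXES),
        position(second_suffix, SECOND_SUFFIXES),
    )


example = encode("ga", prefix="ba", superscript="sa", subscript="ra",
                 vowel="i", suffix="ga", second_suffix="sa")
print(example)
```

Under these assumed orderings, a character with prefix ba, superscript sa, root ga, subscript ra, vowel i, suffix ga and second suffix sa yields (3, 3, 3, 2, 1, 1, 2), the same values as the Fig. 4 example. An unknown component raises `ValueError` from `list.index`, which keeps invalid input from being silently encoded as 0.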

Occasionally a Tibetan character has two subscripts (e.g. " "). In that case the vector model T need only be extended to T* = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇, t₈> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2, 0 ≤ t₈ ≤ 4}, where t₈ denotes the second subscript. For example, the Tibetan character " " can be represented by the vector <0,0,3,2,0,0,0,4>.
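The extension from T to T* amounts to appending one more component. A minimal sketch, assuming a 7-component vector is already available; the function name `extend_to_t_star` is hypothetical.

```python
def extend_to_t_star(vector, t8=0):
    """Append the eighth component t8 (range 0..4) to a 7-component vector."""
    if len(vector) != 7:
        raise ValueError("expected a 7-component character vector")
    if not 0 <= t8 <= 4:
        raise ValueError("t8 must lie in 0..4")
    return tuple(vector) + (t8,)


# The eight-component example from the text: <0,0,3,2,0,0,0> extended with t8 = 4.
print(extend_to_t_star((0, 0, 3, 2, 0, 0, 0), 4))
```

Defaulting `t8` to 0 lets ordinary characters share the same 8-component layout, so a string that mixes both kinds can still be stacked into one matrix.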

Embodiment two

With reference to Fig. 5, the vector model representation of a Tibetan character string in this embodiment comprises identifying the Tibetan characters, decomposing each character into components, determining the component values, building the vector model of each character, and building the vector model of the string. It is a further extension of Embodiment one.

With reference to Fig. 5, the vector model of the Tibetan character string " " ("having good luck") is built as follows:

1) Identify the characters: the Tibetan character string " " consists of the Tibetan characters " ", " ", " ", " ", " ", " ", " ", " ", the syllable point " " and the single shad " ". The syllable point separates characters and the single shad separates sentences, so the Tibetan characters can be identified from the syllable points and the single shad.
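The identification step can be sketched as a split on the two delimiters. This is an illustration under assumptions, not the patent's implementation: it assumes Unicode-encoded input, in which the syllable point is U+0F0B and the single shad is U+0F0D, and the function name `split_characters` is hypothetical. Other marks, such as the double and quadruple shad, are not handled.

```python
import re

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG (syllable point)
SHAD = "\u0f0d"   # TIBETAN MARK SHAD (single shad)

# A run of one or more delimiters counts as a single boundary.
_DELIMITERS = re.compile(f"[{TSHEG}{SHAD}]+")


def split_characters(text):
    """Split a Tibetan string into characters (syllables) at tsheg and shad."""
    return [piece for piece in _DELIMITERS.split(text) if piece]


# Two single-letter syllables separated by a tsheg and closed by a shad.
sample = "\u0f40" + TSHEG + "\u0f41" + SHAD
print(len(split_characters(sample)))
```

Filtering out empty pieces handles the trailing delimiter at the end of a sentence, where `re.split` would otherwise return an empty final string.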

2) Decompose the components: the Tibetan characters are decomposed into components with the "Banzhida Modern Tibetan Character Component Decomposition System V2.0". The procedure is: start the system, open the file to be decomposed, and click "Generate decomposition file" to complete the decomposition, as shown in Fig. 6. The decomposition results for the characters of the string " " are: " " consists of the prefix " ", the root letter " " and the subscript " "; " " consists of the root letter " ", the vowel " " and the suffix " "; " " consists of the prefix " ", the root letter " " and the vowel " "; " " consists of the root letter " ", the vowel " ", the suffix " " and the second suffix " "; " " consists of the root letter " ", the vowel " " and the suffix " "; " " consists of the root letter " ", the vowel " " and the suffix " "; " " consists of the root letter " ", the vowel " ", the suffix " " and the second suffix " "; " " consists of the root letter " " alone.

3) Determine the component values: for each Tibetan character, t₁ = 0 when there is no prefix, t₁ = 1 when the prefix is " ", t₁ = 2 when the prefix is " ", t₁ = 3 when the prefix is " ", t₁ = 4 when the prefix is " ", and t₁ = 5 when the prefix is " "; t₂ = 0 when there is no superscript, t₂ = 1 when the superscript is " ", t₂ = 2 when the superscript is " ", and t₂ = 3 when the superscript is " ". In the same way, t₃ is determined by the position of the character's root letter among the 30 consonants, t₄ by the position of its subscript among the 4 subscripts " ", t₅ by the position of its vowel among the 4 vowels " ", t₆ by the position of its suffix among the 10 suffixes " ", and t₇ by the position of its second suffix among the 2 second suffixes " ".

4) Build the vector model of each Tibetan character from the component values determined in 3). The vector models of the eight characters " ", " ", " ", " ", " ", " ", " " and " " are, in order: T₁ = <3,0,1,2,0,0,0>, T₂ = <0,0,27,0,1,10,0>, T₃ = <3,0,11,0,3,0,0>, T₄ = <0,0,26,0,3,1,2>, T₅ = <0,0,14,0,2,4,0>, T₆ = <0,0,28,0,2,6,0>, T₇ = <0,0,18,0,4,2,1> and T₈ = <0,0,13,0,0,0,0>.
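A quick consistency check between steps 2) and 4) is to read back which component slots of a vector are non-zero. The sketch below is illustrative only; the English slot labels and the function name `present_components` are this example's own.

```python
# Labels for the seven slots t1..t7, in the traditional spelling order.
SLOT_NAMES = ("prefix", "superscript", "root letter", "subscript",
              "vowel", "suffix", "second suffix")


def present_components(vector):
    """Names of the slots whose component value is non-zero."""
    return [name for name, t in zip(SLOT_NAMES, vector) if t != 0]


print(present_components((3, 0, 1, 2, 0, 0, 0)))   # T1
print(present_components((0, 0, 26, 0, 3, 1, 2)))  # T4
```

For T₁ the non-zero slots are the prefix, root letter and subscript, and for T₄ the root letter, vowel, suffix and second suffix, which matches the decomposition listed for the first and fourth characters in step 2).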

5) Build the vector model of the Tibetan character string: taking the transpose of the vector model of each Tibetan character in the string as a column yields the vector model Γ of the string. From 4), the vector model of the Tibetan character string " " is:

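$$
\Gamma_{7 \times 8} =
\begin{pmatrix}
3 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 27 & 11 & 26 & 14 & 28 & 18 & 13 \\
2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 3 & 3 & 2 & 2 & 4 & 0 \\
0 & 10 & 0 & 1 & 4 & 6 & 2 & 0 \\
0 & 0 & 0 & 2 & 0 & 0 & 1 & 0
\end{pmatrix}
$$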
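The column-stacking in step 5) can be sketched in a few lines. The vectors are copied from step 4); the names `CHARACTER_VECTORS` and `string_matrix` are hypothetical, and plain lists are used instead of a matrix library to keep the example dependency-free.

```python
# Character vectors T1..T8 exactly as listed in step 4).
CHARACTER_VECTORS = [
    (3, 0, 1, 2, 0, 0, 0),
    (0, 0, 27, 0, 1, 10, 0),
    (3, 0, 11, 0, 3, 0, 0),
    (0, 0, 26, 0, 3, 1, 2),
    (0, 0, 14, 0, 2, 4, 0),
    (0, 0, 28, 0, 2, 6, 0),
    (0, 0, 18, 0, 4, 2, 1),
    (0, 0, 13, 0, 0, 0, 0),
]


def string_matrix(vectors):
    """Use each character vector as a column; row i holds t_i of every character."""
    return [list(row) for row in zip(*vectors)]


gamma = string_matrix(CHARACTER_VECTORS)
for row in gamma:
    print(row)
```

The result has 7 rows and 8 columns, and its third row lists the root-letter values of the eight characters in reading order: 1, 27, 11, 26, 14, 28, 18, 13.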

Claims

1. A vector model representation method for Tibetan characters, characterized in that:

1) a complete Tibetan character is made up of 1 to 7 components; following the traditional spelling order, the 7 components (prefix, superscript, root letter, subscript, vowel, suffix and second suffix) are denoted t₁, t₂, t₃, t₄, t₅, t₆ and t₇ respectively, so that a Tibetan character can be represented by the vector <t₁, t₂, t₃, t₄, t₅, t₆, t₇>;

2) the prefix of a Tibetan character may be absent or be one of " ", so t₁ has 6 different values: 0 indicates no prefix, and 1-5 denote the 5 prefixes " " in order; likewise, t₂ has 4 different values: 0 indicates no superscript, and 1-3 denote the 3 superscripts " " in order; t₃ has 30 different values, with 1-30 denoting the 30 root letters (base consonants) in order, and the root letter must be present; t₄ has 5 different values: 0 indicates no subscript, and 1-4 denote the 4 subscripts " " in order; t₅ has 5 different values: 0 indicates no vowel, and 1-4 denote the 4 vowels " " in order; t₆ has 11 different values: 0 indicates no suffix, and 1-10 denote the 10 suffixes " " in order; t₇ has 3 different values: 0 indicates no second suffix, and 1-2 denote the 2 second suffixes " " respectively; a Tibetan character can therefore be represented by an element of the vector model T = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2};

3) occasionally a Tibetan character has two subscripts; in that case the vector model T of Tibetan characters need only be extended to T* = {<t₁, t₂, t₃, t₄, t₅, t₆, t₇, t₈> | 0 ≤ t₁ ≤ 5, 0 ≤ t₂ ≤ 3, 0 < t₃ ≤ 30, 0 ≤ t₄ ≤ 4, 0 ≤ t₅ ≤ 4, 0 ≤ t₆ ≤ 10, 0 ≤ t₇ ≤ 2, 0 ≤ t₈ ≤ 4}, where t₈ denotes the second subscript.

2. A vector model representation method for Tibetan character strings, characterized in that:

a Tibetan character string is made up of Tibetan characters, Tibetan numerals and Tibetan punctuation marks; numerals express time, order, age and seniority and are used for counting, and they are few in number; punctuation marks serve only to separate the characters, phrases or sentences of a string; Tibetan characters are therefore the main constituent of a Tibetan character string; in the vector model representation of Tibetan characters each Tibetan character corresponds to a one-dimensional vector with 7 components, and taking the vector of each character in the string as a column yields the vector model of the Tibetan character string, Γ = (tᵢⱼ).