CN106294311A - Tibetan tone prediction method and system - Google Patents

Info

Publication number
CN106294311A
Authority
CN
China
Prior art keywords
word
unit
tone
speech
recognitions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510332794.6A
Other languages
Chinese (zh)
Other versions
CN106294311B (en)
Inventor
朱荣华
祖漪清
王影
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201510332794.6A
Publication of CN106294311A
Application granted
Publication of CN106294311B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Tibetan tone prediction method and system. The method comprises: receiving a Tibetan text to be processed; performing word segmentation on the text to obtain each word unit and the candidate parts of speech of each word unit; determining the type of each word unit; determining the part of speech of each word unit according to its context information in the text and its type; determining the tone information of each word unit according to its context information and its part of speech; and obtaining the tone information of the text from the tone information of the word units. Because the prediction of a word unit's tone considers not only the context of the word unit but also the influence of its part of speech on tone, tone sandhi in continuous Tibetan speech is rendered more naturally.

Description

Tibetan tone prediction method and system
Technical field
The present invention relates to the field of natural language processing, and in particular to a Tibetan tone prediction method and system.
Background
Tone is the most prominent phonetic feature of the Sino-Tibetan languages, and the study of Tibetan tone has always been one of the focal points of Sino-Tibetan phonetic research. Put simply, tone is the pitch of a sound.
Tibetan comprises the Lhasa, Kham and Amdo dialects, among others, with the Lhasa dialect as the basis. In continuous Lhasa speech, the tone pattern of the same monosyllable varies considerably across contexts, and the same polysyllabic word likewise takes different tone patterns in different contexts. A word unit with different tonal realizations often also serves as different parts of speech, so tone change correlates strongly with part of speech, and a change of tone pattern can seriously affect how the meaning is understood. If Tibetan tone patterns cannot be predicted accurately, the effectiveness of Tibetan speech systems is reduced.
Existing Tibetan tone prediction methods are generally rule-based: the initials and finals of syllables are classified, and the tone pattern of a syllable is obtained by looking up a tone pattern table according to the combination of the classified initial and final. However, existing methods do not analyze the mechanism of tone pattern change thoroughly enough. In actual continuous Tibetan speech, tone sandhi is frequent; that is, the part of speech and the tone pattern of the same word unit change with context. The rules of existing methods do not cover these cases completely and struggle to handle tone sandhi in continuous speech, so tone prediction is unsatisfactory; in Tibetan speech synthesis, for example, naturalness drops and even intelligibility can suffer.
Summary of the invention
Embodiments of the present invention provide a Tibetan tone prediction method and system that address the following problem: in actual continuous Tibetan speech, tone sandhi is frequent, that is, the part of speech and tone pattern of the same word unit change with context. The aim is to render tone sandhi in continuous Tibetan speech more naturally.
To this end, embodiments of the present invention provide the following technical solution:
A Tibetan tone prediction method, comprising:
receiving a Tibetan text to be processed;
performing word segmentation on the text to obtain each word unit and the candidate parts of speech of each word unit;
determining the type of each word unit;
determining the part of speech of the word unit according to the context information of the word unit in the text and the type of the word unit;
determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
obtaining the tone information of the text from the tone information of each word unit.
Preferably, the type of the word unit includes any one or more of the following: multi-tone-pattern word, function word, affix, regular word;
the method further comprises:
building word unit type dictionaries by collecting a large amount of Tibetan data, the word unit type dictionaries comprising: a function word dictionary, an affix dictionary, and a multi-tone-pattern dictionary;
determining the type of each word unit comprises:
matching each word unit against each pre-built word unit type dictionary to determine the type of each word unit.
Preferably, determining the part of speech of the word unit according to the context information of the word unit in the text and the type of the word unit comprises:
for a function word, determining its part of speech with a statistical modeling method according to the context information of the function word;
for affixes, assigning all affixes the same part of speech;
for a multi-tone-pattern word, taking the segmentation sequence between the function words before and after it as the segmentation fragment in which the current multi-tone-pattern word unit lies, and determining its part of speech with a statistical modeling method according to the context information of that fragment;
for regular words, assigning all regular words the same part of speech.
Preferably, the context information of the multi-tone-pattern word comprises:
(1) the candidate part-of-speech information of the word units in the segmentation fragment containing the multi-tone-pattern word, the candidate part-of-speech information being the candidate parts of speech of each word unit obtained when segmenting according to the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segmentation fragment;
(3) the total number of word units in the current segmentation fragment;
(4) the number of function words before the multi-tone-pattern word in the sentence containing it;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence containing it;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
Preferably, predicting the tone of the word unit according to the context information of the word unit and the part of speech of the word unit comprises:
if the current word unit is a function word, setting the initial of the function word to a low-class initial;
if the current word unit is a multi-tone-pattern word and its part of speech is verb or adjective, splitting the multi-tone-pattern word into syllables to obtain each syllable unit of the multi-tone-pattern word;
if the current word unit contains an affix part, marking the affix part attached to the word unit and separating it from the stem to obtain the syllable units of the affix part;
performing tone prediction on each syllable of all word units in the text, with the syllable as the tone-bearing unit;
setting all affixes to the weak reading.
A Tibetan tone prediction system, comprising:
a receiving module, for receiving a Tibetan text to be processed;
a word segmentation module, for performing word segmentation on the text to obtain each word unit and the candidate parts of speech of each word unit;
a type determining module, for determining the type of each word unit;
a part-of-speech determining module, for determining the part of speech of the word unit according to the context information of the word unit in the text and the type of the word unit;
a word unit tone determining module, for determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
a text tone acquisition module, for obtaining the tone information of the text from the tone information of each word unit.
Preferably, the system further comprises:
a dictionary building module, for building word unit type dictionaries by collecting a large amount of Tibetan data, the word unit type dictionaries comprising: a function word dictionary, an affix dictionary, and a multi-tone-pattern dictionary;
the type determining module being specifically configured to determine the type of each word unit by matching each word unit against each pre-built word unit type dictionary.
Preferably, the part-of-speech determining module comprises:
a function word part-of-speech determining unit, for determining the part of speech of a function word with a statistical modeling method according to the context information of the function word;
an affix part-of-speech determining unit, for assigning all affixes the same part of speech;
a multi-tone-pattern word part-of-speech determining unit, for taking the segmentation sequence between the function words before and after a multi-tone-pattern word as the segmentation fragment in which the current multi-tone-pattern word unit lies, and determining the part of speech of the multi-tone-pattern word with a statistical modeling method according to the context information of that fragment;
a regular word part-of-speech determining unit, for assigning all regular words the same part of speech.
Preferably, the word unit tone determining module comprises:
a preprocessing unit, for setting the initial of a function word to a low-class initial; for splitting a multi-tone-pattern word whose part of speech is verb or adjective into syllables to obtain each syllable unit of the multi-tone-pattern word; and for marking the affix part attached to a word unit and separating it from the stem to obtain the syllable units of the affix part;
a tone prediction unit, for performing tone prediction on each syllable of all word units in the text, with the syllable as the tone-bearing unit;
a tone adjustment unit, for setting all affixes to the weak reading.
The Tibetan tone prediction method and system provided by embodiments of the present invention perform word segmentation on the Tibetan text to be processed to obtain the word units and the candidate parts of speech of each word unit, then determine the type of each word unit in the text and predict the part of speech of the word unit according to that type, and finally predict the tone of each word unit according to its context information in the text and its part of speech. Because the prediction of a word unit's tone considers not only the context of the word unit but also the influence of its part of speech on tone, tone sandhi in continuous Tibetan speech is rendered more naturally.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are merely some of the embodiments described in the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flow chart of the Tibetan tone prediction method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of the Tibetan tone prediction system provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings. The following embodiments are exemplary; they serve only to explain the present invention and are not to be construed as limiting the claims.
For a better understanding of the present invention, prior-art Tibetan tone prediction is first described briefly. Existing methods usually predict the tone of the text to be processed with rules, for example by looking up a tone pattern table according to the combination of initial and final to obtain the tone pattern of a syllable.

In Tibetan, the tone of a single syllable is determined by its initial and its final. The initial consists of a prefix letter, a superscript letter, a root letter and a subscript letter. The tone of the initial falls into two classes, high and low, which indicate whether the syllable's tone starts high or low. In general, the initial is high when the root letter is voiced and low when the root letter is voiceless. A prefix letter or superscript letter can change the root letter's tone to high. The final consists of a vowel sign, a suffix letter and a second suffix letter; according to its ending, a final belongs to one of three classes: long final, checked final, or monophthong final.

When tone is predicted for a syllable of the current word unit, the type combination of the syllable's initial and final is first determined from an initial/final type table, which is usually built by domain experts. The tone pattern table is then searched to determine the tone pattern for that combination. The tone pattern table contains the tone patterns of the various combinations of initial and final; it is generally built from rules, and these rules describe the phonetic characteristics of Tibetan.

For example, for the syllable […], the root letter is […], the prefix letter is […], and the initial is […], which belongs to the high class; the prefix letter […] does not change the class of the initial. The vowel sign is […], the suffix letter is […], and the final is […], which is a checked final. The initial/final combination of the syllable is therefore a high initial with a checked final, and looking up the tone pattern table gives the falling tone (f) for the syllable […].
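The table lookup described above can be sketched in a few lines. This is a minimal illustration, not the actual table used by any existing method: only the entry that maps a high-class initial with a checked final to the falling tone (f) comes from the worked example. The class names, the function `lookup_tone`, and every other table entry are placeholders assumed for illustration.

```python
from typing import Dict, Tuple

# Classes named in the text: two initial classes, three final classes.
INITIAL_CLASSES = ("high", "low")
FINAL_CLASSES = ("long", "checked", "monophthong")

# Hypothetical tone pattern table. Only ("high", "checked") -> "f" is taken
# from the worked example; all other entries are assumed placeholders.
TONE_TABLE: Dict[Tuple[str, str], str] = {
    ("high", "checked"): "f",
    ("high", "long"): "h",         # assumed
    ("high", "monophthong"): "h",  # assumed
    ("low", "checked"): "r",       # assumed
    ("low", "long"): "l",          # assumed
    ("low", "monophthong"): "l",   # assumed
}


def lookup_tone(initial_class: str, final_class: str) -> str:
    """Return the tone pattern for an (initial class, final class) pair."""
    if initial_class not in INITIAL_CLASSES:
        raise ValueError(f"unknown initial class: {initial_class}")
    if final_class not in FINAL_CLASSES:
        raise ValueError(f"unknown final class: {final_class}")
    return TONE_TABLE[(initial_class, final_class)]
```

A lookup of this kind depends only on the syllable itself, which is why it cannot capture context-dependent tone changes.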
However, in actual continuous Tibetan speech, tone sandhi is frequent: the tone pattern of the same syllable in different contexts changes with the part of speech of the word. Existing methods, which predict the tone pattern of a syllable by table lookup according to the combination of initial and final, therefore cannot handle word units whose tone changes with their part of speech, and tone prediction for word units of the multi-tone-pattern type is inaccurate.
The Tibetan tone prediction method and system provided by the present invention perform word segmentation on the Tibetan text to be processed to obtain each word unit and its candidate parts of speech, determine the part of speech of each word unit, and then predict the tone of the word unit according to its context information and part of speech. Because the influence of a word unit's part of speech on its tone is considered, the problem of tone pattern change of multi-tone-pattern words in continuous Tibetan speech is solved, and the effectiveness of speech systems is improved.
For a better understanding of the technical solution and its technical effect, a detailed description follows with reference to the flow chart and specific embodiments.
As shown in Fig. 1, the flow chart of the Tibetan tone prediction method provided by an embodiment of the present invention comprises the following steps:
Step S01: receive a Tibetan text to be processed.
Step S02: perform word segmentation on the text to obtain each word unit and the candidate parts of speech of each word unit.
In practice, segmentation is carried out according to a Tibetan segmentation principle to obtain the segmentation units. For example, segmentation may follow a pre-built segmentation dictionary, yielding the word units and the candidate parts of speech of each word unit; the segmentation dictionary contains information such as the candidate parts of speech of word units.
Step S03: determine the type of each word unit.
In embodiments of the present invention, Tibetan tone change follows four different patterns. Word units can therefore be divided into four types according to how the tones of their syllables change across contexts in the text: multi-tone-pattern words, function words, affixes and regular words. The four types are described in turn below:
1) Function word: in isolation, function words with different initials have different tones. A high-class initial gives tone h (high) or f (falling) depending on the suffix letter; a low-class initial gives tone l (low) or r (rising) depending on the suffix letter. In continuous speech, almost all function words are read as l or r; case particles, conjunctions, prepositions, state words, modal particles and the like are generally read low;
2) Multi-tone-pattern word: the tone pattern differs with the part of speech the word serves as. For example, the word whose Latin transliteration is khrom skor has the tone combination hh as a noun and fr as a verb;
3) Affix: pronounced with the weak reading, for example the affixes ba, wa, bo, pa, po, ma, mo; the pronunciation is similar to the neutral tone in Mandarin Chinese;
4) Regular word: the tone follows inherent sandhi rules. For example, when two syllables that are each read with the low tone l in isolation form a word, the word is not read ll but is usually read lh; that is, the low tone of the second syllable becomes high.
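Two of the behaviours listed above can be written down directly from the examples in the text. The sketch below is illustrative only: the function names are hypothetical, the dictionary holds just the one multi-tone-pattern word quoted in the text, and the sandhi function implements only the single ll to lh rule given as an example, not a full rule set.

```python
from typing import Dict

# From the text: khrom skor reads "hh" as a noun and "fr" as a verb.
MULTI_TONE_PATTERNS: Dict[str, Dict[str, str]] = {
    "khrom skor": {"noun": "hh", "verb": "fr"},
}


def multi_tone_pattern(word: str, pos: str) -> str:
    """Tone combination of a multi-tone-pattern word for a given part of speech."""
    return MULTI_TONE_PATTERNS[word][pos]


def regular_word_sandhi(citation_tones: str) -> str:
    """Apply the one sandhi rule quoted in the text: disyllabic 'll' becomes 'lh'.

    Any other tone string is returned unchanged; real regular words would
    need further rules that the text does not list.
    """
    return "lh" if citation_tones == "ll" else citation_tones
```

The first function shows why part of speech has to be known before tone can be predicted; the second shows a change that depends only on the neighbouring syllable.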
In practice, word unit type dictionaries can be built in advance by collecting a large amount of Tibetan data. The word unit type dictionaries comprise a function word dictionary, an affix dictionary and a multi-tone-pattern dictionary. Each word unit is then matched against each pre-built word unit type dictionary to determine its type.
In a specific embodiment, the type of a word unit is determined as follows:
Step 1) Mark each word unit in the text to be processed as a regular word unit; go to Step 2);
Step 2) Judge whether the current word unit is present in the function word dictionary. If it is, the current word unit is a function word; go to Step 5). Otherwise go to Step 3);
Step 3) Use the affix dictionary to judge whether the current word unit contains an affix part. A word unit may contain several affix parts, and an affix may also stand alone as a word unit. Judging whether the current word unit contains an affix part, or is itself an affix, may proceed as follows:
First, split the current word unit into single syllables. Let the number of syllables be n, and let m be the largest number of syllables contained in a single affix in the affix dictionary;
Then, counting back from the last syllable of the current word unit, select the trailing min{m, n} syllables and match them against the affixes of min{m, n} syllables in the affix dictionary. If the match succeeds, the affix to which these syllables belong has been found. If the match fails, select the trailing min{m, n}-1 syllables of the current word unit and match them against the affixes of min{m, n}-1 syllables, and so on, until only the last syllable of the current word unit is selected. If a match succeeds along the way, the current word unit contains an affix, and the affix to which the syllables belong has been found; if no match succeeds, the current word unit contains no affix. Here min{m, n} denotes the smaller of m and n. In particular, when all syllables of the current word unit match all syllables of a single affix in the affix dictionary, the current word unit is itself an affix;
Finally, decide which step to perform according to whether the current word unit contains an affix part or is itself an affix: if the current word unit contains an affix part or is an affix, go to Step 5); if the current word unit contains no affix information, go to Step 4);
Step 4) Judge whether the part-of-speech information of the current word unit is present in the multi-tone-pattern dictionary. If it is, the current word unit is a multi-tone-pattern word; go to Step 5). If it is not, the current word unit is a regular word; go to Step 5);
Step 5) Obtain the type of the current word unit.
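Steps 1) to 5) can be sketched as a dictionary cascade with a longest-suffix affix match. This is a minimal sketch under stated assumptions: word units and dictionary entries are represented as tuples of transliterated syllables, the function names are hypothetical, the multi-tone-pattern check is done on the word unit itself, and a word unit that merely contains an affix keeps its initial "regular" label while the matched affix is returned alongside it. The text does not spell out these last two points.

```python
from typing import Optional, Sequence, Set, Tuple

Syllables = Tuple[str, ...]


def match_affix(syllables: Sequence[str], affix_dict: Set[Syllables]) -> Optional[Syllables]:
    """Try the trailing min(m, n) syllables first, then shorter suffixes."""
    if not syllables or not affix_dict:
        return None
    m = max(len(affix) for affix in affix_dict)  # longest affix in the dictionary
    n = len(syllables)
    for k in range(min(m, n), 0, -1):
        tail = tuple(syllables[-k:])
        if tail in affix_dict:
            return tail
    return None


def classify_word_unit(
    syllables: Sequence[str],
    function_dict: Set[Syllables],
    affix_dict: Set[Syllables],
    multi_tone_dict: Set[Syllables],
) -> Tuple[str, Optional[Syllables]]:
    """Return (type, matched affix or None) following Steps 1) to 5)."""
    unit = tuple(syllables)
    if unit in function_dict:                    # Step 2)
        return "function", None
    affix = match_affix(unit, affix_dict)        # Step 3)
    if affix is not None:
        if affix == unit:                        # the whole unit is an affix
            return "affix", affix
        return "regular", affix                  # assumption: label stays "regular"
    if unit in multi_tone_dict:                  # Step 4)
        return "multi_tone", None
    return "regular", None                       # default label from Step 1)
```

With toy dictionaries such as `{("gi",)}` for function words and `{("pa",), ("po",)}` for affixes, a two-syllable unit ending in `pa` is reported as containing the affix `("pa",)`.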
Further, this embodiment may also judge the type of the current word unit a second time on the basis of context information; for example, the context information of a function word or affix identified above may be used to judge again whether the current word unit is a function word unit or an affix unit.
Step S04: determine the part of speech of the word unit according to the context information of the word unit in the text and the type of the word unit.
Tibetan has many word units that are identical in form but take different parts of speech in different contexts, and the tones of these word units change accordingly; that is, these word units are ambiguous in part of speech. Part of speech therefore has to be predicted from the context in which the word unit occurs. This embodiment predicts the part of speech of a word unit with a statistical modeling method, according to the context information of the word unit in the text and the type of the word unit, for example:
for a function word, determining its part of speech with a statistical modeling method according to the context information of the function word;
for affixes, assigning all affixes the same part of speech;
for a multi-tone-pattern word, taking the segmentation sequence between the function words before and after it as the segmentation fragment in which the current multi-tone-pattern word unit lies, and determining its part of speech with a statistical modeling method according to the context information of that fragment;
for regular words, assigning all regular words the same part of speech.
In a specific embodiment, the part of speech of a word unit can be predicted as follows:
Step 1) Part-of-speech prediction for function word units. The parts of speech of function words include case particle, conjunction, preposition, modal particle and auxiliary verb. The optimal part of speech of a function word is predicted with a statistical modeling method that considers the context of the current function word unit. For example, prediction may be carried out sentence by sentence over the function words in the sentence: the candidate part-of-speech information of all function words in the sentence is considered, an optimal path is determined, and the candidate parts of speech on the optimal path are taken as the optimal parts of speech of the function words in the current sentence.
Step 2) Part-of-speech prediction for affix units.
In this embodiment, all affix parts and affixes use the same part of speech; for example, their part of speech can be set directly to "affix".
Step 3) Part-of-speech prediction for multi-tone-pattern word units.
The segmentation sequence between the function word units before and after a multi-tone-pattern word unit is taken as the segmentation fragment in which the current multi-tone-pattern word unit lies. According to the context information of that fragment, the optimal part of speech is predicted with a statistical modeling method, which determines the part of speech of each multi-tone-pattern word unit. The context information of a multi-tone-pattern word comprises:
(1) the candidate part-of-speech information of the word units in the segmentation fragment containing the multi-tone-pattern word, the candidate part-of-speech information being the candidate parts of speech of the word units obtained when segmenting according to the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segmentation fragment;
(3) the total number of word units in the current segmentation fragment;
(4) the number of function words before the multi-tone-pattern word in the sentence containing it;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence containing it;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
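The eight context items above map naturally onto a feature dictionary. The sketch below is one possible reading, not the patent's implementation: the `WordUnit` record, its field names and the feature keys are all hypothetical, and the fragment is taken to be the run of non-function word units around the target, bounded by the nearest function word on each side.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WordUnit:
    text: str
    word_type: str                 # "function", "multi_tone", "affix" or "regular"
    candidate_pos: List[str]
    pos: Optional[str] = None      # filled in once predicted (function words first)
    has_affix: bool = False


def multi_tone_features(sentence: List[WordUnit], idx: int) -> Dict[str, object]:
    """Build features (1) to (8) for the multi-tone-pattern word at sentence[idx]."""
    # The fragment extends left and right until a function word is reached.
    left = idx
    while left > 0 and sentence[left - 1].word_type != "function":
        left -= 1
    right = idx
    while right < len(sentence) - 1 and sentence[right + 1].word_type != "function":
        right += 1
    fragment = sentence[left:right + 1]

    before = [u for u in sentence[:idx] if u.word_type == "function"]
    after = [u for u in sentence[idx + 1:] if u.word_type == "function"]
    return {
        "fragment_candidate_pos": [u.candidate_pos for u in fragment],        # (1)
        "position_in_fragment": idx - left,                                   # (2)
        "fragment_length": len(fragment),                                     # (3)
        "function_words_before": len(before),                                 # (4)
        "nearest_function_pos_before": before[-1].pos if before else None,    # (5)
        "function_words_after": len(after),                                   # (6)
        "nearest_function_pos_after": after[0].pos if after else None,        # (7)
        "has_affix": sentence[idx].has_affix,                                 # (8)
    }
```

Features (5) and (7) rely on the function words having been tagged first, which matches the order of Steps 1) and 3). The resulting dictionary could be fed to any statistical classifier; the text does not name a specific model.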
Step 4) Part-of-speech prediction for regular word units.
In this embodiment, all regular words use the same part of speech; for example, their part of speech can be set directly to "regular word".
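The optimal-path judgment mentioned in Step 1) is not specified further in the text. One common way to realize such a search is the Viterbi algorithm over the candidate parts of speech of the function words in a sentence; the sketch below uses it as a stand-in. The function name, the log-score inputs `emit` and `trans`, and the default score for unseen events are all assumptions, and how the scores would be estimated is left open.

```python
from typing import Dict, List, Tuple


def best_pos_path(
    candidates: List[List[str]],
    emit: List[Dict[str, float]],
    trans: Dict[Tuple[str, str], float],
    default: float = -10.0,
) -> List[str]:
    """Viterbi search over candidate tags; all scores are log-scores (higher is better).

    candidates[i] lists the candidate parts of speech of the i-th function word,
    emit[i][tag] scores that tag at position i, and trans[(a, b)] scores tag a
    being followed by tag b. Missing scores fall back to `default`.
    """
    if not candidates:
        return []
    # scores[tag] = best score of any path ending in `tag` at the current position
    scores = {t: emit[0].get(t, default) for t in candidates[0]}
    backpointers: List[Dict[str, str]] = []
    for i in range(1, len(candidates)):
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in candidates[i]:
            prev = max(scores, key=lambda p: scores[p] + trans.get((p, t), default))
            new_scores[t] = scores[prev] + trans.get((prev, t), default) + emit[i].get(t, default)
            pointers[t] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the best final tag.
    tag = max(scores, key=lambda t: scores[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

The search considers all candidate tags of all function words in the sentence jointly, which is the property the text asks for; any other sequence model with the same property would fit the description equally well.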
Step S05: determine the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit.
When predicting the tone of a word unit, this embodiment considers the influence of the word unit's part of speech on the tone of words whose part of speech varies, such as multi-tone-pattern words, so that the predicted tone is more accurate. Before the tone is predicted, the word unit is preprocessed according to its part of speech, for example:
if the current word unit is a function word, the initial of the function word is set to a low-class initial for tone prediction;
if the current word unit has an affix part attached to it, the affix part is marked and separated from the stem, giving the syllable units of the affix part and of the non-affix part of the word unit;
if the current word unit is a multi-tone-pattern word and its part of speech is verb or adjective, the multi-tone-pattern word is split into syllables to obtain each of its syllable units; otherwise the multi-tone-pattern word as a whole is taken as the syllable unit;
Then, on the basis of this preprocessing, tone prediction is performed on each syllable unit of all word units in the text, with the syllable unit as the tone-bearing unit.
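The three preprocessing rules can be sketched as a single function that turns a word unit into its tone-bearing syllable units. The record type, field names and the `affix_len` parameter are hypothetical. The sketch also assumes that regular words are split syllable by syllable and that a multi-tone-pattern word that is neither verb nor adjective is kept whole even if it carries an affix; the text does not address that combination.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SyllableUnit:
    text: str
    is_affix: bool = False
    forced_initial_class: Optional[str] = None  # "low" for function words


def preprocess_word_unit(
    syllables: List[str], word_type: str, pos: str, affix_len: int = 0
) -> List[SyllableUnit]:
    """Split a word unit into tone-bearing units following the rules in the text.

    affix_len is the number of trailing syllables that form an attached affix.
    """
    if word_type == "function":
        # Function words: treat the initial as low class during tone prediction.
        return [SyllableUnit(s, forced_initial_class="low") for s in syllables]
    if word_type == "multi_tone" and pos not in ("verb", "adjective"):
        # Not verb/adjective: the whole multi-tone-pattern word is one unit.
        return [SyllableUnit(" ".join(syllables))]
    # Otherwise split by syllable and mark the trailing affix syllables, if any.
    stem_len = len(syllables) - affix_len
    return [SyllableUnit(s, is_affix=(i >= stem_len)) for i, s in enumerate(syllables)]
```

For instance, `preprocess_word_unit(["khrom", "skor"], "multi_tone", "noun")` yields a single unit, while the same word tagged as a verb yields two.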
In this embodiment, after function words, multi-tone-pattern words, affixes and regular words have been processed as above, tone prediction is carried out syllable by syllable over the text. The contextual features used include, for example, the initial class of the current syllable, the final class of the current syllable, the initial and final classes of the preceding and following syllables, the position of the current syllable within its word unit, the length of the word unit containing the current syllable, and the part of speech of that word unit. Note that when tone is predicted for affixes and regular words, all affixes use the same part of speech and all regular words use the same part of speech.
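The contextual features listed above can be collected per syllable as follows. This is a sketch only: the function name and feature keys are hypothetical, and each syllable is assumed to be already reduced to an (initial class, final class) pair.

```python
from typing import Dict, List, Optional, Tuple

# (initial_class, final_class) for one syllable, e.g. ("high", "checked").
SyllableClass = Tuple[str, str]


def syllable_features(
    classes: List[SyllableClass],
    idx: int,
    position_in_word: int,
    word_length: int,
    word_pos: str,
) -> Dict[str, object]:
    """Collect the contextual features of the syllable at classes[idx]."""
    prev_cls: Optional[SyllableClass] = classes[idx - 1] if idx > 0 else None
    next_cls: Optional[SyllableClass] = classes[idx + 1] if idx + 1 < len(classes) else None
    initial, final = classes[idx]
    return {
        "initial_class": initial,
        "final_class": final,
        "prev_classes": prev_cls,      # None at the start of the text
        "next_classes": next_cls,      # None at the end of the text
        "position_in_word": position_in_word,
        "word_length": word_length,
        "word_pos": word_pos,
    }
```

The `word_pos` feature is what distinguishes this setup from the table lookup of the prior art: two syllables with identical initial and final classes can receive different tones when their words differ in part of speech.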
Finally, after the tone of a word unit has been predicted, the tone can also be adjusted according to features specific to Tibetan; for example, all affixes are set to the weak reading. Common affixes include ba, wa, bo, pa, po, ma and mo.
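The weak-reading adjustment amounts to overriding the predicted tone of every affix syllable. In the minimal sketch below, the label `"weak"` and the tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple

WEAK = "weak"  # assumed label for the weak reading


def apply_weak_reading(units: List[Tuple[str, bool, str]]) -> List[Tuple[str, str]]:
    """Map (text, is_affix, predicted_tone) to (text, final_tone).

    Affix syllables are set to the weak reading regardless of the predicted tone.
    """
    return [(text, WEAK if is_affix else tone) for text, is_affix, tone in units]
```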
Step S06: obtain the tone information of the pending Tibetan text from the tone information of each word unit.
The Tibetan tone prediction method provided by the embodiment of the present invention performs word segmentation on the received pending Tibetan text to obtain each word unit and its candidate parts of speech, determines the type of each word unit, determines the part of speech of each word unit from its type and its context information in the pending Tibetan text, and then predicts tone from the context information and part of speech of the word unit. Because the prediction takes into account the influence of a word unit's part of speech on its tone, it addresses the problem that, in actual continuous Tibetan speech, tone patterns frequently undergo tone sandhi, that is, the part of speech and tone pattern of the same word unit can change across contexts. As a result, the tone sandhi in continuous Tibetan speech is more natural.
Correspondingly, the present invention also provides a Tibetan tone prediction system, comprising:
A receiving module 201, for receiving the pending Tibetan text;
A word segmentation module 202, for performing word segmentation on the pending Tibetan text to obtain each word unit and the candidate parts of speech of each word unit;
A type determination module 203, for determining the type of each word unit;
A part-of-speech determination module 204, for determining the part of speech of the word unit according to the context information of the word unit in the pending Tibetan text and the type of the word unit;
A word unit tone determination module 205, for determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
A text tone acquisition module 206, for obtaining the tone information of the pending Tibetan text from the tone information of each word unit.
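The way modules 201 to 206 chain together can be sketched as one function that takes each module as a callable. The function name, the callable signatures and the choice to form the text-level tone information by concatenating the per-word tone sequences are all assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Segmented = List[Tuple[str, List[str]]]   # (word unit, candidate parts of speech)


def predict_text_tones(
    text: str,                                                   # received by module 201
    segment: Callable[[str], Segmented],
    determine_type: Callable[[str], str],
    determine_pos: Callable[[Segmented, Sequence[str], int], str],
    predict_word_tones: Callable[[Segmented, Sequence[str], int], List[str]],
) -> List[str]:
    words = segment(text)                                        # module 202
    types = [determine_type(word) for word, _ in words]          # module 203
    pos_tags = [determine_pos(words, types, i)                   # module 204
                for i in range(len(words))]
    per_word = [predict_word_tones(words, pos_tags, i)           # module 205
                for i in range(len(words))]
    return [tone for tones in per_word for tone in tones]        # module 206
```

Passing the modules in as callables keeps the sketch independent of any particular segmenter or statistical model.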
In this embodiment, the pending Tibetan text received by the receiving module 201 may be Tibetan text composed of Tibetan words of any type.
In practice, the word segmentation module 202 segments the pending Tibetan text into words to obtain the individual word units. The segmentation principles are chosen to suit Tibetan information processing, so that the text can be processed automatically by computer.
Then, the type determination module 203 determines the type of each word unit. In this embodiment, the type of each word unit may be determined by dictionary lookup, in which case the system further comprises:
A dictionary construction module 303, for building word unit type dictionaries from a large collected body of Tibetan data, the word unit type dictionaries including a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary. Correspondingly, the type determination module 203 is specifically configured to determine the type of each word unit by matching it against the pre-built word unit type dictionaries.
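Dictionary matching of this kind might look like the sketch below. The function name and type labels are hypothetical, the dictionaries are modelled as plain sets, and the lookup order (function word, then affix, then multi-tone-pattern word, with everything else treated as a regular word) is an assumption, since the text does not state a precedence.

```python
from typing import Set


def determine_type(
    word: str,
    function_words: Set[str],
    affixes: Set[str],
    multi_tone_words: Set[str],
) -> str:
    """Classify a word unit by looking it up in the three type dictionaries."""
    if word in function_words:
        return "function"
    if word in affixes:
        return "affix"
    if word in multi_tone_words:
        return "multi_tone"
    return "regular"   # not found in any dictionary
```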
In this embodiment, the part-of-speech determination module 204 determines the part of speech of each word unit according to its type, and processes different types of word unit differently. Specifically, the part-of-speech determination module 204 includes:
A function word part-of-speech determination unit, for determining the part of speech of a function word from its context information using a statistical modeling method;
An affix part-of-speech determination unit, for assigning all affixes the same part of speech;
A multi-tone-pattern word part-of-speech determination unit, for taking the segmentation sequence between the function words preceding and following a multi-tone-pattern word as the segmentation fragment in which that word unit lies, and determining the part of speech of the multi-tone-pattern word from the context information of that fragment using a statistical modeling method;
A regular word part-of-speech determination unit, for assigning all regular words the same part of speech.
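The segmentation fragment of a multi-tone-pattern word, and the positional context features listed in claim 4, can be sketched as follows. The function names, feature names and type labels are hypothetical, and the sketch assumes that the parts of speech of function words have already been resolved. It omits the candidate part-of-speech feature and the affix flag from claim 4.

```python
from typing import Dict, List, Optional, Tuple


def fragment_bounds(types: List[str], i: int) -> Tuple[int, int]:
    """Inclusive bounds of the run of word units around index i that lies
    between the nearest function words on either side."""
    left = i
    while left > 0 and types[left - 1] != "function":
        left -= 1
    right = i
    while right < len(types) - 1 and types[right + 1] != "function":
        right += 1
    return left, right


def multi_tone_context(
    types: List[str], pos_tags: List[str], i: int
) -> Dict[str, Optional[object]]:
    """Positional context features for the multi-tone-pattern word at index i."""
    left, right = fragment_bounds(types, i)
    before = [j for j in range(i) if types[j] == "function"]
    after = [j for j in range(i + 1, len(types)) if types[j] == "function"]
    return {
        "position_in_fragment": i - left,
        "fragment_length": right - left + 1,
        "function_words_before": len(before),
        "nearest_function_pos_before": pos_tags[before[-1]] if before else None,
        "function_words_after": len(after),
        "nearest_function_pos_after": pos_tags[after[0]] if after else None,
    }
```

These features would then be passed to the statistical model that chooses among the candidate parts of speech.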
Then, the word unit tone determination module 205 predicts the tone of each word unit from the part of speech obtained by the part-of-speech determination module 204 and the context information of the word unit.
In practice, the word unit tone determination module 205 includes:
A preprocessing unit, for setting the initial of a function word to a low-class initial, and for splitting a multi-tone-pattern word whose part of speech is verb or adjective syllable by syllable to obtain its individual syllable units;
A tone prediction unit, for performing tone prediction on each syllable of every word unit in the pending Tibetan text, with the syllable as the tone-bearing unit;
Further, this module may also adjust the tones of word units, in which case the word unit tone determination module 205 may also include:
A tone adjustment unit, for setting all affixes to weak reading.
The Tibetan tone prediction system provided by the embodiment of the present invention uses the word segmentation module 202 to segment the received pending Tibetan text and obtain each word unit and its candidate parts of speech, uses the type determination module 203 to determine the type of each word unit, and uses the part-of-speech determination module 204 to determine the part of speech of each word unit from its type and context information. Because the part of speech of each word unit is taken into account during tone prediction, the tone information determined by the word unit tone determination module 205 is more accurate. This addresses the problem that, in actual continuous Tibetan speech, tone patterns frequently undergo tone sandhi, that is, the part of speech and tone pattern of the same word unit can change across contexts, and makes the tone sandhi in continuous Tibetan speech more natural.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. The system embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above. Specific embodiments are used herein to explain the invention, and the description of the above embodiments is intended only to help in understanding the method and apparatus of the present invention. For a person of ordinary skill in the art, both the specific embodiments and the scope of application may vary in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (9)

1. A Tibetan tone prediction method, characterized by comprising:
Receiving a pending Tibetan text;
Performing word segmentation on the pending Tibetan text to obtain each word unit and the candidate parts of speech of each word unit;
Determining the type of each word unit;
Determining the part of speech of the word unit according to the context information of the word unit in the pending Tibetan text and the type of the word unit;
Determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
Obtaining the tone information of the pending Tibetan text from the tone information of each word unit.
2. The method according to claim 1, characterized in that the type of the word unit includes any one or more of the following: multi-tone-pattern word, function word, affix, regular word;
The method further comprises:
Building word unit type dictionaries from a large collected body of Tibetan data, the word unit type dictionaries including: a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary;
Determining the type of each word unit comprises:
Determining the type of each word unit by matching each word unit against the pre-built word unit type dictionaries.
3. The method according to claim 2, characterized in that determining the part of speech of the word unit according to the context information of the word unit in the pending Tibetan text and the type of the word unit comprises:
For a function word, determining the part of speech of the function word from its context information using a statistical modeling method;
For affixes, assigning all affixes the same part of speech;
For a multi-tone-pattern word, taking the segmentation sequence between the function words preceding and following it as the segmentation fragment in which the current multi-tone-pattern word unit lies, and determining the part of speech of the multi-tone-pattern word from the context information of that fragment using a statistical modeling method;
For regular words, assigning all regular words the same part of speech.
4. The method according to claim 3, characterized in that the context information of the multi-tone-pattern word includes:
(1) the candidate part-of-speech information of the multi-tone-pattern word in the segmentation fragment, the candidate part-of-speech information being the candidate parts of speech of the word unit obtained when segmenting according to the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segmentation fragment;
(3) the total number of word units in the current segmentation fragment;
(4) the number of function words before the multi-tone-pattern word in the sentence containing it;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence containing it;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
5. The method according to claim 1, characterized in that predicting the tone of the word unit according to the context information of the word unit and the part of speech of the word unit comprises:
If the current word unit is a function word, setting the initial of the function word to a low-class initial;
If the current word unit is a multi-tone-pattern word and its part of speech is verb or adjective, splitting the multi-tone-pattern word syllable by syllable to obtain each syllable unit of the multi-tone-pattern word;
If the current word unit contains an affix part, marking the affix part attached to the word unit and separating it from the stem part to obtain the syllable units of the affix part;
Performing tone prediction on each syllable of every word unit in the pending Tibetan text, with the syllable as the tone-bearing unit;
Setting all affixes to weak reading.
6. A Tibetan tone prediction system, characterized by comprising:
A receiving module, for receiving a pending Tibetan text;
A word segmentation module, for performing word segmentation on the pending Tibetan text to obtain each word unit and the candidate parts of speech of each word unit;
A type determination module, for determining the type of each word unit;
A part-of-speech determination module, for determining the part of speech of the word unit according to the context information of the word unit in the pending Tibetan text and the type of the word unit;
A word unit tone determination module, for determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
A text tone acquisition module, for obtaining the tone information of the pending Tibetan text from the tone information of each word unit.
7. The system according to claim 6, characterized in that the system further comprises:
A dictionary construction module, for building word unit type dictionaries from a large collected body of Tibetan data, the word unit type dictionaries including: a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary;
The type determination module is specifically configured to determine the type of each word unit by matching each word unit against the pre-built word unit type dictionaries.
8. The system according to claim 7, characterized in that the part-of-speech determination module includes:
A function word part-of-speech determination unit, for determining the part of speech of a function word from its context information using a statistical modeling method;
An affix part-of-speech determination unit, for assigning all affixes the same part of speech;
A multi-tone-pattern word part-of-speech determination unit, for taking the segmentation sequence between the function words preceding and following a multi-tone-pattern word as the segmentation fragment in which the current multi-tone-pattern word unit lies, and determining the part of speech of the multi-tone-pattern word from the context information of that fragment using a statistical modeling method;
A regular word part-of-speech determination unit, for assigning all regular words the same part of speech.
9. The system according to claim 6, characterized in that the word unit tone determination module includes:
A preprocessing unit, for setting the initial of a function word to a low-class initial; for splitting a multi-tone-pattern word whose part of speech is verb or adjective syllable by syllable to obtain each syllable unit of the multi-tone-pattern word; and for marking the affix part attached to a word unit and separating it from the stem part to obtain the syllable units of the affix part;
A tone prediction unit, for performing tone prediction on each syllable of every word unit in the pending Tibetan text, with the syllable as the tone-bearing unit;
A tone adjustment unit, for setting all affixes to weak reading.
CN201510332794.6A 2015-06-12 2015-06-12 A kind of Tibetan language tone prediction technique and system Active CN106294311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510332794.6A CN106294311B (en) 2015-06-12 2015-06-12 A kind of Tibetan language tone prediction technique and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510332794.6A CN106294311B (en) 2015-06-12 2015-06-12 A kind of Tibetan language tone prediction technique and system

Publications (2)

Publication Number Publication Date
CN106294311A true CN106294311A (en) 2017-01-04
CN106294311B CN106294311B (en) 2019-03-19

Family

ID=57650055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510332794.6A Active CN106294311B (en) 2015-06-12 2015-06-12 A kind of Tibetan language tone prediction technique and system

Country Status (1)

Country Link
CN (1) CN106294311B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444676A (en) * 2018-12-28 2020-07-24 北京深知无限人工智能研究院有限公司 Part-of-speech tagging method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440236A (en) * 2013-09-16 2013-12-11 中央民族大学 United labeling method for syntax of Tibet language and semantic roles
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN104538025A (en) * 2014-12-23 2015-04-22 西北师范大学 Method and device for converting gestures to Chinese and Tibetan bilingual voices

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
林秀艳: "藏汉语声调异同之比较", 《阿坝师范高等专科学校学报》 *
索南扎西: "藏语语音合成关键技术研究", 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 *
羊毛卓么: "藏文词性自动标注系统的研究与实现", 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 *

Also Published As

Publication number Publication date
CN106294311B (en) 2019-03-19

Similar Documents

Publication Publication Date Title
Protopapas et al. IPLR: An online resource for Greek word-level and sublexical information
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
Gorman et al. Improving homograph disambiguation with supervised machine learning
CN105989833A (en) Multilingual mixed-language text character-pronunciation conversion method and system
Fernando et al. Comprehensive part-of-speech tag set and svm based pos tagger for sinhala
Kirchhoff et al. Novel speech recognition models for Arabic
JP2008225963A (en) Machine translation device, replacement dictionary creating device, machine translation method, replacement dictionary creating method, and program
Sun et al. Knowledge distillation from bert in pre-training and fine-tuning for polyphone disambiguation
Rathod et al. Survey of various POS tagging techniques for Indian regional languages
Bar-Haim et al. Choosing an optimal architecture for segmentation and POS-tagging of Modern Hebrew
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
Hatab et al. Enhancing deep learning with embedded features for Arabic named entity recognition
CN106294311A (en) A kind of Tibetan language tone Forecasting Methodology and system
Leidig et al. Automatic detection of anglicisms for the pronunciation dictionary generation: a case study on our German IT corpus.
Eldos Arabic text data mining: A root-based hierarchical indexing model
CN106294310A (en) A kind of Tibetan language tone Forecasting Methodology and system
Black et al. Syntactic annotation: linguistic aspects of grammatical tagging and skeleton parsing
Nunsanga et al. Part-of-speech tagging for mizo language using conditional random field
Moumen et al. Arabic diacritization with gated recurrent unit
Saychum et al. Efficient Thai Grapheme-to-Phoneme Conversion Using CRF-Based Joint Sequence Modeling.
Olivo et al. CRFPOST: Part-of-Speech Tagger for Filipino Texts using Conditional Random Fields
Rajendran et al. Text processing for developing unrestricted Tamil text to speech synthesis system
CN111090720A (en) Hot word adding method and device
Švec et al. Automatic correction of i/y spelling in Czech ASR output
Verulkar et al. Transliterated search of Hindi lyrics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant