CN106294311A - Tibetan tone prediction method and system - Google Patents
- Publication number
- CN106294311A (application number CN201510332794.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- unit
- tone
- speech
- recognitions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a Tibetan tone prediction method and system. The method comprises: receiving a Tibetan text to be processed; performing word segmentation on the text to obtain the word units and the candidate parts of speech of each word unit; determining the type of each word unit; determining the part of speech of each word unit according to its context information in the text and its type; determining the tone information of each word unit according to its context information and its part of speech; and obtaining the tone information of the text from the tone information of the word units. When predicting the tone of a word unit, the invention considers not only the context of the word unit but also the influence of its part of speech on tone, so that tone sandhi in connected Tibetan speech sounds more natural.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a Tibetan tone prediction method and system.
Background art
Tone is the most prominent phonetic feature of the Sino-Tibetan languages, and the study of Tibetan tone has always been one of the focal points of Sino-Tibetan linguistics. Put simply, tone is the pitch of a sound.
Tibetan comprises the Lhasa, Kham and Amdo dialects, among others, with Lhasa Tibetan as the main one. In connected Lhasa speech, the tone contour of the same monosyllable changes considerably across contexts, and the same polysyllabic word also takes different tone contours in different contexts. Words with different tonal realizations often serve as different parts of speech, so tone changes correlate strongly with part of speech, and a change of tone contour can seriously affect how the meaning is understood. If the tone contours of Tibetan cannot be predicted accurately, the performance of Tibetan speech systems in applications is degraded.
Existing Tibetan tone prediction methods are generally rule-based: the initials and finals of syllables are classified, and the tone contour of a syllable is obtained by looking up the combination of its initial class and final class in a tone contour table. However, the existing methods do not analyze the mechanism of tone contour change thoroughly enough. In real connected Tibetan speech, tone sandhi is common: the part of speech and the tone contour of the same word unit can change from one context to another. The rules of the existing methods do not cover these cases, so they can hardly handle tone sandhi in connected speech. Tone prediction therefore performs poorly; for example, the naturalness of synthesized Tibetan speech drops, and even intelligibility can suffer.
Summary of the invention
The embodiments of the present invention provide a Tibetan tone prediction method and system that address the problem that, in real connected Tibetan speech, tone sandhi is common, that is, the part of speech and tone contour of the same word unit can change across contexts. As a result, tone sandhi in connected Tibetan speech sounds more natural.
To this end, the embodiments of the present invention provide the following technical solutions:
A Tibetan tone prediction method, comprising:
Receiving a Tibetan text to be processed;
Performing word segmentation on the text to be processed to obtain the word units and the candidate parts of speech of each word unit;
Determining the type of each word unit;
Determining the part of speech of each word unit according to the context information of the word unit in the text to be processed and the type of the word unit;
Determining the tone information of each word unit according to the context information and the part of speech of the word unit;
Obtaining the tone information of the text to be processed from the tone information of the word units.
Preferably, the type of a word unit is any one or more of the following: multi-tone-pattern word, function word, affix, regular word;
The method further comprises:
Building word unit type dictionaries from a large collection of Tibetan data, the word unit type dictionaries comprising: a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary;
Determining the type of each word unit comprises:
Matching each word unit against the pre-built word unit type dictionaries to determine the type of each word unit.
Preferably, determining the part of speech of a word unit according to its context information in the text to be processed and its type comprises:
For a function word, determining the part of speech of the function word with a statistical modeling method according to the context information of the function word;
For affixes, using the same part of speech for all affixes;
For a multi-tone-pattern word, taking the segmented sequence between the function words before and after it as the segment in which the current multi-tone-pattern word unit lies, and determining the part of speech of the multi-tone-pattern word with a statistical modeling method according to the context information of that segment;
For regular words, using the same part of speech for all regular words.
Preferably, the context information of the multi-tone-pattern word comprises:
(1) the candidate part-of-speech information of the multi-tone-pattern words in the segment, the candidate part-of-speech information being the candidate parts of speech obtained for each word unit during segmentation against the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segment;
(3) the total number of word units in the current segment;
(4) the number of function words before the multi-tone-pattern word in the sentence that contains it;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence that contains it;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
Preferably, predicting the tone of a word unit according to the context information and the part of speech of the word unit comprises:
If the current word unit is a function word, setting the initial of the function word to a low-class initial;
If the current word unit is a multi-tone-pattern word whose part of speech is verb or adjective, splitting the multi-tone-pattern word into syllables to obtain its syllable units;
If the current word unit contains an affix part, marking the affix part attached to the word unit and separating it from the stem to obtain the syllable units of the affix part;
Performing tone prediction on every syllable of all word units in the text to be processed, with the syllable as the tone-bearing unit;
Setting all affixes to weak reading.
A Tibetan tone prediction system, comprising:
A receiving module, for receiving a Tibetan text to be processed;
A word segmentation module, for performing word segmentation on the text to be processed to obtain the word units and the candidate parts of speech of each word unit;
A type determination module, for determining the type of each word unit;
A part-of-speech determination module, for determining the part of speech of each word unit according to the context information of the word unit in the text to be processed and the type of the word unit;
A word unit tone determination module, for determining the tone information of each word unit according to the context information and the part of speech of the word unit;
A text tone acquisition module, for obtaining the tone information of the text to be processed from the tone information of the word units.
Preferably, the system further comprises:
A dictionary building module, for building word unit type dictionaries from a large collection of Tibetan data, the word unit type dictionaries comprising: a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary;
The type determination module is specifically configured to match each word unit against the pre-built word unit type dictionaries to determine the type of each word unit.
Preferably, the part-of-speech determination module comprises:
A function word part-of-speech determination unit, for determining the part of speech of a function word with a statistical modeling method according to the context information of the function word;
An affix part-of-speech determination unit, for assigning the same part of speech to all affixes;
A multi-tone-pattern word part-of-speech determination unit, for taking the segmented sequence between the function words before and after a multi-tone-pattern word as the segment in which the current multi-tone-pattern word unit lies, and determining the part of speech of the multi-tone-pattern word with a statistical modeling method according to the context information of that segment;
A regular word part-of-speech determination unit, for assigning the same part of speech to all regular words.
Preferably, the word unit tone determination module comprises:
A preprocessing unit, for setting the initial of each function word to a low-class initial; for splitting each multi-tone-pattern word whose part of speech is verb or adjective into syllables to obtain its syllable units; and for marking the affix part attached to a word unit and separating it from the stem to obtain the syllable units of the affix part;
A tone prediction unit, for performing tone prediction on every syllable of all word units in the text to be processed, with the syllable as the tone-bearing unit;
A tone adjustment unit, for setting all affixes to weak reading.
The Tibetan tone prediction method and system provided by the embodiments of the present invention perform word segmentation on the Tibetan text to be processed to obtain the word units and the candidate parts of speech of each word unit, determine the type of each word unit in the text and predict its part of speech accordingly, and finally predict the tone of each word unit from its context information in the text and its part of speech. Because tone prediction considers not only the context of a word unit but also the influence of its part of speech on tone, tone sandhi in connected Tibetan speech sounds more natural.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flow chart of the Tibetan tone prediction method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the Tibetan tone prediction system provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations. The following embodiments are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
To better understand the present invention, Tibetan tone prediction in the prior art is first briefly described. Existing methods usually predict the tones of the text to be processed with rules, for example by looking up the combination of initial and final in a tone contour table to obtain the tone contour of a syllable. In Tibetan, the tone of a single syllable is determined by its initial and its final. The initial is made up of the prefix letter, superscript letter, base letter and subscript letter. Initials fall into two tone classes, high and low, which indicate whether the tone of the syllable starts high or low. Generally, a voiceless base letter gives a high-class initial and a voiced base letter gives a low-class initial; a prefix or superscript letter can raise the tone class of the base letter to high. The final is made up of the vowel sign, the suffix letter and the second suffix letter. By their endings, finals fall into three classes: long finals, checked finals and single-vowel finals. To predict the tone of a syllable in the current word unit, the class combination of the syllable's initial and final is first determined from an initial/final class table, which is usually built by domain experts. The tone contour table is then searched to find the tone contour for that combination. The table holds the tone contours for the different combinations of initial and final; it is usually built by rule, and the rules describe the phonetic characteristics of Tibetan. For example, consider a syllable whose initial belongs to the high class, whose prefix letter does not change the class of the initial, and whose final, formed by its vowel sign and suffix letter, is a checked final. The initial/final combination of the syllable is therefore a high-class initial with a checked final, and looking this up in the tone contour table gives the falling tone (f) for the syllable.
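The prior-art table lookup described above can be sketched as follows. This is a minimal illustration, not the table used in practice: only the entry for a high-class initial with a checked final (falling tone, f) comes from the example in the text, and every other entry, like the names `TONE_CONTOUR_TABLE` and `lookup_tone`, is an assumed placeholder.

```python
# Hypothetical tone contour table keyed by (initial class, final class).
# Only ("high", "checked") -> "f" is taken from the worked example in the text;
# all other entries are illustrative placeholders, not attested values.
TONE_CONTOUR_TABLE = {
    ("high", "checked"): "f",
    ("high", "long"): "h",
    ("high", "single_vowel"): "h",
    ("low", "checked"): "r",
    ("low", "long"): "l",
    ("low", "single_vowel"): "l",
}


def lookup_tone(initial_class: str, final_class: str) -> str:
    """Return the tone contour of one syllable by pure table lookup."""
    return TONE_CONTOUR_TABLE[(initial_class, final_class)]


print(lookup_tone("high", "checked"))
```

Because the lookup sees only the syllable itself, it returns the same contour in every context, which is the limitation the following paragraphs discuss.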
However, in real connected Tibetan speech, tone sandhi is common: the tone contour of the same syllable changes from context to context along with the part of speech of the word. The existing method, which predicts the tone contour of a syllable by looking up its initial/final combination in a table, cannot handle word units with multiple tone patterns, whose tone changes with part of speech, and therefore predicts the tones of such word units inaccurately.
The Tibetan tone prediction method and system provided by the present invention perform word segmentation on the Tibetan text to be processed to obtain each word unit and its candidate parts of speech, determine the part of speech of each word unit, and then predict the tone of each word unit from its context information and part of speech. Because the influence of part of speech on the tone of a word unit is taken into account, the problem of changing tone contours of multi-tone-pattern words in connected Tibetan speech is solved, and the performance of speech systems in applications is effectively improved.
For a better understanding of the technical solutions and technical effects, a detailed description is given below with reference to the flow chart and specific embodiments.
Fig. 1 shows the flow chart of the Tibetan tone prediction method provided by an embodiment of the present invention, which comprises the following steps:
Step S01: receive the Tibetan text to be processed.
Step S02: perform word segmentation on the text to be processed to obtain the word units and the candidate parts of speech of each word unit.
In practice, segmentation follows a Tibetan word segmentation scheme, for example segmentation against a pre-built segmentation dictionary, which yields the word units and the candidate parts of speech of each word unit. The segmentation dictionary contains information such as the candidate parts of speech of each word unit.
Step S03: determine the type of each word unit.
In this embodiment, tone changes in Tibetan follow four different patterns. Word units can therefore be divided into four types according to how the tones of their syllables change across contexts in the text to be processed: multi-tone-pattern words, function words, affixes and regular words. The four types are described in turn below:
1) Function words: as isolated syllables, function words with different initials have different tones. With a high-class initial, the tone is h (high) or f (falling) depending on the suffix letter; with a low-class initial, the tone is l (low) or r (rising) depending on the suffix letter. In connected speech, almost all function words, such as case particles, conjunctions, prepositions, modal particles and the like, are read as l or r, that is, relatively low;
2) Multi-tone-pattern words: the tone pattern differs with the part of speech. For example, the word whose Latin transliteration is khrom skor has the tone pattern hh as a noun and fr as a verb;
3) Affixes: pronounced in a weak reading. For example, the affixes ba, wa, bo, pa, po, ma and mo are pronounced much like the neutral tone of Mandarin Chinese;
4) Regular words: the tones follow an inherent sandhi rule. For example, when two syllables that are each read with the low tone l in isolation form a word, the word is often read not as ll but as lh, that is, the low tone of the second syllable becomes high.
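The regular-word example in item 4) can be written as a tiny rule. This is only an illustration of that single example; the function name `apply_regular_word_sandhi` is hypothetical, and the method described in this document predicts tones with a statistical model rather than with a hard-coded rule like this one.

```python
def apply_regular_word_sandhi(isolated_tones):
    """Map a disyllabic low-low word (ll) to low-high (lh); leave others unchanged.

    `isolated_tones` holds the tone of each syllable when read in isolation,
    using the labels h, f, l, r from the text.
    """
    if list(isolated_tones) == ["l", "l"]:
        return ["l", "h"]  # the second low tone is raised to high
    return list(isolated_tones)


print(apply_regular_word_sandhi(["l", "l"]))
```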
In practice, word unit type dictionaries can be built in advance from a large collection of Tibetan data. The word unit type dictionaries comprise a function word dictionary, an affix dictionary and a multi-tone-pattern word dictionary. Each word unit is then matched against the pre-built word unit type dictionaries to determine its type.
In a specific embodiment, the type of a word unit is determined as follows:
Step 1) Label every word unit in the text to be processed as a regular word unit, then perform step 2);
Step 2) Check whether the current word unit is in the function word dictionary. If it is, the current word unit is a function word; perform step 5). Otherwise perform step 3);
Step 3) Use the affix dictionary to check whether the current word unit contains an affix part. A word unit can contain several affix parts, and an affix can also stand alone as a word unit. Checking whether the current word unit contains an affix part, or is itself an affix, may comprise:
First, split the current word unit into single syllables. Let n be the number of syllables, and let m be the largest number of syllables in any single affix of the affix dictionary;
Then, starting from the last syllable of the current word unit, take the last min{m, n} syllables and match them against the affixes in the affix dictionary that have min{m, n} syllables, where min{m, n} denotes the smaller of m and n. If the match succeeds, the affix to which these syllables of the current word unit belong has been found. If it fails, take the last min{m, n}-1 syllables of the current word unit and match them against the affixes that have min{m, n}-1 syllables, and so on until only the last syllable of the current word unit is taken. If a match succeeds along the way, the current word unit contains an affix, and the affix to which the matched syllables belong has been found; if no match succeeds, the current word unit contains no affix. In particular, when all syllables of the current word unit match all syllables of a single affix in the affix dictionary, the current word unit is itself an affix;
Finally, choose the next step according to whether the current word unit contains an affix part or is itself an affix: if it does, or is, perform step 5); if the current word unit contains no affix information, perform step 4);
Step 4) Check whether the part-of-speech information of the current word unit is in the multi-tone-pattern word dictionary. If it is, the current word unit is a multi-tone-pattern word; perform step 5). If not, the current word unit is a regular word; perform step 5);
Step 5) Output the type of the current word unit.
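Steps 1) to 5) can be sketched as below, under several assumptions: the dictionaries are small romanized stand-ins (`gi` and `rgyal po` are illustrative placeholders, while `khrom skor` and the affixes come from the text), the multi-tone-pattern dictionary is keyed by the word itself rather than by part-of-speech information, and a word that merely contains an affix keeps the default label from step 1). The names `find_affix` and `classify` are hypothetical.

```python
# Hypothetical romanized dictionaries; a real system would hold Tibetan-script entries.
FUNCTION_WORDS = {("gi",)}                # illustrative placeholder entry
AFFIXES = {("pa",), ("po",), ("ma",)}     # some of the affixes named in the text
MULTI_TONE_WORDS = {("khrom", "skor")}    # example word from the text


def find_affix(syllables, affixes=AFFIXES):
    """Return the longest dictionary affix that ends the word unit, or None."""
    n = len(syllables)
    m = max((len(a) for a in affixes), default=0)
    # Try the last min(m, n) syllables first, then shorter and shorter tails.
    for k in range(min(m, n), 0, -1):
        tail = tuple(syllables[-k:])
        if tail in affixes:
            return tail
    return None


def classify(syllables):
    """Return (type, affix) for one word unit, following steps 1) to 5)."""
    word = tuple(syllables)
    unit_type = "regular"                  # step 1: default label
    if word in FUNCTION_WORDS:             # step 2: function word dictionary
        return "function", None
    affix = find_affix(syllables)          # step 3: affix dictionary
    if affix is not None:
        if affix == word:                  # the whole unit is an affix
            unit_type = "affix"
        return unit_type, affix            # step 4 is skipped
    if word in MULTI_TONE_WORDS:           # step 4: multi-tone-pattern dictionary
        unit_type = "multi_tone"
    return unit_type, None                 # step 5: output the type


for unit in (["gi"], ["pa"], ["rgyal", "po"], ["khrom", "skor"]):
    print(unit, classify(unit))
```

Matching the longest tail first means a multi-syllable affix is preferred over a shorter affix that happens to end it, which mirrors the min{m, n} countdown in step 3).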
Further, this embodiment can also re-check the type of the current word unit against context information, for example by using the context of a unit judged above to be a function word or an affix to decide again whether it really is a function word unit or an affix unit.
Step S04: determine the part of speech of each word unit according to the context information of the word unit in the text to be processed and the type of the word unit.
Many word units in Tibetan have the same written form but different parts of speech in different contexts, and their tones change accordingly; that is, these word units are ambiguous in part of speech. The part of speech therefore has to be predicted from the context of the word unit. This embodiment predicts the part of speech of a word unit with a statistical modeling method, based on the context information of the word unit in the text to be processed and on its type, for example:
For a function word, the part of speech is determined with a statistical modeling method according to the context information of the function word;
For affixes, all affixes use the same part of speech;
For a multi-tone-pattern word, the segmented sequence between the function words before and after it is taken as the segment in which the current multi-tone-pattern word unit lies, and the part of speech is determined with a statistical modeling method according to the context information of that segment;
For regular words, all regular words use the same part of speech.
In a specific embodiment, the part of speech of a word unit can be predicted as follows:
Step 1) Part-of-speech prediction for function word units. Function word parts of speech include case particle, conjunction, preposition, modal particle and auxiliary verb. The optimal part of speech of a function word is predicted with a statistical modeling method that takes the context of the current function word unit into account. For example, the prediction works sentence by sentence: the candidate parts of speech of all function words in the sentence are considered together, an optimal path through them is determined, and the candidate part of speech on the optimal path is taken as the optimal part of speech of each function word in the sentence.
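One common way to pick an optimal path over per-word candidates is a Viterbi-style search. The text does not name the statistical model, so the sketch below is an assumption: `best_pos_path`, the log-score dictionaries `EMISSION` and `TRANSITION`, and all the numbers in them are hypothetical.

```python
def best_pos_path(candidates, emission, transition, floor=-10.0):
    """Viterbi-style search over the candidate parts of speech of function words.

    candidates: one list of candidate tags per function word in the sentence.
    emission[(i, tag)] and transition[(prev_tag, tag)] are log-scores;
    missing entries fall back to `floor`.
    """
    if not candidates:
        return []
    # best[tag] = (score of the best path ending in tag, that path)
    best = {t: (emission.get((0, t), floor), [t]) for t in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for tag in candidates[i]:
            score, path = max(
                ((s + transition.get((prev, tag), floor), p)
                 for prev, (s, p) in best.items()),
                key=lambda sp: sp[0],
            )
            new_best[tag] = (score + emission.get((i, tag), floor), path + [tag])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]


# Two function words in one sentence, each with two candidate tags (made-up scores).
CANDIDATES = [["case_particle", "conjunction"], ["modal_particle", "auxiliary_verb"]]
EMISSION = {(0, "case_particle"): -0.2, (0, "conjunction"): -1.6,
            (1, "modal_particle"): -0.9, (1, "auxiliary_verb"): -0.5}
TRANSITION = {("case_particle", "auxiliary_verb"): -0.3,
              ("case_particle", "modal_particle"): -1.2,
              ("conjunction", "auxiliary_verb"): -1.0,
              ("conjunction", "modal_particle"): -0.4}

print(best_pos_path(CANDIDATES, EMISSION, TRANSITION))
```

Scoring whole paths rather than each word separately is what lets the choice for one function word depend on the candidates of the others in the same sentence.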
Step 2) Part-of-speech prediction for affix units.
In this embodiment, all affix parts and affixes use the same part of speech; for example, their part of speech can simply be set to affix.
Step 3) Part-of-speech prediction for multi-tone-pattern word units.
The segmented sequence between the function word units before and after a multi-tone-pattern word unit is taken as the segment in which the current multi-tone-pattern word unit lies. The optimal part of speech is predicted with a statistical modeling method according to the context information of that segment, which determines the part of speech of each multi-tone-pattern word unit. The context information of a multi-tone-pattern word comprises:
(1) the candidate part-of-speech information of the multi-tone-pattern words in the segment, the candidate part-of-speech information being the candidate parts of speech obtained for the word units during segmentation against the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segment;
(3) the total number of word units in the current segment;
(4) the number of function words before the multi-tone-pattern word in the sentence that contains it;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence that contains it;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
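The eight context items above can be gathered into a feature dictionary along the following lines. The data layout (a sentence as a list of dictionaries with the keys `type`, `pos`, `candidate_pos` and `has_affix`), the feature names and the function `multi_tone_features` are all assumptions made for illustration; the statistical model that consumes the features is not specified in the text.

```python
def multi_tone_features(sentence, idx):
    """Collect context features (1) to (8) for the multi-tone-pattern word at idx."""
    before = [i for i in range(idx) if sentence[i]["type"] == "function"]
    after = [i for i in range(idx + 1, len(sentence))
             if sentence[i]["type"] == "function"]
    # The segment runs between the nearest function words on either side.
    start = before[-1] + 1 if before else 0
    end = after[0] if after else len(sentence)
    segment = sentence[start:end]
    return {
        "segment_candidate_pos": [u["candidate_pos"] for u in segment
                                  if u["type"] == "multi_tone"],           # (1)
        "position_in_segment": idx - start,                               # (2)
        "segment_length": len(segment),                                   # (3)
        "function_words_before": len(before),                             # (4)
        "nearest_function_pos_before":
            sentence[before[-1]]["pos"] if before else None,              # (5)
        "function_words_after": len(after),                               # (6)
        "nearest_function_pos_after":
            sentence[after[0]]["pos"] if after else None,                 # (7)
        "has_affix": sentence[idx]["has_affix"],                          # (8)
    }


# A made-up five-unit sentence; only the structure matters here.
SENTENCE = [
    {"type": "regular", "pos": "regular", "candidate_pos": ["noun"], "has_affix": False},
    {"type": "function", "pos": "case_particle", "candidate_pos": ["case_particle"], "has_affix": False},
    {"type": "regular", "pos": "regular", "candidate_pos": ["noun"], "has_affix": False},
    {"type": "multi_tone", "pos": None, "candidate_pos": ["noun", "verb"], "has_affix": False},
    {"type": "function", "pos": "auxiliary_verb", "candidate_pos": ["auxiliary_verb"], "has_affix": False},
]

print(multi_tone_features(SENTENCE, 3))
```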
Step 4) Part-of-speech prediction for regular word units.
In this embodiment, all regular words use the same part of speech; for example, their part of speech can simply be set to regular word.
Step S05: determine the tone information of each word unit according to the context information and the part of speech of the word unit.
When predicting the tone of a word unit, this embodiment takes into account the influence of part of speech on the tone of words whose part of speech varies, such as multi-tone-pattern words, so the predicted tones of the word units are more accurate. Before tone prediction, the word units are preprocessed according to their parts of speech, for example:
If the current word unit is a function word, its initial is set to a low-class initial for tone prediction;
If an affix part is attached to the current word unit, the affix part is marked and separated from the stem, which yields the syllable units of the affix part and of the non-affix part of the word unit;
If the current word unit is a multi-tone-pattern word whose part of speech is verb or adjective, the word is split into syllables to obtain its syllable units; otherwise the multi-tone-pattern word as a whole is taken as one syllable unit;
Then, based on the result of this preprocessing, tone prediction is performed on every syllable unit of all word units in the text to be processed, with the syllable unit as the tone-bearing unit.
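The preprocessing rules above can be sketched as a function that turns one word unit into its tone-bearing units. The record layout (`syllables`, `type`, `pos`, `affix_len`), the output keys and the name `preprocess` are hypothetical, and `gi` and `rgyal po` are illustrative placeholders rather than examples from the text.

```python
def preprocess(unit):
    """Turn one word unit into a list of tone-bearing units.

    unit: dict with 'syllables' (list of str), 'type', 'pos' and an optional
    'affix_len' giving how many trailing syllables form the affix part.
    """
    syllables = unit["syllables"]
    split_at = len(syllables) - unit.get("affix_len", 0)
    stem, affix = syllables[:split_at], syllables[split_at:]
    # A multi-tone-pattern word is split into syllables only as a verb or adjective;
    # otherwise the whole word is kept as a single unit.
    keep_whole = unit["type"] == "multi_tone" and unit.get("pos") not in ("verb", "adjective")
    stem_units = [" ".join(stem)] if (keep_whole and stem) else list(stem)
    low = unit["type"] == "function"       # function words get a low-class initial
    result = [{"text": s, "is_affix": False, "low_initial": low} for s in stem_units]
    result += [{"text": s, "is_affix": True, "low_initial": False} for s in affix]
    return result


print(preprocess({"syllables": ["khrom", "skor"], "type": "multi_tone", "pos": "verb"}))
print(preprocess({"syllables": ["khrom", "skor"], "type": "multi_tone", "pos": "noun"}))
```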
In this embodiment, after function words, multi-tone-pattern words, affixes and regular words have been processed as above, tone prediction is performed on each syllable of the text to be processed, syllable by syllable. The contextual features used include, for example, the initial class of the current syllable, the final class of the current syllable, the initial and final classes of the preceding and following syllables, the position of the current syllable in its word unit, the length of that word unit and the part of speech of that word unit. Note that, for tone prediction, all affixes share one part of speech and all regular words share one part of speech.
Finally, after the tones of the word units have been predicted, they can be adjusted according to features specific to Tibetan. For example, all affixes are set to weak reading; common affixes include ba, wa, bo, pa, po, ma and mo.
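The per-syllable features and the final affix adjustment might look like the sketch below. The syllable record layout, the label `weak`, and the names `syllable_features` and `weaken_affixes` are assumptions; the predictor that maps features to tones is left out because the text does not specify it.

```python
def syllable_features(syllables, i):
    """Build the contextual features listed in the text for syllable i."""
    def classes(s):
        return (s["initial_class"], s["final_class"]) if s else (None, None)

    cur = syllables[i]
    prev = syllables[i - 1] if i > 0 else None
    nxt = syllables[i + 1] if i + 1 < len(syllables) else None
    return {
        "initial_class": cur["initial_class"],
        "final_class": cur["final_class"],
        "prev_classes": classes(prev),       # initial/final class of the previous syllable
        "next_classes": classes(nxt),        # initial/final class of the next syllable
        "position_in_word": cur["position_in_word"],
        "word_length": cur["word_length"],
        "word_pos": cur["word_pos"],
    }


def weaken_affixes(syllables, tones):
    """Post-adjustment: every affix syllable is read weakly, whatever was predicted."""
    return ["weak" if s["is_affix"] else t for s, t in zip(syllables, tones)]


# Two syllables of one made-up word: a stem syllable followed by an affix syllable.
SYLLABLES = [
    {"initial_class": "high", "final_class": "long", "position_in_word": 0,
     "word_length": 2, "word_pos": "regular", "is_affix": False},
    {"initial_class": "low", "final_class": "single_vowel", "position_in_word": 1,
     "word_length": 2, "word_pos": "affix", "is_affix": True},
]

print(syllable_features(SYLLABLES, 0))
print(weaken_affixes(SYLLABLES, ["h", "l"]))
```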
Step S06: obtain the tone information of the text to be processed from the tone information of the word units.
The Tibetan tone prediction method provided by the embodiment of the present invention performs word segmentation on the received Tibetan text to obtain each word unit and its candidate parts of speech, determines the type of each word unit, determines the part of speech of each word unit from its type and its context information in the text, and predicts tone from the context information and part of speech of the word unit. Tone prediction according to the present invention thus takes into account the influence of the part of speech of a word unit on its tone. This solves the problem that, in real connected Tibetan speech, tone sandhi is common, that is, the part of speech and tone contour of the same word unit can change across contexts, and makes tone sandhi in connected Tibetan speech sound more natural.
Correspondingly, the present invention also provides a Tibetan tone prediction system, comprising:
a receiving module 201, configured to receive the Tibetan text to be processed;
a word segmentation module 202, configured to perform word segmentation on the Tibetan text to be processed to obtain each word unit and the candidate parts of speech of each word unit;
a type determination module 203, configured to determine the type of each word unit;
a part-of-speech determination module 204, configured to determine the part of speech of the word unit according to the context information of the word unit in the Tibetan text to be processed and the type of the word unit;
a word unit tone determination module 205, configured to determine the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
a text tone acquisition module 206, configured to obtain the tone information of the Tibetan text to be processed according to the tone information of each word unit.
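The module chain 201 to 206 can be pictured as a simple pipeline. The following is a minimal sketch under stated assumptions: the class name `TonePipeline`, its field names, and the callable signatures are hypothetical and are not defined by the patent; each stage is injected as a function so that any concrete segmenter, tagger, or tone model can be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TonePipeline:
    # Module 202: text -> list of (word unit, candidate parts of speech)
    segment: Callable[[str], List[Tuple[str, List[str]]]]
    # Module 203: word unit -> type label
    classify: Callable[[str], str]
    # Module 204: (word units, candidate POS lists, types) -> one POS per word unit
    tag: Callable[[List[str], List[List[str]], List[str]], List[str]]
    # Module 205: (word units, POS) -> one tone label per word unit
    predict: Callable[[List[str], List[str]], List[str]]

    def run(self, text: str) -> List[Tuple[str, str]]:
        """Module 201 receives the text; module 206 assembles the result."""
        segmented = self.segment(text)
        words = [w for w, _ in segmented]
        candidates = [c for _, c in segmented]
        types = [self.classify(w) for w in words]
        pos = self.tag(words, candidates, types)
        tones = self.predict(words, pos)
        # Text-level tone information is the word-level tones in text order.
        return list(zip(words, tones))
```

Passing the stages in as callables keeps the sketch independent of any particular model; in the embodiment the stages are fixed modules of one system.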
In the present embodiment, the Tibetan text to be processed that the receiving module 201 receives may be Tibetan text composed of Tibetan words of any type.
In practical applications, the word segmentation module 202 segments the Tibetan text to be processed with the word as the unit, obtaining each word unit. The segmentation principles are adapted to Tibetan information processing so that the text can be processed automatically by computer.
Then, the type determination module 203 determines the type of each word unit. In the present embodiment, the type of each word unit may be determined by dictionary lookup, in which case the system further includes:
a dictionary construction module 303, configured to build word unit type dictionaries by collecting a large amount of Tibetan data, the word unit type dictionaries including: a function word dictionary, an affix dictionary, and a multi-tone-pattern word dictionary. Correspondingly, the type determination module 203 is specifically configured to determine the type of each word unit by matching each word unit against each pre-built word unit type dictionary.
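Dictionary-based type determination might look like the sketch below. It is illustrative only: the function name `classify_word_unit` and the type labels are hypothetical, the lookup order among the three dictionaries is an assumption, and treating any unmatched word unit as an ordinary word is inferred from the fact that only three dictionaries are built for four types.

```python
from typing import Dict, Set

# Assumed lookup order; the patent does not specify one.
LOOKUP_ORDER = ("function", "affix", "multi_tone")


def classify_word_unit(word: str, type_dicts: Dict[str, Set[str]]) -> str:
    """Return the type of a word unit by matching it against the type dictionaries."""
    for type_name in LOOKUP_ORDER:
        if word in type_dicts.get(type_name, set()):
            return type_name
    # Anything not found in a dictionary is treated as an ordinary word (assumption).
    return "ordinary"
```

For example, with an affix dictionary containing `pa`, `classify_word_unit("pa", {"affix": {"pa"}})` yields `"affix"`, while a word found in no dictionary yields `"ordinary"`.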
In the present embodiment, the part-of-speech determination module 204 may determine the part of speech of each word unit according to the type of the word unit. The processing performed by the part-of-speech determination module 204 differs for different types of word unit. Specifically, the part-of-speech determination module 204 may include:
a function word part-of-speech determination unit, configured to determine the part of speech of a function word by a statistical modeling method according to the context information of the function word;
an affix part-of-speech determination unit, configured to assign all affixes the same part of speech;
a multi-tone-pattern word part-of-speech determination unit, configured to take the segmented sequence between the function words before and after a multi-tone-pattern word as the segmentation fragment in which the current multi-tone-pattern word unit is located, and to determine the part of speech of the multi-tone-pattern word by a statistical modeling method according to the context information of that location;
an ordinary word part-of-speech determination unit, configured to assign all ordinary words the same part of speech.
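The segmentation fragment used for multi-tone-pattern words can be located as sketched below, together with a few of the contextual features enumerated in claim 4 (position in the fragment, fragment size, and function word counts before and after). This is a sketch under assumptions: the function names and feature keys are hypothetical, the bounding function words are assumed to be excluded from the fragment, and the list of types is assumed to cover exactly one sentence.

```python
from typing import Dict, List, Tuple


def fragment_bounds(types: List[str], idx: int) -> Tuple[int, int]:
    """Return [start, end) of the span between the nearest function words around idx."""
    start = idx
    while start > 0 and types[start - 1] != "function":
        start -= 1
    end = idx + 1
    while end < len(types) and types[end] != "function":
        end += 1
    return start, end


def fragment_features(types: List[str], idx: int) -> Dict[str, int]:
    """Compute positional and count features for the word unit at idx."""
    start, end = fragment_bounds(types, idx)
    return {
        "pos_in_fragment": idx - start,   # position of the word in its fragment
        "fragment_size": end - start,     # number of word units in the fragment
        "function_words_before": sum(t == "function" for t in types[:idx]),
        "function_words_after": sum(t == "function" for t in types[idx + 1:]),
    }
```

The resulting features would be combined with the candidate part-of-speech information before being passed to the statistical model, which the embodiment does not name here.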
Then, the word unit tone determination module 205 predicts the tone of each word unit according to the part of speech of each word unit obtained by the part-of-speech determination module 204 and the context information of the word unit.
In practical applications, the word unit tone determination module 205 includes:
a preprocessing unit, configured to set the initial of a function word to a low-class initial, and, for a multi-tone-pattern word whose part of speech is verb or adjective, to split the multi-tone-pattern word in units of syllables to obtain each syllable unit of the multi-tone-pattern word;
a tone prediction unit, configured to perform tone prediction on each syllable of all word units in the Tibetan text to be processed, with the syllable as the tone-bearing unit.
Further, this module may also adjust the tones of the word units, and the word unit tone determination module 205 may further include:
a tone adjustment unit, configured to set all affixes to the weak reading.
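The preprocessing unit's two rules can be sketched as follows. This is a hedged illustration rather than the actual implementation: the `Unit` class, the `"low"` class label, and the representation of a syllable as an `(initial_class, rest)` pair are hypothetical, and applying the low-class rule to every syllable of a function word is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    syllables: Tuple[Tuple[str, str], ...]  # (initial_class, rest) per syllable
    type: str                               # "function", "multi_tone", "affix", "ordinary"
    pos: str                                # part of speech from the previous step


def preprocess(units: List[Unit]) -> List[Unit]:
    """Apply the two preprocessing rules before syllable-level tone prediction."""
    out: List[Unit] = []
    for u in units:
        if u.type == "function":
            # Rule 1: function word initials are set to the low class.
            lowered = tuple(("low", rest) for _, rest in u.syllables)
            out.append(replace(u, syllables=lowered))
        elif u.type == "multi_tone" and u.pos in ("verb", "adjective"):
            # Rule 2: verb/adjective multi-tone-pattern words are split by syllable.
            out.extend(replace(u, syllables=(s,)) for s in u.syllables)
        else:
            out.append(u)
    return out
```

All other word units pass through unchanged, so the tone prediction unit still sees every syllable of the text.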
In the Tibetan tone prediction system provided by the embodiment of the present invention, the word segmentation module 202 performs word segmentation on the received Tibetan text to be processed to obtain each word unit and its candidate parts of speech. The type determination module 203 then determines the type of each word unit, and the part-of-speech determination module 204 determines the part of speech of each word unit according to the type of the word unit and the context information of the word unit. Because the part of speech of each word unit is taken into account when the tone of the word unit is predicted, the tone information of each word unit determined by the word unit tone determination module 205 is more accurate. This addresses the problem that, in actual continuous Tibetan speech, tone patterns frequently undergo tone sandhi, that is, the part of speech and tone pattern of the same word unit change across contexts, so that tone sandhi in continuous Tibetan speech sounds more natural.
The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, refer to the other embodiments; each embodiment focuses on what differs from the others. The system embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, see the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above, and specific embodiments are used herein to explain the invention. The description of the above embodiments is intended only to help in understanding the method and apparatus of the present invention. At the same time, for those of ordinary skill in the art, both the specific embodiments and the scope of application may vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (9)
1. A Tibetan tone prediction method, characterized by comprising:
receiving Tibetan text to be processed;
performing word segmentation on the Tibetan text to be processed to obtain each word unit and the candidate parts of speech of each word unit;
determining the type of each word unit;
determining the part of speech of the word unit according to the context information of the word unit in the Tibetan text to be processed and the type of the word unit;
determining the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
obtaining the tone information of the Tibetan text to be processed according to the tone information of each word unit.
2. The method according to claim 1, characterized in that the type of the word unit includes any one or more of the following: multi-tone-pattern word, function word, affix, ordinary word;
the method further comprises:
building word unit type dictionaries by collecting a large amount of Tibetan data, the word unit type dictionaries including: a function word dictionary, an affix dictionary, and a multi-tone-pattern word dictionary;
the determining the type of each word unit comprises:
determining the type of each word unit by matching each word unit against each pre-built word unit type dictionary.
3. The method according to claim 2, characterized in that the determining the part of speech of the word unit according to the context information of the word unit in the Tibetan text to be processed and the type of the word unit comprises:
for a function word, determining the part of speech of the function word by a statistical modeling method according to the context information of the function word;
for affixes, using the same part of speech for all affixes;
for a multi-tone-pattern word, taking the segmented sequence between the function words before and after it as the segmentation fragment in which the current multi-tone-pattern word unit is located, and determining the part of speech of the multi-tone-pattern word by a statistical modeling method according to the context information of that location;
for ordinary words, using the same part of speech for all ordinary words.
4. The method according to claim 3, characterized in that the context information of the multi-tone-pattern word includes:
(1) the candidate part-of-speech information of the multi-tone-pattern words in the segmentation fragment, the candidate part-of-speech information being the candidate parts of speech of the word units obtained when word segmentation is performed according to the segmentation dictionary;
(2) the position of the current multi-tone-pattern word in the segmentation fragment;
(3) the total number of word units in the current segmentation fragment;
(4) the number of function words before the multi-tone-pattern word in the sentence containing the multi-tone-pattern word;
(5) the part of speech of the nearest function word before the multi-tone-pattern word;
(6) the number of function words after the multi-tone-pattern word in the sentence containing the multi-tone-pattern word;
(7) the part of speech of the nearest function word after the multi-tone-pattern word;
(8) whether the current multi-tone-pattern word contains an affix part.
5. The method according to claim 1, characterized in that the predicting the tone of the word unit according to the context information of the word unit and the part of speech of the word unit comprises:
if the current word unit is a function word, setting the initial of the function word to a low-class initial;
if the current word unit is a multi-tone-pattern word and its part of speech is verb or adjective, splitting the multi-tone-pattern word in units of syllables to obtain each syllable unit of the multi-tone-pattern word;
if the current word unit contains an affix part, marking the affix part attached to the word unit and separating it from the stem part to obtain the syllable units of the affix part;
performing tone prediction on each syllable of all word units in the Tibetan text to be processed, with the syllable as the tone-bearing unit;
setting all affixes to the weak reading.
6. A Tibetan tone prediction system, characterized by comprising:
a receiving module, configured to receive Tibetan text to be processed;
a word segmentation module, configured to perform word segmentation on the Tibetan text to be processed to obtain each word unit and the candidate parts of speech of each word unit;
a type determination module, configured to determine the type of each word unit;
a part-of-speech determination module, configured to determine the part of speech of the word unit according to the context information of the word unit in the Tibetan text to be processed and the type of the word unit;
a word unit tone determination module, configured to determine the tone information of the word unit according to the context information of the word unit and the part of speech of the word unit;
a text tone acquisition module, configured to obtain the tone information of the Tibetan text to be processed according to the tone information of each word unit.
7. The system according to claim 6, characterized in that the system further comprises:
a dictionary construction module, configured to build word unit type dictionaries by collecting a large amount of Tibetan data, the word unit type dictionaries including: a function word dictionary, an affix dictionary, and a multi-tone-pattern word dictionary;
the type determination module being specifically configured to determine the type of each word unit by matching each word unit against each pre-built word unit type dictionary.
8. The system according to claim 7, characterized in that the part-of-speech determination module comprises:
a function word part-of-speech determination unit, configured to determine the part of speech of a function word by a statistical modeling method according to the context information of the function word;
an affix part-of-speech determination unit, configured to assign all affixes the same part of speech;
a multi-tone-pattern word part-of-speech determination unit, configured to take the segmented sequence between the function words before and after a multi-tone-pattern word as the segmentation fragment in which the current multi-tone-pattern word unit is located, and to determine the part of speech of the multi-tone-pattern word by a statistical modeling method according to the context information of that location;
an ordinary word part-of-speech determination unit, configured to assign all ordinary words the same part of speech.
9. The system according to claim 6, characterized in that the word unit tone determination module comprises:
a preprocessing unit, configured to set the initial of a function word to a low-class initial; for a multi-tone-pattern word whose part of speech is verb or adjective, to split the multi-tone-pattern word in units of syllables to obtain each syllable unit of the multi-tone-pattern word; and to mark the affix part attached to a word unit and separate it from the stem part to obtain the syllable units of the affix part;
a tone prediction unit, configured to perform tone prediction on each syllable of all word units in the Tibetan text to be processed, with the syllable as the tone-bearing unit;
a tone adjustment unit, configured to set all affixes to the weak reading.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510332794.6A CN106294311B (en) | 2015-06-12 | 2015-06-12 | A kind of Tibetan language tone prediction technique and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510332794.6A CN106294311B (en) | 2015-06-12 | 2015-06-12 | A kind of Tibetan language tone prediction technique and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106294311A (en) | 2017-01-04 |
CN106294311B (en) | 2019-03-19 |
Family
ID=57650055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510332794.6A Active CN106294311B (en) | 2015-06-12 | 2015-06-12 | A kind of Tibetan language tone prediction technique and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294311B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111444676A (en) * | 2018-12-28 | 2020-07-24 | 北京深知无限人工智能研究院有限公司 | Part-of-speech tagging method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103440236A (en) * | 2013-09-16 | 2013-12-11 | 中央民族大学 | United labeling method for syntax of Tibet language and semantic roles |
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
CN104538025A (en) * | 2014-12-23 | 2015-04-22 | 西北师范大学 | Method and device for converting gestures to Chinese and Tibetan bilingual voices |
Non-Patent Citations (3)
Title |
---|
林秀艳: "藏汉语声调异同之比较", 《阿坝师范高等专科学校学报》 * |
索南扎西: "藏语语音合成关键技术研究", 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 * |
羊毛卓么: "藏文词性自动标注系统的研究与实现", 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN106294311B (en) | 2019-03-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||