CN105426355A

CN105426355A - Syllabic size based method and apparatus for identifying Tibetan syntax chunk

Info

Publication number: CN105426355A
Application number: CN201510711234.1A
Authority: CN
Inventors: 史树敏; 王天航; 黄河燕; 龙从军
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2016-03-23

Abstract

The present invention relates to a syllabic size based method and apparatus for identifying a Tibetan syntax chunk, and belongs to the technical field of machine translation in computer application technology. The method according to the present invention comprises: firstly, preprocessing an original Tibetan corpus to delete non-Tibetan text; then, performing identification by using a pre-trained syntax marker identification model M1, to acquire a syntax marker type; restoring text with the syntax marker type being an adhesion form, to acquire a standard corpus without the adhesion form; and finally, for the standard corpus, using a pre-trained syntax chunk identification model M2 to directly perform chunk identification on a functional chunk. Compared with the prior art, in the method and apparatus provided by the present invention, the functional chunk is directly identified without word segmentation and part-of-speech tagging, thereby reducing time and space costs of preprocessing while avoiding poor performance in identifying function chunks caused by inaccuracy of word segmentation and part-of-speech tagging.

Description

A syllable-granularity method and device for identifying Tibetan syntactic chunks

Technical field

The invention belongs to the field of computer application technology and relates to a syllable-granularity method and device for identifying Tibetan syntactic chunks, applicable to fields such as machine translation.

Background art

Automatic chunk identification is an active research topic in natural language processing. As a preprocessing step, chunk parsing can greatly reduce the complexity of phrase-based syntactic analysis and provides basic support for subsequent syntactic and semantic analysis. Because it simplifies syntactic analysis to some extent, it has been applied in many practical systems such as machine translation and question answering.

The goal of Tibetan syntactic chunk identification is to correctly mark the boundaries and types of the syntactic chunks that make up a Tibetan sentence. Existing chunk parsing research identifies syntactic chunks only after the corpus has been word-segmented and part-of-speech tagged. However, Tibetan word segmentation and part-of-speech tagging do not yet meet practical requirements, and their relatively high error rates greatly reduce the accuracy of the subsequent chunk identification stage. Through in-depth linguistic analysis, the inventors found that, owing to inherent characteristics of Tibetan, certain syntactic markers that actually occur in the language carry semantic information that is effective for identifying chunk types; if these syntactic markers are identified directly, the goal of chunk parsing can be achieved.

Summary of the invention

The object of the invention is to solve the problem of syntactic chunk identification in Tibetan intelligent information processing by proposing a syllable-granularity method for identifying Tibetan syntactic chunks. The method identifies Tibetan syntactic chunks directly, using the syllable as the unit of granularity. It avoids the drawback of existing conventional methods, which must first complete Tibetan word segmentation and part-of-speech tagging; it reduces the time and space cost of that preprocessing; and it also avoids the loss in chunk parsing performance that is directly caused by inaccurate word segmentation and part-of-speech tagging.

A syllable-granularity method for identifying Tibetan syntactic chunks comprises the following steps:

Step 1: perform text preprocessing on the input corpus to obtain a normalized sentence corpus S;

Step 2: apply the pre-trained syntactic marker recognition model M₁ to S to obtain the syntactic marker types;

Step 3: restore the text whose syntactic marker type obtained in Step 2 is an adhesion form, yielding a standard corpus free of adhesion forms;

Step 4: apply the pre-trained syntactic chunk recognition model M₂ to the standard corpus obtained in Step 3 to perform chunk identification and obtain the chunk type results.
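The four steps above can be read as a simple pipeline. The following is a minimal sketch of that control flow only; the function name `identify_chunks` and the four callables passed to it are hypothetical stand-ins (in particular, `tag_markers` and `tag_chunks` stand in for the trained models M₁ and M₂) and are not defined by the patent.

```python
from typing import Callable, List, Tuple

Syllables = List[str]


def identify_chunks(
    raw_text: str,
    preprocess: Callable[[str], List[Syllables]],
    tag_markers: Callable[[Syllables], List[str]],  # stands in for M1
    restore: Callable[[Syllables, List[str]], Tuple[Syllables, List[str]]],
    tag_chunks: Callable[[Syllables, List[str]], List[str]],  # stands in for M2
) -> List[List[Tuple[str, str]]]:
    """Run the four steps in order and pair each syllable with its chunk tag."""
    results = []
    for sentence in preprocess(raw_text):                # Step 1: sentences of syllables
        markers = tag_markers(sentence)                  # Step 2: marker type per syllable
        syllables, markers = restore(sentence, markers)  # Step 3: undo adhesion forms
        chunk_tags = tag_chunks(syllables, markers)      # Step 4: chunk tag per syllable
        results.append(list(zip(syllables, chunk_tags)))
    return results
```

Passing the models in as callables keeps the sketch independent of any particular CRF toolkit, which the text does not name.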

In a specific embodiment of the invention, the text preprocessing of Step 1 comprises:

1. creating and collecting corpus data, the sources of which include but are not limited to teaching materials, scientific magazines, periodicals, newspapers and Tibetan-language website text;

2. preprocessing the corpus data to delete meaningless data, where meaningless data means text in other languages mixed into the Tibetan corpus;

3. further cutting the text into sentences, so that the corpus is segmented into a sequence of text units each consisting of one sentence.
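The cleaning and sentence-cutting steps can be sketched as follows. This is an illustrative sketch under two assumptions that the text does not state: non-Tibetan text is approximated as any character outside the Tibetan Unicode block (U+0F00 to U+0FFF), and the shad (U+0F0D) is taken as the sentence-ending mark. The function name `clean_and_split` is hypothetical.

```python
import re
from typing import List

# Assumption: anything outside the Tibetan Unicode block (other than
# whitespace) counts as "text in other languages" and is deleted.
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]+")

# Assumption: the shad (U+0F0D) marks the end of a sentence.
SHAD = "\u0F0D"


def clean_and_split(raw_text: str) -> List[str]:
    """Delete non-Tibetan characters, then cut the text into sentences."""
    cleaned = NON_TIBETAN.sub("", raw_text)
    sentences = []
    for piece in cleaned.split(SHAD):
        piece = piece.strip()
        if piece:
            # Re-attach the sentence-ending mark removed by split().
            sentences.append(piece + SHAD)
    return sentences
```

A real rule set would likely need more sentence-boundary marks and special handling of digits and punctuation; those details are left open here.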

In a specific embodiment of the invention, training the Tibetan syntactic marker recognition model comprises:

1. annotating the corpus with syntactic markers sentence by sentence, creating the syllable-level syntactic annotation scheme of the corpus;

2. feeding the annotated corpus into CRFs through a specific feature template and training the syntactic marker CRFs recognition model.

In a specific embodiment of the invention, restoring the adhesion forms comprises:

1. according to the syntactic marker type, splitting each syllable recognized as an adhesion form by the rule that applies to its type, reducing it to single-syllable text;

2. after splitting, inserting the syllable delimiter "." between the resulting syllables, which completes the restoration of the original corpus.

In a specific embodiment of the invention, training the syntactic chunk recognition model comprises:

1. annotating each sentence with chunk types, creating the chunk annotation scheme of the corpus, that is, assigning each syllable of each sentence to a chunk type;

2. feeding the annotated corpus into CRFs through a specific feature template and training the chunk type CRFs recognition model.

A syllable-granularity device for identifying Tibetan syntactic chunks comprises, connected in sequence, a text preprocessing module, a syntactic marker recognition module, an adhesion form restoration module and a chunk type recognition module;

The text preprocessing module processes the input corpus text and cuts it into sentences that can be used for syntactic chunk identification;

The syntactic marker recognition module performs syntactic recognition on the sentences output by the text preprocessing module to obtain the syntactic markers;

The adhesion form restoration module uses the syntactic markers output by the syntactic marker recognition module to restore the adhesion forms in the original sentence, yielding a sentence in non-adhesion written form that carries the syntactic markers;

The chunk type recognition module performs syntactic chunk type recognition on the sentences output by the adhesion form restoration module and outputs the recognition results.

Beneficial effects

The original Tibetan text does not pass through word segmentation or part-of-speech tagging. Instead, chunk types are identified directly at the syllable level by exploiting the characteristics of Tibetan itself. This constitutes a new Tibetan chunk parsing method that can provide basic support for deeper intelligent processing such as Tibetan syntactic and semantic analysis.

Brief description of the drawings

To illustrate the technical solutions of the embodiments more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 is a flow diagram of a syllable-granularity method for identifying Tibetan syntactic chunks according to an embodiment of the invention.

Fig. 2 is a structural diagram of a syllable-granularity device for identifying Tibetan syntactic chunks according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below. The described embodiments are obviously only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

The syllable-granularity method for identifying Tibetan syntactic chunks according to the invention, as shown in Fig. 1, comprises the following steps:

One. Text preprocessing: obtain the original Tibetan text and split it into sentences.

In this embodiment, the corpus data are created and collected by manual entry and by crawling web text with a web crawler. Meaningless data are then deleted, and finally a rule-based method uses sentence-boundary marks to cut the resulting original text into single-sentence units. Examples of sentence cutting are shown in Table 1:

Table 1: Tibetan text preprocessing (sentence cutting) examples

Two. Syntactic marker recognition

(1) Syntactic marker recognition model training

In this embodiment, the pre-trained syntactic marker recognition model M₁ is applied to S to obtain the syntactic marker types, so M₁ must first be trained on a corpus. The rationale is as follows. Tibetan is written in units of syllables, with adjacent syllables separated by the delimiter "." (the Tibetan tsheg); syllables combine into words, and words combine into sentences. In this respect Tibetan closely resembles Chinese, which also builds complete sentences from syllable units. Modern Tibetan, however, has a special written form, the adhesion form, in which two or three syllables are fused together and cannot be separated by "." in between.

Tibetan is rich in syntactic markers. Here, syntactic markers are formal markers in modern Tibetan (case markers and particle markers) that divide a sentence into syntactic chunks with different functions: for example, an adverbial chunk expressing time or place may be followed by a locative marker, a subject chunk may be followed by an agentive marker, and an object chunk may be followed by an object case marker. These markers are the basis of syllable-granularity chunk identification. For orthographic reasons, however, some case and particle markers cause two syllables to be contracted into one, which is the adhesion form described above. To make full use of the case and particle markers, it is necessary not only to use the markers that occur as standalone syllables but also to separate out the markers contained in adhesion-form syllables; together these markers serve as the features for recognizing the boundaries of syntactic function chunks. The syntactic marker recognition model M₁ is therefore trained by machine learning. In this embodiment, the training process is described taking a conditional random field (CRFs) recognition model as an example:

First, the text obtained in Step 1 is manually annotated with syntactic markers. There are six annotation types: SS, VV, RR, CC, M and N. SS denotes an adhesion-form syllable made up of a syllable fused with an agentive/instrumental case marker; VV denotes an adhesion-form syllable made up of a syllable fused with a genitive marker; RR denotes an adhesion-form syllable made up of a syllable fused with a dative/locative case marker; CC denotes an adhesion-form syllable made up of a syllable fused with a punctuation word; M denotes a case marker or particle marker that is a standalone syllable rather than an adhesion form; and N denotes a syllable that is not a syntactic marker, that is, every syllable other than SS, VV, RR, CC and M is labeled N.

Specific examples of adhesion-form annotation are shown in Table 2.

Table 2: Adhesion-form annotation examples

Next, based on the text obtained in Step 1 and the manual annotations, a specific CRFs feature template is set up and used to train the final syntactic marker CRFs recognition model M₁. The choice of template is therefore critical. Based on extensive experiments, the CRFs feature template in this embodiment consists of the syllable form feature and the syntactic marker. Syllable form feature: the written forms of the current syllable and of its neighboring syllables are taken as features; preferably the window size is 5, that is, the current syllable plus the two syllables before it and the two after it. Syntactic marker: the syntactic marker type of the current syllable is taken as the syntactic marker.

Further, during template construction, syllables at the beginning or end of a sentence have fewer than two syllables before or after them, and the missing positions are filled with NULL. Taking sentence (2) of Step 1 as an example, suppose the current syllable is the first syllable of the sentence. Because it begins the sentence, the two syllables before it are empty, so the corresponding features are NULL and NULL; the two positions after it are filled by the two syllables that follow. The feature for this syllable is therefore NULL, NULL, the syllable itself and its two following syllables. The other syllables are handled in the same way, and the resulting templated corpus for training is shown in Table 3.

Table 3: Syntactic marker recognition template example
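The window-of-5 feature construction with NULL padding described above can be sketched as follows. The function name `window_features` and its parameters are hypothetical; only the window size and the NULL filler come from the text.

```python
from typing import List


def window_features(
    syllables: List[str], half_window: int = 2, pad: str = "NULL"
) -> List[List[str]]:
    """Return, for each syllable, the syllable with its two left and two
    right neighbours, using `pad` where the sentence has no neighbour."""
    padded = [pad] * half_window + list(syllables) + [pad] * half_window
    width = 2 * half_window + 1
    # Row i covers original positions i-2 .. i+2 of the sentence.
    return [padded[i:i + width] for i in range(len(syllables))]
```

For a three-syllable sentence `["a", "b", "c"]` (Latin placeholders rather than Tibetan syllables), the first row is `["NULL", "NULL", "a", "b", "c"]`, matching the sentence-initial case discussed above.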

(2) Recognition

Once M₁ has been obtained, the input corpus only needs to be converted into syllable form features according to the construction rule in the feature template and passed to M₁ for recognition, which yields the syntactic marker corresponding to each feature, namely SS, CC, VV, RR, M or N.

Three. Adhesion form restoration

For the syllables that the previous step recognized as adhesion forms (SS, CC, VV or RR), the adhesion-form marker combination rules of Step 2 are used to cut the fused syllable apart again, with "." inserted in between, restoring the two original standalone syllables. For example, a syllable from Step 2 whose adhesion-form marker is RR is split into its two component syllables by the corresponding rule. The specific cutting rules are shown in Table 4, where "/" marks the position of the cut.

Table 4: Cutting method examples
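A marker-driven split of this kind can be sketched as a suffix lookup. The rule table below is a hypothetical illustration and does not reproduce the cutting rules of Table 4; the suffixes chosen (for example U+0F66 for the agentive/instrumental marker), the use of the tsheg (U+0F0B) as delimiter, and the names `SPLIT_RULES`, `split_adhesion` and `restore_sentence` are all assumptions.

```python
from typing import Dict, List, Tuple

TSHEG = "\u0F0B"  # Tibetan syllable delimiter (assumed rendering of ".")

# Hypothetical suffix rules keyed by marker type; NOT the patent's Table 4.
SPLIT_RULES: Dict[str, Tuple[str, ...]] = {
    "SS": ("\u0F66",),                       # agentive/instrumental suffix
    "VV": ("\u0F60\u0F72",),                 # genitive suffix
    "RR": ("\u0F62",),                       # dative/locative suffix
    "CC": ("\u0F60\u0F44", "\u0F60\u0F58"),  # illustrative particle suffixes
}


def split_adhesion(syllable: str, marker: str) -> List[str]:
    """Split an adhesion-form syllable into [stem, marker suffix]."""
    for suffix in SPLIT_RULES.get(marker, ()):
        if syllable.endswith(suffix) and len(syllable) > len(suffix):
            return [syllable[: -len(suffix)], suffix]
    return [syllable]  # not an adhesion form, or no rule matched


def restore_sentence(syllables: List[str], markers: List[str]) -> str:
    """Split every adhesion form and rejoin all syllables with the delimiter."""
    parts: List[str] = []
    for syllable, marker in zip(syllables, markers):
        parts.extend(split_adhesion(syllable, marker))
    return TSHEG.join(parts)
```

Syllables labeled M or N have no entry in the rule table and pass through unchanged, which mirrors the text: only SS, CC, VV and RR syllables are cut.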

Four. Syntactic chunk identification

(1) Syntactic chunk recognition model training

In this step the pre-trained syntactic chunk recognition model M₂ is applied to the corpus to perform chunk identification and obtain the chunk type results, so M₂ must first be trained on a corpus. The training process is again described taking a CRFs recognition model as an example:

First, building on Step 3, the corpus is annotated with chunk types to establish the chunk type annotation scheme. The chunk types are: subject chunk (S), predicate chunk (P), object chunk (O), adverbial chunk (D), complement chunk (C), and the syntactic marker chunk (M), which is introduced for ease of processing. Here, syntactic markers are formal markers in modern Tibetan (case markers and particle markers) that divide a sentence into syntactic chunks with different functions: for example, an adverbial chunk expressing time or place may be followed by a locative marker, a subject chunk may be followed by an agentive marker, and an object chunk may be followed by an object case marker. These markers are the basis of syllable-granularity chunk identification.

Next, the purpose of training in this step is to obtain the syntactic chunk recognition model M₂. Feature selection is critical to building the recognition model. Based on extensive experimental results, the CRFs feature template in this embodiment consists of the syllable feature and its chunk type label.

Syllable feature: the current syllable and its neighboring syllables, together with the syntactic marker of the current syllable, are taken as the syllable feature. A window size of 5, that is, the current syllable plus the two syllables before it and the two after it, gives better results.
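One way to picture the training input for M₂ is as one row per syllable holding the five-syllable window, the syntactic marker of the current syllable and, for training data, the chunk label. The sketch below assumes a tab-separated column layout, which the text does not specify; the function name `chunk_feature_rows` is hypothetical.

```python
from typing import List, Optional


def chunk_feature_rows(
    syllables: List[str],
    markers: List[str],
    labels: Optional[List[str]] = None,
    pad: str = "NULL",
) -> List[str]:
    """Build one tab-separated row per syllable:
    prev2, prev1, current, next1, next2, marker[, chunk label]."""
    if len(syllables) != len(markers):
        raise ValueError("syllables and markers must have the same length")
    padded = [pad, pad] + list(syllables) + [pad, pad]
    rows = []
    for i in range(len(syllables)):
        row = padded[i:i + 5] + [markers[i]]
        if labels is not None:
            row.append(labels[i])  # label column is present only for training
        rows.append("\t".join(row))
    return rows
```

Omitting `labels` gives the same rows without the final column, which is the form needed at recognition time.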

During template construction, syllables at the beginning or end of a sentence have fewer than two syllables before or after them, and the missing positions are filled with NULL. Taking sentence (2) in Table 1 as an example, its adhesion-form annotation is as follows:

Cutting it with the method of Step 3 gives the following sequence:

For ease of processing in this embodiment, the syntactic markers SS, CC, VV and RR are abbreviated after cutting as S, C, V and R respectively.

The chunk type label mainly records the chunk to which the current syllable belongs. To facilitate subsequent processing, this embodiment labels not only the chunk to which the current syllable belongs but also its position within that chunk, using two letters. The first letter gives the position of the syllable within the chunk: B for the beginning, I for the middle and E for the end; a chunk consisting of a single syllable is labeled with B only. The second letter gives the chunk type. Punctuation marks are labeled directly with B. For example, "B-S" means that the syllable is the first syllable of a subject chunk. The resulting templated examples for training are shown in Table 5.

Table 5: Syntactic chunk type recognition template example
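The position-plus-type labeling scheme described above can be sketched as follows. The representation of a sentence as a list of (syllables, chunk type) pairs, with `None` marking punctuation, is an assumption made for this example; the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def tag_chunk(syllables: List[str], chunk_type: str) -> List[Tuple[str, str]]:
    """Label each syllable of one chunk as B (first), I (middle) or E (last),
    followed by the chunk type. A one-syllable chunk gets B only."""
    last = len(syllables) - 1
    tagged = []
    for i, syllable in enumerate(syllables):
        if i == 0:
            position = "B"
        elif i == last:
            position = "E"
        else:
            position = "I"
        tagged.append((syllable, f"{position}-{chunk_type}"))
    return tagged


def tag_sentence(
    chunks: List[Tuple[List[str], Optional[str]]]
) -> List[Tuple[str, str]]:
    """Label a whole sentence; chunk_type None marks punctuation (plain B)."""
    tagged: List[Tuple[str, str]] = []
    for syllables, chunk_type in chunks:
        if chunk_type is None:
            tagged.extend((s, "B") for s in syllables)
        else:
            tagged.extend(tag_chunk(syllables, chunk_type))
    return tagged
```

With Latin placeholders, a three-syllable subject chunk `["a", "b", "c"]` is labeled B-S, I-S, E-S, and a single-syllable predicate chunk is labeled B-P.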

(2) Recognition

Once M₂ has been obtained, the input corpus only needs to be converted into syllable features according to the construction rule in the feature template above and passed to M₂ for recognition, which yields the chunk type label of each syllable. Writing the label of each syllable back into the standard corpus of Step 3 and outputting it gives the final syntactic chunk type recognition result.
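Turning the per-syllable labels back into chunks amounts to starting a new chunk at every B label. The sketch below assumes labels in the "B-S" / "I-S" / "E-S" form described earlier, with a bare "B" for punctuation; the function name `group_chunks` and the output shape are hypothetical.

```python
from typing import List, Tuple


def group_chunks(
    syllables: List[str], tags: List[str]
) -> List[Tuple[str, List[str]]]:
    """Group labeled syllables into (chunk_type, syllables) pairs.
    A bare "B" (punctuation) yields an empty chunk type."""
    chunks: List[Tuple[str, List[str]]] = []
    for syllable, tag in zip(syllables, tags):
        position, _, chunk_type = tag.partition("-")
        if position == "B" or not chunks:
            # A B label (or a stray I/E at sentence start) opens a new chunk.
            chunks.append((chunk_type, [syllable]))
        else:
            chunks[-1][1].append(syllable)
    return chunks
```

This sketch trusts the predicted labels; it does not check that an I or E label has the same type as the chunk it is appended to.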

Based on the syllable-granularity method for identifying Tibetan syntactic chunks described above, a syllable-granularity device for identifying Tibetan syntactic chunks can be built to perform chunk identification on an input corpus and obtain the chunk types. The device is described below:

The syllable-granularity device for identifying Tibetan syntactic chunks according to the invention, as shown in Fig. 2, comprises, connected in sequence, a text preprocessing module, a syntactic marker recognition module, an adhesion form restoration module and a chunk type recognition module;

The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the embodiments described; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention, which is defined by the appended claims and their equivalents.

Claims

1. A syllable-granularity method for identifying Tibetan syntactic chunks, characterized by comprising the following steps:

Step 4: apply the pre-trained syntactic chunk recognition model M₂ to the standard corpus obtained in Step 3 to perform chunk identification and obtain the chunk type recognition results.

2. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 1, characterized in that the text preprocessing comprises:

(1) creating and collecting corpus data, the sources of which include but are not limited to teaching materials, scientific magazines, periodicals, newspapers and Tibetan-language website text;

(2) preprocessing the corpus data to delete non-Tibetan text;

(3) further cutting the text into sentences, so that the corpus is segmented into a sequence of text units each consisting of one sentence.

3. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 1, characterized in that the pre-trained syntactic marker recognition model M₁ is a model obtained by training through the following steps:

(1) annotating S with syntactic markers sentence by sentence, creating the syllable-level syntactic annotation scheme of the corpus;

(2) feeding the annotated corpus into CRFs through a specific feature template and training the syntactic marker CRFs recognition model.

4. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 3, characterized in that the specific feature template consists of the syllable form feature and the syntactic marker.

5. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 4, characterized in that the syllable form feature takes the current syllable and the two adjacent syllables on each side of it, and fills with NULL where there are fewer than two adjacent syllables before or after it.

6. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 1, characterized in that the restoration of the adhesion forms is completed by the following steps:

(1) according to the syntactic marker type, splitting each syllable recognized as an adhesion form by the rule that applies to its type, reducing it to single-syllable text;

(2) after splitting, inserting the syllable delimiter "." between the resulting syllables, which completes the restoration of the original corpus.

7. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 1, characterized in that the pre-trained syntactic chunk recognition model M₂ is a model obtained by training through the following steps:

(1) annotating each sentence with chunk types, creating the chunk annotation scheme of the corpus, that is, assigning each syllable of each sentence to a chunk type;

(2) feeding the annotated corpus into CRFs through a specific feature template and training the chunk type CRFs recognition model.

8. The syllable-granularity method for identifying Tibetan syntactic chunks according to any one of claims 1-7, characterized in that the specific feature template consists of the syllable feature and its chunk type label.

9. The syllable-granularity method for identifying Tibetan syntactic chunks according to claim 7, characterized in that the syllable feature takes the current syllable and the two adjacent syllables on each side of it, together with the syntactic marker of the current syllable.

10. A syllable-granularity device for identifying Tibetan syntactic chunks, characterized by comprising, connected in sequence, a text preprocessing module, a syntactic marker recognition module, an adhesion form restoration module and a chunk type recognition module;

The text preprocessing module deletes non-Tibetan text from the input corpus text and cuts it into sentences that can be used for syntactic chunk identification;

The syntactic marker recognition module applies the pre-trained syntactic marker recognition model M₁ to the sentences output by the text preprocessing module to perform syntactic recognition and obtain the syntactic markers;

The chunk type recognition module applies the pre-trained syntactic chunk recognition model M₂ to the sentences output by the adhesion form restoration module to perform syntactic chunk type recognition, and outputs the recognition results.