CN109255120A - A kind of Laotian segmenting method - Google Patents

A Laotian word segmentation method

Publication number
CN109255120A
CN109255120A CN201810810863.3A CN201810810863A CN109255120A CN 109255120 A CN109255120 A CN 109255120A CN 201810810863 A CN201810810863 A CN 201810810863A CN 109255120 A CN109255120 A CN 109255120A
Authority
CN
China
Prior art keywords
laotian
syllable
refers
consonant
vowel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810810863.3A
Other languages
Chinese (zh)
Inventor
周兰江
何力
张建安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201810810863.3A
Publication of CN109255120A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Laotian word segmentation method comprising the following steps. Step 1: construct a Laotian syllable dictionary according to the Laotian syllable rules. Step 2: using the syllable dictionary and a string matching algorithm, split the Laotian sentence to be segmented into a sequence of syllables, and represent these syllables as word vectors. Step 3: feed these vectors into a trained neural network model, which computes an output sequence. Step 4: decode this sequence with a prediction algorithm to generate a label for each syllable. Step 5: obtain the predicted word segmentation of the Laotian sentence from the labels. The invention uses the text-learning capability of neural networks to learn from an annotated Laotian word segmentation corpus, trains a network model, and combines it with a prediction algorithm to compute the labels of Laotian syllables, thereby segmenting Laotian text into words.

Description

A Laotian word segmentation method
Technical field
The present invention relates to a Laotian word segmentation method and belongs to the technical fields of natural language processing and deep learning.
Background art
Laotian word segmentation is the process of dividing a sequence of Laotian characters into a sequence of individual Laotian words according to certain rules. Unlike sentences in languages such as English and French, Laotian sentences have no explicit delimiters between words. Structurally they resemble Chinese sentences: both are written continuously, character after character. In natural language processing, however, the word is the most basic meaningful linguistic unit, so word segmentation is of great importance. A Laotian word consists of one or more syllables that follow fixed rules. Laotian has 28 vowels: 18 single vowels, 6 compound vowels and 4 special vowels. It has 32 consonants, divided into high, middle and low groups: 12 high consonants, 12 low consonants and 8 middle consonants. In addition, there are 17 rarely used compound consonants. There are 8 final consonants and 6 tones.
Current Laotian word segmentation methods mainly rely on dictionary-based string matching; the matching algorithms mainly include forward maximum matching, reverse maximum matching and bidirectional matching. Such methods split a Laotian sentence into a word sequence by comparing it with a dictionary. Their drawback is that they cannot effectively handle out-of-vocabulary words or segmentation ambiguity. Because of the complexity and diversity of language, no dictionary contains all Laotian words; moreover, almost no machine-readable Laotian dictionary data is available on the internet, so building a Laotian dictionary is difficult and its scale is limited. Segmentation ambiguity arises because dictionary matching merely matches character strings and cannot take the contextual semantic information of the text into account, which leads to segmentation errors.
Summary of the invention
The present invention provides a Laotian word segmentation method for effectively dividing a Laotian sentence into a sequence of words.
The technical solution of the invention is a Laotian word segmentation method comprising the following steps:
Step 1: construct a Laotian syllable dictionary according to the Laotian syllable rules;
Step 2: using the Laotian syllable dictionary and a string matching algorithm, split the Laotian sentence to be segmented into a sequence of syllables, and represent these syllables as word vectors;
Step 3: feed these vectors into a trained neural network model, which computes an output sequence;
Step 4: decode this sequence with a prediction algorithm to generate a label for each syllable;
Step 5: obtain the predicted word segmentation of the Laotian sentence from the labels.
The Laotian syllable rules are the pronunciation rules of Laotian syllables. They consist of the following spelling patterns: consonant plus tone; consonant combined with vowel; consonant combined with vowel plus tone; consonant combined with vowel plus dead final consonant; consonant combined with vowel plus live final consonant; and consonant combined with vowel plus live final consonant plus tone.
A Laotian sentence consists either of Laotian letters alone, or of Laotian letters together with one or more of English letters, Arabic numerals, Laotian special symbols and Laotian punctuation marks.
The string matching algorithm is the dictionary-based reverse maximum matching algorithm.
The word vectors are the vectors of Laotian syllables trained with the word vector model Word2vec; they serve as the input vectors of the neural network model.
The neural network model is a bidirectional long short-term memory (BiLSTM) neural network model.
The prediction algorithm is the Viterbi algorithm.
The labels are the set {B, M, E, S}, where B means the syllable is at the beginning of a word, M means it is in the middle of a word, E means it is at the end of a word, and S means the syllable forms a word by itself.
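The {B, M, E, S} scheme above can be illustrated with a minimal Python sketch that derives the labels from a sentence already split into words, as one might do when preparing an annotated corpus. The function names `tag_word` and `tag_sentence` and the placeholder syllables `s1`, `s2`, ... are hypothetical and do not come from the patent.

```python
def tag_word(syllables):
    """Return the B/M/E/S labels for one word given as a list of syllables."""
    count = len(syllables)
    if count == 0:
        return []
    if count == 1:
        return ["S"]  # a single syllable forms a word by itself
    # first syllable is B, last is E, everything in between is M
    return ["B"] + ["M"] * (count - 2) + ["E"]


def tag_sentence(words):
    """Flatten a list of words (each a list of syllables) into one label list."""
    return [tag for word in words for tag in tag_word(word)]


# Placeholder syllables stand in for real Laotian syllables.
example_words = [["s1", "s2", "s3", "s4"], ["s5"], ["s6", "s7"]]
print(tag_sentence(example_words))
```

Each syllable receives exactly one label, so the label list always has the same length as the syllable sequence.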
The beneficial effects of the invention are as follows: the invention uses the text-learning capability of neural networks to learn from an annotated Laotian word segmentation corpus, trains a network model, and combines it with a prediction algorithm to compute the labels of Laotian syllables, thereby achieving Laotian word segmentation. The segmentation performance of this method is better than that of dictionary-based methods, and it effectively addresses the recognition of out-of-vocabulary words and the ambiguity problem.
Brief description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 shows the Laotian consonants, vowels, tones and final consonants;
Fig. 3 is the table of spelling rules for Laotian vowels followed by final consonants;
Fig. 4 shows part of the constructed Laotian syllable dictionary.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments, but its scope is not limited to what is described.
Embodiment 1: As shown in Figs. 1-4, a Laotian word segmentation method. Step 1: construct a Laotian syllable dictionary according to the Laotian syllable rules. The Laotian syllable rules are the syllable spelling rules formed by the following patterns: consonant plus tone; consonant combined with vowel; consonant combined with vowel plus tone; consonant combined with vowel plus final consonant; and consonant combined with vowel plus final consonant plus tone. The consonants, vowels (single vowels, compound vowels and special vowels), tones and final consonants (dead and live final consonants) are shown in Fig. 2, where x marks the consonant position and can be replaced by any consonant. Three points need particular attention: a consonant combined with a special vowel cannot take a final consonant; a consonant combined with a single or compound vowel cannot take a tone when a dead final consonant is added; and the spelling of vowels followed by a final consonant has certain peculiarities. The specific spelling rules are shown in Fig. 3, where x marks the consonant position and can be replaced by any consonant, and -- indicates that no such syllable exists. A partial screenshot of the resulting Laotian syllable dictionary is shown in Fig. 4, which shows Laotian syllables formed by combining consonants with vowels.
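The template idea of step 1, in which x marks the consonant position, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_syllable_dictionary`, the two consonants and the three vowel templates are chosen for the example, and the restrictions described above (for instance on special vowels and dead final consonants) are not modelled.

```python
from itertools import product


def build_syllable_dictionary(consonants, vowel_templates, placeholder="x"):
    """Fill every vowel template with every consonant and return the syllable set."""
    return {
        template.replace(placeholder, consonant)
        for consonant, template in product(consonants, vowel_templates)
    }


# Tiny illustrative inventory, not the full tables of Fig. 2.
consonants = ["ກ", "ຂ"]
vowel_templates = ["xະ", "xາ", "ເx"]  # "x" marks where the consonant goes

syllables = build_syllable_dictionary(consonants, vowel_templates)
print(sorted(syllables))
```

A fuller version would add templates for tones and final consonants and filter out the combinations that the spelling rules exclude.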
Step 2: Using the syllable dictionary from step 1, the Laotian sentence to be segmented is split into a sequence of syllables with the existing dictionary-based reverse maximum matching algorithm. (A Laotian sentence consists either of Laotian letters alone, or of Laotian letters together with one or more of English letters, Arabic numerals, Laotian special symbols and Laotian punctuation marks; the special symbols include the percent sign, the forward slash and the backslash, among others, and the punctuation marks include the comma, full stop, semicolon, question mark and exclamation mark.) The existing word vector model Word2vec is then used to train a vector for each of these syllables; the vector length used in this method is 128.
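Dictionary-based reverse maximum matching, as used in step 2, can be sketched as below. The sketch scans from the end of the string and, at each position, takes the longest dictionary entry that ends there. The function name, the `max_len` parameter and the fallback to a single character when nothing matches are assumptions made for this illustration; the patent does not specify them.

```python
def reverse_maximum_match(text, dictionary, max_len=None):
    """Split text into dictionary entries, scanning right to left."""
    if max_len is None:
        max_len = max((len(entry) for entry in dictionary), default=1)
    pieces = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until it matches a dictionary
        # entry or only one character is left (fallback for unknown input).
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        pieces.append(text[start:end])
        end = start
    pieces.reverse()  # pieces were collected from the end of the string
    return pieces


# A toy Latin example where the scan direction matters:
print(reverse_maximum_match("abcd", {"ab", "abc", "cd", "d"}))  # ['ab', 'cd']
# Two Laotian syllables built from consonant + vowel:
print(reverse_maximum_match("ກາເຂ", {"ກາ", "ເຂ"}))
```

In the toy example a forward scan would pick "abc" first and leave "d", whereas the reverse scan yields "ab" and "cd".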
Step 3: The input to the trained bidirectional long short-term memory (BiLSTM) neural network model is the 128-dimensional word vectors from step 2. The model is trained on a large-scale segmented and annotated corpus; its hidden layer uses tanh as the activation function and its output layer uses softmax. Its advantage is that it retains longer-distance dependency information and makes more effective use of the context of the syllable sequence. The computation yields an output sequence, which is a 4-dimensional probability matrix giving, for each syllable, the probability of each label.
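The shape of the step 3 output can be illustrated with a small Python sketch of the softmax output layer alone. The BiLSTM itself is not implemented here; the raw score rows are made-up numbers standing in for what a trained network would produce, and the names `softmax` and `tag_probabilities` are hypothetical.

```python
import math

TAGS = ("B", "M", "E", "S")


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def tag_probabilities(score_rows):
    """Map one row of four raw scores per syllable to a label->probability dict."""
    return [dict(zip(TAGS, softmax(row))) for row in score_rows]


# Made-up scores for two syllables; a real model would compute these.
probabilities = tag_probabilities([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
for row in probabilities:
    print(row)
```

Each row has one probability per label, so a sentence of n syllables gives an n-by-4 matrix.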
Step 4: The probability matrix from step 3 is the input to the Viterbi algorithm (an algorithm for finding the most likely sequence of hidden states), which decodes the label sequence. The label probability of the syllable sequence is obtained by summing the transition probabilities between labels and the label probabilities of the syllables, and the labels of the syllable sequence are generated from it. The transition probabilities are taken to be equal. The labels are the set {B, M, E, S}, where B means the syllable is at the beginning of a word, M means it is in the middle of a word, E means it is at the end of a word, and S means the syllable forms a word by itself.
Step 5: The predicted word segmentation of the Laotian sentence is obtained from the label sequence of step 4. As shown in Fig. 1, the label sequence {B, M, M, E, S, B, M, E, B, M, M, E, S} assigned to the syllable sequence yields the segmented Laotian sentence, in which words are separated by spaces.
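Turning a label sequence back into space-separated words, as in step 5, can be sketched in Python as follows. The function name `merge_syllables` and the placeholder syllables `s1` to `s13` are hypothetical; the label sequence is the one quoted in step 5.

```python
def merge_syllables(syllables, tags):
    """Join syllables into words using B/M/E/S labels; words are space-separated."""
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")
    words, current = [], []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):  # E and S both close a word
            words.append("".join(current))
            current = []
    if current:  # a trailing B/M run without a closing E
        words.append("".join(current))
    return " ".join(words)


tags = ["B", "M", "M", "E", "S", "B", "M", "E", "B", "M", "M", "E", "S"]
syllables = [f"s{i}" for i in range(1, 14)]  # placeholders for 13 syllables
print(merge_syllables(syllables, tags))
```

The sketch does not try to repair malformed label sequences; it simply closes a word at every E or S and keeps any unfinished trailing run as a final word.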
The embodiments of the invention have been described in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention.

Claims (8)

1. A Laotian word segmentation method, characterized in that the method comprises the following steps:
Step 1: construct a Laotian syllable dictionary according to the Laotian syllable rules;
Step 2: using the Laotian syllable dictionary and a string matching algorithm, split the Laotian sentence to be segmented into a sequence of syllables, and represent these syllables as word vectors;
Step 3: feed these vectors into a trained neural network model, which computes an output sequence;
Step 4: decode this sequence with a prediction algorithm to generate a label for each syllable;
Step 5: obtain the predicted word segmentation of the Laotian sentence from the labels.
2. The Laotian word segmentation method according to claim 1, characterized in that the Laotian syllable rules are the pronunciation rules of Laotian syllables and consist of the following spelling patterns: consonant plus tone; consonant combined with vowel; consonant combined with vowel plus tone; consonant combined with vowel plus dead final consonant; consonant combined with vowel plus live final consonant; and consonant combined with vowel plus live final consonant plus tone.
3. The Laotian word segmentation method according to claim 1, characterized in that the Laotian sentence consists either of Laotian letters alone, or of Laotian letters together with one or more of English letters, Arabic numerals, Laotian special symbols and Laotian punctuation marks.
4. The Laotian word segmentation method according to claim 1, characterized in that the string matching algorithm is the dictionary-based reverse maximum matching algorithm.
5. The Laotian word segmentation method according to claim 1, characterized in that the word vectors are the vectors of Laotian syllables trained with the word vector model Word2vec and serve as the input vectors of the neural network model.
6. The Laotian word segmentation method according to claim 1, characterized in that the neural network model is a bidirectional long short-term memory neural network model.
7. The Laotian word segmentation method according to claim 1, characterized in that the prediction algorithm is the Viterbi algorithm.
8. The Laotian word segmentation method according to claim 1, characterized in that the labels are the set {B, M, E, S}, where B means the syllable is at the beginning of a word, M means it is in the middle of a word, E means it is at the end of a word, and S means the syllable forms a word by itself.
CN201810810863.3A 2018-07-23 2018-07-23 A kind of Laotian segmenting method Pending CN109255120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810810863.3A CN109255120A (en) 2018-07-23 2018-07-23 A kind of Laotian segmenting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810810863.3A CN109255120A (en) 2018-07-23 2018-07-23 A kind of Laotian segmenting method

Publications (1)

Publication Number Publication Date
CN109255120A true CN109255120A (en) 2019-01-22

Family

ID=65049772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810810863.3A Pending CN109255120A (en) 2018-07-23 2018-07-23 A kind of Laotian segmenting method

Country Status (1)

Country Link
CN (1) CN109255120A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324607A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for word segmentation of Thai texts
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN107168957A (en) * 2017-06-12 2017-09-15 云南大学 A kind of Chinese word cutting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
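Lin Songkai et al.: "Burmese word segmentation method based on convolutional neural networks" (基于卷积神经网络的缅甸语分词方法), Journal of Chinese Information Processing (《中文信息学报》) *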

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992770A (en) * 2019-03-04 2019-07-09 昆明理工大学 A kind of Laotian name entity recognition method based on combination neural net
CN110083824A (en) * 2019-03-18 2019-08-02 昆明理工大学 A kind of Laotian segmenting method based on Multi-Model Combination neural network
CN110096713A (en) * 2019-03-21 2019-08-06 昆明理工大学 A kind of Laotian organization names recognition methods based on SVM-BiLSTM-CRF
CN117436445A (en) * 2023-12-21 2024-01-23 珠海博维网络信息有限公司 Method and system for processing word segmentation of cantonese phrases
CN117436445B (en) * 2023-12-21 2024-04-02 珠海博维网络信息有限公司 Method and system for processing word segmentation of cantonese phrases

Similar Documents

Publication Publication Date Title
CN109255120A (en) A kind of Laotian segmenting method
Elshafei et al. Statistical methods for automatic diacritization of Arabic text
Watson et al. Utilizing character and word embeddings for text normalization with sequence-to-sequence models
CN112507734B (en) Neural machine translation system based on romanized Uygur language
Rathod et al. Hindi and Marathi to English machine transliteration using SVM
Micher Improving coverage of an Inuktitut morphological analyzer using a segmental recurrent neural network
KR20190065665A (en) Apparatus and method for recognizing Korean named entity using deep-learning
Hung Vietnamese diacritics restoration using deep learning approach
Sampson Writing systems: methods for recording language
CN104408037A (en) Tibetan text vector model representation method
Aadil et al. English to Kashmiri transliteration system-a hybrid approach
Akeel et al. ANN and rule based method for english to arabic machine translation.
Somsap et al. Isarn Dharma word segmentation using a statistical approach with named entity recognition
Alabi et al. Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi
Teshome et al. Phoneme-based English-Amharic statistical machine translation
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network
Whitelaw et al. Named entity recognition using a character-based probabilistic approach
Ovi et al. BaNeP: An End-to-End Neural Network Based Model for Bangla Parts-of-Speech Tagging
Helms-Park et al. From proto-writing to multimedia literacy: Scripts and orthographies through the ages
Asahiah Development of a Standard Yorùbá digital text automatic diacritic restoration system
Gutkin et al. Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
Jucksriporn et al. A minimum cluster-based trigram statistical model for Thai syllabification
Ishtiaq et al. English to Urdu Transliteration AS A Major Cause of Pronunciation Error in L1 & L2 Urdu Speakers of English: A Pedagogical Perspective
Namboodiri et al. On using classical poetry structure for Indian language post-processing
Abd Alhadi et al. Automatic Identification of Rhetorical Elements in classical Arabic Poetry.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190122