CN109960782A - Tibetan word segmentation method and device based on a deep neural network - Google Patents

Tibetan word segmentation method and device based on a deep neural network

Info

Publication number
CN109960782A
CN109960782A
Authority
CN
China
Prior art keywords
sequence
Tibetan
syllable
word segmentation
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811614940.4A
Other languages
Chinese (zh)
Inventor
赵生捷
陈梦竹
杨恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201811614940.4A
Publication of CN109960782A
Pending legal status (Critical, Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G06F 40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present invention relates to a Tibetan word segmentation method and device based on a deep neural network, wherein the method comprises: Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence; Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation; Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result. Compared with the prior art, the present invention has the advantage of a high segmentation success rate.

Description

Tibetan word segmentation method and device based on a deep neural network
Technical field
The present invention relates to the field of natural language processing, and more particularly to a Tibetan word segmentation method and device based on a deep neural network.
Background technique
With the development of the information age, research on information processing techniques for written languages has deepened continuously both at home and abroad. Tibetan is an ancient language with a long history, and the classical books and documents recorded in Tibetan are vast. Whether the Tibetan language can enter the information age hinges on successfully solving the problems of Tibetan information processing. Word segmentation is the most basic and essential link in Tibetan information processing: only after a passage of Tibetan text has been segmented can a computer process the resulting word sequences. The quality of Tibetan word segmentation therefore directly affects the application and development of downstream technologies such as Tibetan semantic understanding, Tibetan information retrieval, Tibetan machine translation, and Tibetan speech recognition.
Simply put, word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain standards or rules. In English, spaces serve as natural delimiters between words, so word-level processing is relatively straightforward. Tibetan, like Chinese, has no delimiter of any kind between words, which adds many difficulties to the information processing of such unspaced written languages. For Chinese, many domestic research institutions and scholars have developed mature systems in the natural language processing field, such as the Language Technology Platform (LTP) of Harbin Institute of Technology and FudanNLP, the open-source Java NLP toolkit of Fudan University; these public systems continuously push forward the progress of Chinese language processing. By contrast, the foundations of Tibetan information processing research are relatively weak: although many research articles have been published, very few publicly available systems exist, which to some extent constrains the development of Tibetan information processing.
If word segmentation were performed only by manual checking, it would obviously be a huge, complicated, and time-consuming process. Automatic word segmentation, however, faces several difficulties: 1) resolving segmentation ambiguity; 2) recognizing out-of-vocabulary words (new words); 3) normalizing misspelled and transliterated words; 4) choosing the segmentation granularity. In addition, Tibetan differs from Chinese in having a language-specific problem: the recognition of contracted ("tightened") words. In today's rapidly developing information age, many researchers have begun to use computers, following certain rules, to replace manual Tibetan word segmentation with algorithms. Common Tibetan word segmentation approaches fall into two major classes: 1) string-matching (dictionary-based) methods, such as forward maximum matching, backward maximum matching, and bidirectional maximum matching; these methods are simple to implement, but they depend heavily on the quality of the dictionary, cannot effectively handle segmentation ambiguity or out-of-vocabulary words, and cannot recognize named entities. 2) Sequence-labeling methods based on statistical machine learning models, such as the hidden Markov model (HMM) and the conditional random field (CRF); their accuracy is better than that of the string-matching methods, and they are currently the most popular Tibetan word segmentation methods, but they still handle out-of-vocabulary words poorly, make it inconvenient to add user dictionaries, lose speed, and, in addition, conventional machine learning methods require additional feature extraction.
In recent years deep learning has shown unique advantages in natural language processing, and deep learning methods have also brought new ideas to Chinese word segmentation. We can therefore borrow from deep-learning-based Chinese word segmentation, handle the Tibetan contracted-word phenomenon, and build an automatic word segmentation model suited to Tibetan.
Summary of the invention
The purpose of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a Tibetan word segmentation method and device based on a deep neural network.
The purpose of the present invention can be achieved through the following technical solutions:
A Tibetan word segmentation method based on a deep neural network, comprising:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
The step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
A Tibetan word segmentation device based on a deep neural network, comprising a memory, a processor, and a program stored in the memory and executed by the processor, the processor implementing the following steps when executing the program:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
The step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
Compared with the prior art, the present invention has the following advantages:
1) Contracted words occur very frequently in Tibetan, and the same word plays different roles in different contexts. For a syllable containing a contracted word it is therefore hard to decide whether it should be treated as one basic character or as two, and this strongly affects the subsequent segmentation. For this Tibetan-specific contracted-word phenomenon, we build a contracted-word recognition model based on the conditional random field (CRF) to solve the contracted-word recognition problem.
2) Tibetan word segmentation is performed with a deep neural network model by converting it into a sequence labeling task. The most basic vectorized atomic features serve directly as input, and after multiple nonlinear layers the output layer predicts the label of the current unit. Deep learning has two main advantages here: a) by optimizing the final objective, it can effectively learn representations of the atomic features and their context; b) a deep neural network can more effectively capture long-range sentence information.
3) For the input layer of the deep neural network we use character-level vectors, which to a certain extent effectively alleviates the out-of-vocabulary problem.
Detailed description of the invention
Fig. 1 is a flow diagram of the method of the present invention;
Fig. 2 is a schematic diagram of the Tibetan word segmentation model based on a deep neural network.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. This embodiment is implemented on the premise of the technical solution of the present invention; a detailed implementation and a concrete operating process are given, but the protection scope of the present invention is not limited to the following embodiment.
A Tibetan word segmentation method based on a deep neural network, as shown in Fig. 1, comprises:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein (the tsheg marks between Tibetan syllables) to obtain a syllable sequence;
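Step S1 above can be sketched as a simple split on the Tibetan tsheg mark (U+0F0B); the function name and the treatment of the shad sentence delimiter (U+0F0D) as an additional boundary are illustrative assumptions, not part of the patent:

```python
def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables at the tsheg mark (U+0F0B).

    Treating the shad sentence delimiter (U+0F0D) as a boundary too
    is an illustrative assumption for this sketch.
    """
    TSHEG = "\u0f0b"
    SHAD = "\u0f0d"
    normalized = text.replace(SHAD, TSHEG)
    # Drop empty pieces produced by trailing delimiters.
    return [s for s in normalized.split(TSHEG) if s]

# Example with the two syllables of bkra·shis ("tashi"):
syllables = split_syllables("\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0d")
# syllables == ["\u0f56\u0f40\u0fb2", "\u0f64\u0f72\u0f66"]
```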
Step S2: feeding the obtained syllable sequence into the contracted-word recognition model to obtain the basic unit sequence for segmentation, which specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The syllable sequence is fed into the contracted-word recognition model for contracted-word recognition: the six common contracted syllables can be divided into two major classes according to their function in context, one class serving as case particles and the other not. In this way, contracted-word recognition can be converted into a position labeling task, which can be implemented with the labeling method of the conditional random field (CRF). After contracted-word recognition, a basic unit sequence for segmentation is obtained;
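Steps S22 and S23 can be sketched in plain Python. The per-syllable tags are assumed to come from a contracted-word recognizer (e.g. a CRF tagger, not shown); the tag names "CASE"/"O" and the helper name are illustrative assumptions:

```python
def to_basic_units(syllables, tags):
    """Split a syllable sequence into basic units at case-particle marks.

    `tags` holds one tag per syllable; "CASE" marks a contracted
    syllable acting as a case particle (tag names are illustrative).
    A case particle closes the current basic unit (step S22/S23).
    """
    units, current = [], []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag == "CASE":          # case particle => segmentation mark
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units

# Toy example with romanized placeholder syllables:
units = to_basic_units(["nga", "s", "yi", "ge"], ["O", "CASE", "O", "O"])
# units == [["nga", "s"], ["yi", "ge"]]
```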
Step S3: feeding the basic unit sequence into the Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
First, to illustrate the influence of the contracted-word phenomenon on segmentation, we use the popular "BMES" 4-tag scheme: B marks the beginning of a word, M a middle syllable of a word, E the end of a word, and S a single-syllable word.
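The BMES scheme above can be sketched as a small conversion from a segmented word list to per-syllable tags; the helper name and the romanized toy syllables are illustrative assumptions:

```python
def bmes_tags(words):
    """Convert a list of words (each a list of syllables) to BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                                  # single-syllable word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# A 1-syllable word followed by a 3-syllable word:
tags = bmes_tags([["ka"], ["kha", "ga", "nga"]])
# tags == ["S", "B", "M", "E"]
```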
When a Tibetan word contains no contracted word, it can be labeled according to its syllable count, as shown in Table 1:
Table 1
When a Tibetan word contains a contracted word, as shown in Table 2, the number of labels is greater than or equal to the number of syllables, depending on the role the contracted word plays within the word:
Table 2
As shown in Fig. 2, the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
Here we name the Tibetan word segmentation model the BiLSTM-CRF model; its structure is shown in Fig. 2. In most sequence labeling tasks, a neural network structure depends heavily on the data, and the size and quality of the dataset affect the training result of the model; we therefore combine the structure of an existing linear statistical model with the neural network. Simply put, the softmax at the output layer is combined with a CRF: we use a long short-term memory network (LSTM) to solve the extraction of sequence features, while the CRF additionally makes effective use of sentence-level label information. In this novel hybrid structure, the output is therefore not a set of independent labels but the optimal label sequence.
In addition, in the BiLSTM-CRF model we use not a unidirectional LSTM structure but a bidirectional one. For a Tibetan sentence, a unidirectional LSTM can capture only one-directional information (either the left or the right context) for each unit, so we use a bidirectional LSTM to capture information in both directions (the full context).
The processing of the embedding layer for the segmentation basic units is as follows: this layer produces a character-level vector for each character in the Tibetan sentence as the input to the neural network. Specifically, for Tibetan segmentation we have a character dictionary C of size |C|, extracted from the training set; unknown characters are replaced by an additional symbol (e.g. UNK). Each character c ∈ C can be mapped to a low-dimensional real vector v_c ∈ R^d, where d is the dimension of the vector space. These vectors then form a matrix M ∈ R^{d×|C|}. For each character c, the corresponding v_c is retrieved from a lookup table layer, which can be regarded as a simple projection layer that looks up the character vector through the corresponding index table.
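The lookup-table layer described above can be sketched with NumPy. The dictionary, the dimension d, and the UNK handling follow the description; the random initialization and class name are illustrative assumptions:

```python
import numpy as np

class CharEmbedding:
    """Character lookup table: maps each character to a column of M (d x |C|)."""

    def __init__(self, chars, d, seed=0):
        self.index = {c: i for i, c in enumerate(chars)}
        self.unk = len(self.index)                 # extra slot for UNK
        rng = np.random.default_rng(seed)
        # One extra column holds the unknown-character vector.
        self.M = rng.standard_normal((d, len(chars) + 1))

    def lookup(self, sentence):
        """Return the d x n matrix of vectors for an n-character sentence."""
        cols = [self.index.get(c, self.unk) for c in sentence]
        return self.M[:, cols]

emb = CharEmbedding(list("abc"), d=4)
V = emb.lookup("abz")          # 'z' is unknown and maps to the UNK column
```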
The processing of the BiLSTM network layer is as follows: BiLSTM is a special kind of recurrent neural network, suited to processing and predicting time sequences with relatively long intervals and delays between important events. It introduces several gates to control and update the hidden state and the memory cell; these gates are called the input gate, the output gate, and the forget gate. When a piece of information enters the BiLSTM network, the gates determine according to learned rules whether it is useful: only information that passes the gating is kept, while the rest is forgotten through the forget gate. For a Tibetan sentence (x_1, x_2, …, x_n) containing n characters, where x_t is the vector of the t-th character, we have:
i_t = σ(W_ix x_t + W_ih h_{t-1} + W_ic c_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fc c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t-1} + b_c)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_oc c_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the element-wise sigmoid activation function, ⊙ is element-wise multiplication, the W are weight matrices, and the b are bias vectors.
Since we use a bidirectional LSTM network structure, the context representation of each character i is the concatenation of the forward and backward hidden states, h_i = [h_i^→ ; h_i^←].
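A single LSTM step with the peephole gates above can be sketched in NumPy; the weight shapes, diagonal (vector-valued) peephole terms, and random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One peephole-LSTM step following the gate equations above.

    W is a dict of weights; the previous cell state feeds the
    input/forget/output gates through the peephole terms W_*c,
    applied elementwise (i.e. diagonal peephole matrices).
    """
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + W["ic"] * c_prev + W["bi"])
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + W["fc"] * c_prev + W["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + W["bc"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + W["oc"] * c_prev + W["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d, H = 4, 3                          # input and hidden dimensions (illustrative)
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((H, d)) for k in ("ix", "fx", "cx", "ox")}
W.update({k: rng.standard_normal((H, H)) for k in ("ih", "fh", "ch", "oh")})
W.update({k: rng.standard_normal(H) for k in ("ic", "fc", "oc", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(H), np.zeros(H), W)
```

A bidirectional layer simply runs two such recurrences, one left-to-right and one right-to-left, and concatenates the hidden states per position.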
The output of the BiLSTM is the score matrix P, which serves as the input of the CRF layer for computing s(X, y).
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
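The score s(X, y) and the argmax decoding of step S33 can be sketched in NumPy. P and T are small hand-made matrices here, and exhaustive enumeration stands in for the Viterbi algorithm normally used in a CRF layer (an illustrative simplification); the start/end boundary transitions are omitted:

```python
import itertools
import numpy as np

def score(P, T, y):
    """s(X, y): emission scores P[i, y_i] from the BiLSTM output
    plus transition scores T[y_i, y_{i+1}] between adjacent labels."""
    s = sum(P[i, yi] for i, yi in enumerate(y))
    s += sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def decode(P, T):
    """Step S33: return y* = argmax over all label sequences (brute force)."""
    n, k = P.shape
    return max(itertools.product(range(k), repeat=n),
               key=lambda y: score(P, T, y))

P = np.array([[2.0, 0.1], [0.1, 2.0], [2.0, 0.1]])   # 3 characters, 2 labels
T = np.zeros((2, 2))                                 # neutral transitions
y_star = decode(P, T)
# with neutral transitions, y* follows the per-character maxima: (0, 1, 0)
```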

Claims (8)

1. A Tibetan word segmentation method based on a deep neural network, characterized by comprising:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
2. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that the step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
3. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
4. The Tibetan word segmentation method based on a deep neural network according to claim 3, characterized in that the processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: computing the probability of y and maximizing the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, outputting the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
5. A Tibetan word segmentation device based on a deep neural network, characterized by comprising a memory, a processor, and a program stored in the memory and executed by the processor, the processor implementing the following steps when executing the program:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
6. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that the step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
7. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
8. The Tibetan word segmentation device based on a deep neural network according to claim 7, characterized in that the processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: computing the probability of y and maximizing the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, outputting the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
CN201811614940.4A 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network Pending CN109960782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811614940.4A CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811614940.4A CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Publications (1)

Publication Number Publication Date
CN109960782A true CN109960782A (en) 2019-07-02

Family

ID=67023408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811614940.4A Pending CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Country Status (1)

Country Link
CN (1) CN109960782A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN116245096A (en) * 2022-12-09 2023-06-09 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUIDAN LIU ET AL.: "Tibetan word segmentation as syllable tagging using conditional random field", 25TH PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION *
ZHANG ZIRUI ET AL.: "A Chinese word segmentation method based on the BI-LSTM-CRF model", Journal of Changchun University of Science and Technology (Natural Science Edition) *
LI YACHAO ET AL.: "Research and implementation of a Tibetan automatic word segmentation method based on conditional random fields", Journal of Chinese Information Processing *
CHEN WEI ET AL.: "Automatic keyword extraction based on BiLSTM-CRF", Computer Science *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN116245096A (en) * 2022-12-09 2023-06-09 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list
CN116245096B (en) * 2022-12-09 2024-02-20 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list

Similar Documents

Publication Publication Date Title
Baniata et al. A Neural Machine Translation Model for Arabic Dialects That Utilizes Multitask Learning (MTL).
CN109003601A A cross-language end-to-end speech recognition method for the low-resource Tujia language
CN107577662A A semantic understanding system and method oriented to Chinese text
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
KR102043353B1 (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN112541356B (en) Method and system for recognizing biomedical named entities
Adel et al. Features for factored language models for code-Switching speech.
CN112784604A (en) Entity linking method based on entity boundary network
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
Sun et al. VCWE: visual character-enhanced word embeddings
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN109960782A Tibetan word segmentation method and device based on a deep neural network
Hung Vietnamese diacritics restoration using deep learning approach
Tachibana et al. Accent estimation of Japanese words from their surfaces and romanizations for building large vocabulary accent dictionaries
Ni et al. Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors.
CN110852063B (en) Word vector generation method and device based on bidirectional LSTM neural network
CN110866404B (en) Word vector generation method and device based on LSTM neural network
dos Santos et al. Training state-of-the-art Portuguese POS taggers without handcrafted features
CN112634878A (en) Speech recognition post-processing method and system and related equipment
Olivo et al. CRFPOST: Part-of-Speech Tagger for Filipino Texts using Conditional Random Fields
Samir et al. Training and evaluation of TreeTagger on Amazigh corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190702