CN109960782A - A kind of Tibetan language segmenting method and device based on deep neural network - Google Patents
- Publication number
- CN109960782A (application CN201811614940.4A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- tibetan language
- syllable
- participle
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present invention relates to a Tibetan word segmentation method and device based on a deep neural network. The method comprises: Step S1: receiving raw Tibetan text and cutting it at the syllable delimiters to obtain a syllable sequence; Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units; Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result. Compared with the prior art, the present invention has the advantage of a high segmentation success rate.
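The three steps of the abstract can be sketched as a minimal pipeline. The two model arguments are hypothetical placeholders for the patent's CRF contracted-word recognizer and BiLSTM-CRF segmenter; only the data flow (text → syllables → basic units → labeled units) is taken from the text:

```python
# Sketch of the claimed S1-S3 pipeline. `contracted_word_model` and
# `segmentation_model` are placeholder callables, not the patent's models.
TSHEG = "\u0f0b"  # Tibetan inter-syllable delimiter (tsheg)

def segment(text, contracted_word_model, segmentation_model):
    syllables = [s for s in text.split(TSHEG) if s]  # S1: cut at the delimiter
    units = contracted_word_model(syllables)         # S2: basic segmentation units
    return segmentation_model(units)                 # S3: sequence-labeled units
```

With identity-like stubs, `segment("ཀ་ཁ", lambda s: s, lambda u: [(x, "S") for x in u])` returns each syllable paired with a label.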
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a Tibetan word segmentation method and device based on a deep neural network.
Background technique
With the development of the information age, research on information processing for written and spoken languages continues to deepen both at home and abroad. Tibetan is a time-honored ancient language, and the classical books and documents written in it are vast. Whether Tibetan can stride into the information age hinges on successfully solving the problems of Tibetan information processing. Word segmentation is the most basic and essential link in Tibetan information processing: only after a passage of Tibetan text has been segmented can a computer process the resulting word sequences. The quality of Tibetan word segmentation therefore directly affects the application and development of downstream technologies such as Tibetan semantic understanding, Tibetan information retrieval, Tibetan machine translation, and Tibetan speech recognition.
Simply put, word segmentation is the process of regrouping a continuous string of characters into a word sequence according to certain standards or rules. In English, spaces serve as natural delimiters between words, so word-level processing is relatively straightforward. Tibetan, like Chinese, has no delimiter of any kind between words, which greatly complicates information processing for such unspaced written languages. For Chinese, many domestic research institutions and scholars have developed mature natural language processing systems, such as the Language Technology Platform (LTP) of Harbin Institute of Technology and FudanNLP, Fudan University's open-source Java NLP toolkit; these publicly available systems continually push Chinese language processing forward. Tibetan information processing research, by contrast, rests on a relatively weak foundation: although many research articles have been published, very few publicly available systems exist, which constrains the development of Tibetan information processing to a certain extent.
If segmentation were carried out only by manual checking, it would obviously be a huge and complicated process, costing much time and effort. Automatic segmentation, however, still faces several difficult problems: 1) resolution of segmentation ambiguity; 2) recognition of out-of-vocabulary words (neologisms); 3) normalization of misspelled words and transliterations; 4) segmentation granularity. In addition, Tibetan differs from Chinese in having a distinctive linguistic problem of its own: contracted-word recognition. In today's rapidly developing information age, many researchers have begun to use computers with rule-based algorithms to replace manual Tibetan segmentation. Tibetan word segmentation methods fall into two broad classes. 1) String-matching (dictionary-based) methods, such as forward maximum matching, backward maximum matching, and bidirectional maximum matching. Such methods are simple to implement, but they depend heavily on dictionary quality and cannot effectively handle segmentation ambiguity, out-of-vocabulary words, or named entity recognition. 2) Sequence-labeling methods based on statistical machine learning models, such as the hidden Markov model (HMM) and the conditional random field (CRF). The accuracy of such methods exceeds that of string matching, and they are the most popular Tibetan segmentation methods at this stage, but they still handle out-of-vocabulary words poorly, make it inconvenient to add user dictionaries, lose speed, and, as conventional machine learning methods, require additional feature extraction.
In recent years deep learning has shown unique advantages in natural language processing, and deep learning methods have brought new ideas to Chinese word segmentation. We can therefore draw on deep-learning-based Chinese segmentation methods, handle the Tibetan contracted-word phenomenon, and form an automatic segmentation model suited to Tibetan.
Summary of the invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a Tibetan word segmentation method and device based on a deep neural network.
The object of the present invention can be achieved through the following technical solutions:
A Tibetan word segmentation method based on a deep neural network, comprising:
Step S1: receiving raw Tibetan text, and cutting it at the syllable delimiters to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result.
Step S2 specifically includes:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging, for each contracted word, whether it is a case particle, and if so, treating its position as a segmentation point;
Step S23: segmenting the syllable sequence at the obtained segmentation points to obtain the sequence of basic segmentation units.
The Tibetan word segmentation model based on a deep neural network includes:
a vector embedding layer, for converting each character of the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix based on the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, the unit sequence annotated with sequence labels as the segmentation result.
The processing of the CRF layer specifically includes:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:

s(X, y) = Σ_{i=0..n} T_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors obtained from the characters of X; T is the transition score matrix, with T_{y_i, y_{i+1}} the score of moving from label y_i to label y_{i+1}; P is the output score matrix of the BiLSTM, with P_{i, y_i} the score of assigning the i-th character the label y_i; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is a predicted label sequence for X;
Step S32: computing the probability of y, and maximizing the log-probability of the correct label sequence:

p(y | X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))

where p(y | X) is the probability of y, ỹ is one possible label sequence for X, and Y_X is the set of all possible label sequences for X;
Step S33: at decoding time, predicting the sequence with the largest score as the output sequence:

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)

where y* is the predicted sequence.
A Tibetan word segmentation device based on a deep neural network, including a memory, a processor, and a program stored in the memory and executed by the processor, the processor performing the following steps when executing the program:
Step S1: receiving raw Tibetan text, and cutting it at the syllable delimiters to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result.
Step S2 specifically includes:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging, for each contracted word, whether it is a case particle, and if so, treating its position as a segmentation point;
Step S23: segmenting the syllable sequence at the obtained segmentation points to obtain the sequence of basic segmentation units.
The Tibetan word segmentation model based on a deep neural network includes:
a vector embedding layer, for converting each character of the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix based on the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, the unit sequence annotated with sequence labels as the segmentation result.
The processing of the CRF layer specifically includes:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:

s(X, y) = Σ_{i=0..n} T_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors obtained from the characters of X; T is the transition score matrix, with T_{y_i, y_{i+1}} the score of moving from label y_i to label y_{i+1}; P is the output score matrix of the BiLSTM, with P_{i, y_i} the score of assigning the i-th character the label y_i; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is a predicted label sequence for X;
Step S32: computing the probability of y, and maximizing the log-probability of the correct label sequence:

p(y | X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))

where p(y | X) is the probability of y, ỹ is one possible label sequence for X, and Y_X is the set of all possible label sequences for X;
Step S33: at decoding time, predicting the sequence with the largest score as the output sequence:

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)

where y* is the predicted sequence.
Compared with the prior art, the invention has the following advantages:
1) Contracted words occur very frequently in Tibetan, and the same word plays different roles in different contexts. For a syllable containing a contracted word, it is therefore hard to decide whether it should be classified as one base character or as two, and this strongly affects the subsequent segmentation. For this contracted-word phenomenon peculiar to Tibetan, a contracted-word recognition model based on a conditional random field (CRF) is used, solving the contracted-word recognition problem.
2) Tibetan word segmentation is performed with a deep neural network model by casting it as a sequence labeling task. The most basic vectorized atomic features serve directly as input, and after multiple nonlinear layers the output layer predicts the label of the current unit. Deep learning has two main advantages here: a) by optimizing the final objective it effectively learns representations of the atomic features and their context; b) a deep neural network captures long-range sentence information more effectively.
3) For the input layer of the deep neural network, character-level vectors are used, which to a certain extent effectively alleviates the out-of-vocabulary problem.
Detailed description of the invention
Fig. 1 is the flow diagram of the method for the present invention;
Fig. 2 is the schematic diagram of the Tibetan language participle model based on deep neural network.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and concrete operating process are given, but the protection scope of the present invention is not limited to the following embodiment.
A Tibetan word segmentation method based on a deep neural network, as shown in Fig. 1, comprises:
Step S1: receiving raw Tibetan text, and cutting it at the syllable delimiters (the marks between Tibetan syllables) to obtain a syllable sequence;
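Step S1 amounts to cutting at the Tibetan inter-syllable mark. A minimal sketch (the helper name is ours, not from the patent) that splits at the tsheg (U+0F0B) and also drops the shad sentence mark (U+0F0D):

```python
import re

TSHEG, SHAD = "\u0f0b", "\u0f0d"  # ་ inter-syllable mark, ། sentence mark

def split_syllables(text: str) -> list[str]:
    """Return the syllable sequence of a raw Tibetan string (step S1)."""
    # Treat runs of tsheg/shad as cut points and drop empty pieces.
    return [s for s in re.split(f"[{TSHEG}{SHAD}]+", text) if s]

split_syllables("བོད་ཡིག་ཚིག")  # three syllables: ['བོད', 'ཡིག', 'ཚིག']
```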
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units, which specifically includes:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging, for each contracted word, whether it is a case particle, and if so, treating its position as a segmentation point;
Step S23: segmenting the syllable sequence at the obtained segmentation points to obtain the sequence of basic segmentation units.
The syllable sequence is fed into the contracted-word recognition model for contracted-word recognition: the six common contracted words can be divided into two broad classes according to their function in context, one class serving as case particles and the other not. In this way, contracted-word recognition can be converted into a character-position tagging task, which can be implemented with the tagging method of a conditional random field (CRF). After contracted-word recognition, the sequence of basic segmentation units is obtained;
Step S3: feeding the basic unit sequence into the Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result.
First, to illustrate the influence of the contracted-word phenomenon on segmentation, the popular "BMES" 4-tag scheme is used here: B marks the beginning of a word, M a middle part of a word, E the end of a word, and S a single-syllable word.
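As a concrete illustration of the "BMES" scheme, a small helper (our own, not from the patent) that assigns one tag per basic unit given a gold-standard segmentation:

```python
def bmes_tags(words):
    """One B/M/E/S tag per basic unit, given a sentence as a list of words,
    where each word is a list of its basic units (syllables)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                    # single-unit word
        else:
            tags.append("B")                    # word beginning
            tags.extend("M" * (len(word) - 2))  # middle units, if any
            tags.append("E")                    # word end
    return tags

# a 1-unit word, a 3-unit word, and a 2-unit word:
bmes_tags([["a"], ["b", "c", "d"], ["e", "f"]])
# ['S', 'B', 'M', 'E', 'B', 'E']
```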
When a Tibetan word contains no contracted word, it can be tagged according to its syllable count, as shown in Table 1:
Table 1
When a Tibetan word contains a contracted word, as shown in Table 2, the number of tags is greater than or equal to the number of syllables, depending on the effect of the contracted word within the word:
Table 2
As shown in Fig. 2, the Tibetan word segmentation model based on a deep neural network includes:
a vector embedding layer, for converting each character of the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix based on the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, the unit sequence annotated with sequence labels as the segmentation result.
Here the Tibetan word segmentation model is named the BiLSTM-CRF model; its structure is shown in Fig. 2. In most sequence labeling tasks, a neural network structure depends heavily on the data, and the size and quality of the dataset affect how well the model trains. We therefore combine the structure of an existing linear statistical model with a neural network; put simply, the softmax at the output layer is combined with a CRF. A long short-term memory network (LSTM) solves the sequence feature extraction problem, while the CRF additionally makes effective use of sentence-level label information. In this hybrid structure, the output is therefore not an independent label at each position but an optimal label sequence.
Moreover, the BiLSTM-CRF model uses not a unidirectional LSTM but a bidirectional one: for a Tibetan sentence, a unidirectional LSTM can capture only one-directional information for each unit (either the preceding or the following context), so a bidirectional LSTM is used to capture information in both directions (the full context).
The processing of the basic-unit embedding layer is as follows: this layer produces a character-level vector for each character of the Tibetan sentence, which serves as the input to the neural network. Specifically, for Tibetan segmentation we have a character dictionary C of size |C|, extracted from the training set; unknown characters are replaced by an additional symbol (such as UNK). Each character c ∈ C is mapped to a low-dimensional real vector v_c ∈ R^d, where d is the dimension of the vector space. These vectors then form a matrix M ∈ R^{|C|×d}. For each character c, the corresponding v_c is retrieved from a lookup-table layer, which can be regarded as a simple projection layer that looks up the character vector via the corresponding index table.
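The lookup-table layer described above can be illustrated in a few lines of NumPy. The dictionary contents and the dimension d = 4 here are made-up toy values, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                         # toy embedding dimension
vocab = {"UNK": 0, "ཀ": 1, "ཁ": 2, "ག": 3}    # toy dictionary C with UNK entry
M = rng.standard_normal((len(vocab), d))      # lookup matrix M in R^{|C| x d}

def embed(units):
    """Map each basic unit to its d-dimensional row of M (UNK row if unseen)."""
    return M[[vocab.get(u, vocab["UNK"]) for u in units]]

X = embed(["ཀ", "ཞ"])  # "ཞ" is not in the dictionary, so it gets the UNK vector
```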
The processing of the BiLSTM network layer is as follows. BiLSTM is a special kind of recurrent neural network, suited to processing and predicting events in a time series that are separated by relatively long intervals and delays. It introduces several gates to control and update the hidden state and the memory cell; these gates are called the input gate, the output gate, and the forget gate. When a piece of information enters the BiLSTM network, rules determine whether it is useful: only information that passes the gating is retained, while information that does not is forgotten through the forget gate. For a Tibetan sentence (x_1, x_2, …, x_n) of n characters, where x_t is the vector of the t-th character:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_ic c_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fc c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t-1} + b_c)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_oc c_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the element-wise sigmoid activation function, ⊙ is element-wise multiplication, the W are weight matrices, and the b are bias vectors.
Since a bidirectional LSTM network structure is used, the context of each character i is represented by concatenating the forward and backward hidden states, h_i = [h_i^f ; h_i^b].
The output of the BiLSTM is P, which serves as the input of the CRF layer for computing s(X, y).
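The gate equations above can be written out directly in NumPy. This is a from-scratch sketch, not the patent's implementation; the peephole weights W_ic, W_fc, W_oc are taken as diagonal (as in Graves-style peephole LSTMs), so they act element-wise on the cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step mirroring the i/f/c/o equations above."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + W["ic"] * c_prev + W["bi"])
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + W["fc"] * c_prev + W["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + W["bc"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + W["oc"] * c_prev + W["bo"])
    return o_t * np.tanh(c_t), c_t           # h_t, c_t

def init_weights(dim, hidden, rng):
    """Random small weights for the four gate blocks i, f, c, o."""
    W = {}
    for g in "ifco":
        W[g + "x"] = 0.1 * rng.standard_normal((hidden, dim))
        W[g + "h"] = 0.1 * rng.standard_normal((hidden, hidden))
        W["b" + g] = np.zeros(hidden)
    for g in ("ic", "fc", "oc"):             # diagonal peephole vectors
        W[g] = np.zeros(hidden)
    return W

def bilstm(xs, W_fwd, W_bwd, hidden):
    """Context of position i is the concatenation [h_i_forward ; h_i_backward]."""
    def run(seq, W):
        h, c, out = np.zeros(hidden), np.zeros(hidden), []
        for x in seq:
            h, c = lstm_step(x, h, c, W)
            out.append(h)
        return out
    fwd = run(xs, W_fwd)
    bwd = run(xs[::-1], W_bwd)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

In the full model, a linear layer would project each 2·hidden context vector to the k label scores forming row i of P.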
CRF layers for the treatment of process specifically includes:
Step S31: for a Tibetan language sentence X (x1,x2,…,xn), obtain its whole score:
Wherein: s (X, y) is whole score, x1,x2,…,nFor the vector that each character in Tibetan language sentence X is converted to, T is
Conversion fraction matrix, P are the output score matrix of BiLSTM,To mark yiIt is transformed into label yi+1Score,For
I-th of character is denoted as yiThe score of a label, n are the character number in Tibetan language sentence X, and y is the predictive marker sequences y of X
=(y1,y2,…,yn);
Step S32: calculating the probability of y, and obtains maximizing the log probability of correct labeling sequence:
Wherein: p (y | X) is the probability of y,For the possible flag sequence of one of which of X, YxFor all possible label sequences of X
Column;
Step S33: when decoding, we are output sequence by the sequence prediction of largest score is obtained:
Wherein: y*For forecasting sequence.
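A NumPy sketch of s(X, y) and the argmax decoding of step S33 via a standard Viterbi pass. For brevity this sketch omits the start/end boundary transitions of the i = 0..n sum:

```python
import numpy as np

def crf_score(P, T, y):
    """s(X, y): emission scores P[i, y_i] plus transition scores T[y_i, y_{i+1}].
    (Boundary transitions from a start/to an end state are left out here.)"""
    s = P[np.arange(len(y)), y].sum()
    return s + sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))

def viterbi(P, T):
    """y* = argmax_y s(X, y) by dynamic programming over the k labels."""
    n, k = P.shape
    delta = P[0].copy()                          # best score ending in each label
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = delta[:, None] + T + P[i][None, :]  # k x k candidate scores
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):                # follow back-pointers
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```

Training maximizes log p(y | X); its normalizer log Σ exp(s(X, ỹ)) is computed with the same recursion, using logsumexp in place of max.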
Claims (8)
1. A Tibetan word segmentation method based on a deep neural network, characterized by comprising:
Step S1: receiving raw Tibetan text, and cutting it at the syllable delimiters to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result.
2. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that said Step S2 specifically includes:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging, for each contracted word, whether it is a case particle, and if so, treating its position as a segmentation point;
Step S23: segmenting the syllable sequence at the obtained segmentation points to obtain the sequence of basic segmentation units.
3. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that the Tibetan word segmentation model based on a deep neural network includes:
a vector embedding layer, for converting each character of the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix based on the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, the unit sequence annotated with sequence labels as the segmentation result.
4. The Tibetan word segmentation method based on a deep neural network according to claim 3, characterized in that the processing of the CRF layer specifically includes:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:

s(X, y) = Σ_{i=0..n} T_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors obtained from the characters of X; T is the transition score matrix, with T_{y_i, y_{i+1}} the score of moving from label y_i to label y_{i+1}; P is the output score matrix of the BiLSTM, with P_{i, y_i} the score of assigning the i-th character the label y_i; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is a predicted label sequence for X;
Step S32: computing the probability of y, and maximizing the log-probability of the correct label sequence:

p(y | X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))

where p(y | X) is the probability of y, ỹ is one possible label sequence for X, and Y_X is the set of all possible label sequences for X;
Step S33: at decoding time, predicting the sequence with the largest score as the output sequence:

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)

where y* is the predicted sequence.
5. A Tibetan word segmentation device based on a deep neural network, characterized by including a memory, a processor, and a program stored in the memory and executed by the processor, the processor performing the following steps when executing the program:
Step S1: receiving raw Tibetan text, and cutting it at the syllable delimiters to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the sequence of basic segmentation units;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network, which finally outputs the unit sequence annotated with sequence labels as the segmentation result.
6. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that said Step S2 specifically includes:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging, for each contracted word, whether it is a case particle, and if so, treating its position as a segmentation point;
Step S23: segmenting the syllable sequence at the obtained segmentation points to obtain the sequence of basic segmentation units.
7. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that the Tibetan word segmentation model based on a deep neural network includes:
a vector embedding layer, for converting each character of the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix based on the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, the unit sequence annotated with sequence labels as the segmentation result.
8. The Tibetan word segmentation device based on a deep neural network according to claim 7, characterized in that the processing of the CRF layer specifically includes:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:

s(X, y) = Σ_{i=0..n} T_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}

where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors obtained from the characters of X; T is the transition score matrix, with T_{y_i, y_{i+1}} the score of moving from label y_i to label y_{i+1}; P is the output score matrix of the BiLSTM, with P_{i, y_i} the score of assigning the i-th character the label y_i; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is a predicted label sequence for X;
Step S32: computing the probability of y, and maximizing the log-probability of the correct label sequence:

p(y | X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))

where p(y | X) is the probability of y, ỹ is one possible label sequence for X, and Y_X is the set of all possible label sequences for X;
Step S33: at decoding time, predicting the sequence with the largest score as the output sequence:

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)

where y* is the predicted sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811614940.4A CN109960782A (en) | 2018-12-27 | 2018-12-27 | A kind of Tibetan language segmenting method and device based on deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811614940.4A CN109960782A (en) | 2018-12-27 | 2018-12-27 | A kind of Tibetan language segmenting method and device based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109960782A true CN109960782A (en) | 2019-07-02 |
Family
ID=67023408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811614940.4A Pending CN109960782A (en) | 2018-12-27 | 2018-12-27 | A kind of Tibetan language segmenting method and device based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109960782A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN107168955A (en) * | 2017-05-23 | 2017-09-15 | 南京大学 | Word insertion and the Chinese word cutting method of neutral net using word-based context |
Non-Patent Citations (4)
Title |
---|
HUIDAN LIU ET AL.: "Tibetan word segmentation as syllable tagging using conditional random field", 25th Pacific Asia Conference on Language, Information and Computation |
Zhang Zirui et al.: "A Chinese word segmentation method based on the BI-LSTM-CRF model", Journal of Changchun University of Science and Technology (Natural Science Edition) |
Li Yachao et al.: "Research and implementation of a Tibetan automatic word segmentation method based on conditional random fields", Journal of Chinese Information Processing |
Chen Wei et al.: "Automatic keyword extraction based on BiLSTM-CRF", Computer Science |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110489750A (en) * | 2019-08-12 | 2019-11-22 | 昆明理工大学 | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF |
CN116245096A (en) * | 2022-12-09 | 2023-06-09 | 西南民族大学 | Tibetan word segmentation evaluation set construction method based on local word list |
CN116245096B (en) * | 2022-12-09 | 2024-02-20 | 西南民族大学 | Tibetan word segmentation evaluation set construction method based on local word list |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190702 |