CN109960782A - Tibetan word segmentation method and device based on a deep neural network - Google Patents

Tibetan word segmentation method and device based on a deep neural network

Info

Publication number
CN109960782A
CN109960782A
Authority
CN
China
Prior art keywords
sequence
Tibetan
syllable
word segmentation
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811614940.4A
Other languages
Chinese (zh)
Inventor
赵生捷
陈梦竹
杨恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201811614940.4A
Publication of CN109960782A
Pending legal status (Critical, Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G06F 40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present invention relates to a Tibetan word segmentation method and device based on a deep neural network, wherein the method comprises: Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence; Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation; Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result. Compared with the prior art, the present invention has the advantage of a high segmentation success rate.

Description

Tibetan word segmentation method and device based on a deep neural network
Technical field
The present invention relates to the field of natural language processing, and more particularly to a Tibetan word segmentation method and device based on a deep neural network.
Background technique
With the development of the information age, research on information processing techniques for written languages has deepened continuously both at home and abroad. Tibetan is an ancient language with a long history, and the classical books and documents recorded in Tibetan are vast. Whether the Tibetan language can enter the information age hinges on successfully solving the problems of Tibetan information processing. Word segmentation is the most basic and essential link in Tibetan information processing: only after a passage of Tibetan text has been segmented can a computer process the resulting word sequences. The quality of Tibetan word segmentation therefore directly affects the application and development of downstream technologies such as Tibetan semantic understanding, Tibetan information retrieval, Tibetan machine translation, and Tibetan speech recognition.
Simply put, word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain standards or rules. In English, spaces serve as natural delimiters between words, so word-level processing is relatively straightforward. Tibetan, like Chinese, has no delimiter of any kind between words, which adds many difficulties to the information processing of such unspaced written languages. For Chinese, many domestic research institutions and scholars have developed mature systems in the natural language processing field, such as the Language Technology Platform (LTP) of Harbin Institute of Technology and FudanNLP, the open-source Java NLP toolkit of Fudan University; these public systems continuously push forward the progress of Chinese language processing. By contrast, the foundations of Tibetan information processing research are relatively weak: although many research articles have been published, very few publicly available systems exist, which to some extent constrains the development of Tibetan information processing.
If word segmentation were performed only by manual checking, it would obviously be a huge, complicated, and time-consuming process. Automatic word segmentation, however, faces several difficulties: 1) resolving segmentation ambiguity; 2) recognizing out-of-vocabulary words (new words); 3) normalizing misspelled and transliterated words; 4) choosing the segmentation granularity. In addition, Tibetan differs from Chinese in having a language-specific problem: the recognition of contracted ("tightened") words. In today's rapidly developing information age, many researchers have begun to use computers, following certain rules, to replace manual Tibetan word segmentation with algorithms. Common Tibetan word segmentation approaches fall into two major classes: 1) string-matching (dictionary-based) methods, such as forward maximum matching, backward maximum matching, and bidirectional maximum matching; these methods are simple to implement, but they depend heavily on the quality of the dictionary, cannot effectively handle segmentation ambiguity or out-of-vocabulary words, and cannot recognize named entities. 2) Sequence-labeling methods based on statistical machine learning models, such as the hidden Markov model (HMM) and the conditional random field (CRF); their accuracy is better than that of the string-matching methods, and they are currently the most popular Tibetan word segmentation methods, but they still handle out-of-vocabulary words poorly, make it inconvenient to add user dictionaries, lose speed, and, in addition, conventional machine learning methods require additional feature extraction.
In recent years deep learning has shown unique advantages in natural language processing, and deep learning methods have also brought new ideas to Chinese word segmentation. We can therefore borrow from deep-learning-based Chinese word segmentation, handle the Tibetan contracted-word phenomenon, and build an automatic word segmentation model suited to Tibetan.
Summary of the invention
The purpose of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a Tibetan word segmentation method and device based on a deep neural network.
The purpose of the present invention can be achieved through the following technical solutions:
A Tibetan word segmentation method based on a deep neural network, comprising:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
The step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
A Tibetan word segmentation device based on a deep neural network, comprising a memory, a processor, and a program stored in the memory and executed by the processor, the processor implementing the following steps when executing the program:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
The step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
Compared with the prior art, the present invention has the following advantages:
1) Contracted words occur very frequently in Tibetan, and the same word plays different roles in different contexts. For a syllable containing a contracted word it is therefore hard to decide whether it should be treated as one basic character or as two, and this strongly affects the subsequent segmentation. For this Tibetan-specific contracted-word phenomenon, we build a contracted-word recognition model based on the conditional random field (CRF) to solve the contracted-word recognition problem.
2) Tibetan word segmentation is performed with a deep neural network model by converting it into a sequence labeling task. The most basic vectorized atomic features serve directly as input, and after multiple nonlinear layers the output layer predicts the label of the current unit. Deep learning has two main advantages here: a) by optimizing the final objective, it can effectively learn representations of the atomic features and their context; b) a deep neural network can more effectively capture long-range sentence information.
3) For the input layer of the deep neural network we use character-level vectors, which to a certain extent effectively alleviates the out-of-vocabulary problem.
Detailed description of the invention
Fig. 1 is a flow diagram of the method of the present invention;
Fig. 2 is a schematic diagram of the Tibetan word segmentation model based on a deep neural network.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. This embodiment is implemented on the premise of the technical solution of the present invention; a detailed implementation and a concrete operating process are given, but the protection scope of the present invention is not limited to the following embodiment.
A Tibetan word segmentation method based on a deep neural network, as shown in Fig. 1, comprises:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein (the tsheg marks between Tibetan syllables) to obtain a syllable sequence;
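Step S1 above can be sketched as a simple split on the Tibetan tsheg mark (U+0F0B); the function name and the treatment of the shad sentence delimiter (U+0F0D) as an additional boundary are illustrative assumptions, not part of the patent:

```python
def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables at the tsheg mark (U+0F0B).

    Treating the shad sentence delimiter (U+0F0D) as a boundary too
    is an illustrative assumption for this sketch.
    """
    TSHEG = "\u0f0b"
    SHAD = "\u0f0d"
    normalized = text.replace(SHAD, TSHEG)
    # Drop empty pieces produced by trailing delimiters.
    return [s for s in normalized.split(TSHEG) if s]

# Example with the two syllables of bkra·shis ("tashi"):
syllables = split_syllables("\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0d")
# syllables == ["\u0f56\u0f40\u0fb2", "\u0f64\u0f72\u0f66"]
```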
Step S2: feeding the obtained syllable sequence into the contracted-word recognition model to obtain the basic unit sequence for segmentation, which specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
The syllable sequence is fed into the contracted-word recognition model for contracted-word recognition: the six common contracted syllables can be divided into two major classes according to their function in context, one class serving as case particles and the other not. In this way, contracted-word recognition can be converted into a position labeling task, which can be implemented with the labeling method of the conditional random field (CRF). After contracted-word recognition, a basic unit sequence for segmentation is obtained;
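Steps S22 and S23 can be sketched in plain Python. The per-syllable tags are assumed to come from a contracted-word recognizer (e.g. a CRF tagger, not shown); the tag names "CASE"/"O" and the helper name are illustrative assumptions:

```python
def to_basic_units(syllables, tags):
    """Split a syllable sequence into basic units at case-particle marks.

    `tags` holds one tag per syllable; "CASE" marks a contracted
    syllable acting as a case particle (tag names are illustrative).
    A case particle closes the current basic unit (step S22/S23).
    """
    units, current = [], []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag == "CASE":          # case particle => segmentation mark
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units

# Toy example with romanized placeholder syllables:
units = to_basic_units(["nga", "s", "yi", "ge"], ["O", "CASE", "O", "O"])
# units == [["nga", "s"], ["yi", "ge"]]
```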
Step S3: feeding the basic unit sequence into the Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
First, to illustrate the influence of the contracted-word phenomenon on segmentation, we use the popular "BMES" 4-tag scheme: B marks the beginning of a word, M a middle syllable of a word, E the end of a word, and S a single-syllable word.
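The BMES scheme above can be sketched as a small conversion from a segmented word list to per-syllable tags; the helper name and the romanized toy syllables are illustrative assumptions:

```python
def bmes_tags(words):
    """Convert a list of words (each a list of syllables) to BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                                  # single-syllable word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# A 1-syllable word followed by a 3-syllable word:
tags = bmes_tags([["ka"], ["kha", "ga", "nga"]])
# tags == ["S", "B", "M", "E"]
```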
When a Tibetan word contains no contracted word, it can be labeled according to its syllable count, as shown in Table 1:
Table 1
When a Tibetan word contains a contracted word, as shown in Table 2, the number of labels is greater than or equal to the number of syllables, depending on the role the contracted word plays within the word:
Table 2
As shown in Fig. 2, the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
Here we name the Tibetan word segmentation model the BiLSTM-CRF model; its structure is shown in Fig. 2. In most sequence labeling tasks, a neural network structure depends heavily on the data, and the size and quality of the dataset affect the training result of the model; we therefore combine the structure of an existing linear statistical model with the neural network. Simply put, the softmax at the output layer is combined with a CRF: we use a long short-term memory network (LSTM) to solve the extraction of sequence features, while the CRF additionally makes effective use of sentence-level label information. In this novel hybrid structure, the output is therefore not a set of independent labels but the optimal label sequence.
In addition, in the BiLSTM-CRF model we use not a unidirectional LSTM structure but a bidirectional one. For a Tibetan sentence, a unidirectional LSTM can capture only one-directional information (either the left or the right context) for each unit, so we use a bidirectional LSTM to capture information in both directions (the full context).
The processing of the embedding layer for the segmentation basic units is as follows: this layer produces a character-level vector for each character in the Tibetan sentence as the input to the neural network. Specifically, for Tibetan segmentation we have a character dictionary C of size |C|, extracted from the training set; unknown characters are replaced by an additional symbol (e.g. UNK). Each character c ∈ C can be mapped to a low-dimensional real vector v_c ∈ R^d, where d is the dimension of the vector space. These vectors then form a matrix M ∈ R^{d×|C|}. For each character c, the corresponding v_c is retrieved from a lookup table layer, which can be regarded as a simple projection layer that looks up the character vector through the corresponding index table.
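The lookup-table layer described above can be sketched with NumPy. The dictionary, the dimension d, and the UNK handling follow the description; the random initialization and class name are illustrative assumptions:

```python
import numpy as np

class CharEmbedding:
    """Character lookup table: maps each character to a column of M (d x |C|)."""

    def __init__(self, chars, d, seed=0):
        self.index = {c: i for i, c in enumerate(chars)}
        self.unk = len(self.index)                 # extra slot for UNK
        rng = np.random.default_rng(seed)
        # One extra column holds the unknown-character vector.
        self.M = rng.standard_normal((d, len(chars) + 1))

    def lookup(self, sentence):
        """Return the d x n matrix of vectors for an n-character sentence."""
        cols = [self.index.get(c, self.unk) for c in sentence]
        return self.M[:, cols]

emb = CharEmbedding(list("abc"), d=4)
V = emb.lookup("abz")          # 'z' is unknown and maps to the UNK column
```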
The processing of the BiLSTM network layer is as follows: BiLSTM is a special kind of recurrent neural network, suited to processing and predicting time sequences with relatively long intervals and delays between important events. It introduces several gates to control and update the hidden state and the memory cell; these gates are called the input gate, the output gate, and the forget gate. When a piece of information enters the BiLSTM network, the gates determine according to learned rules whether it is useful: only information that passes the gating is kept, while the rest is forgotten through the forget gate. For a Tibetan sentence (x_1, x_2, …, x_n) containing n characters, where x_t is the vector of the t-th character, we have:
i_t = σ(W_ix x_t + W_ih h_{t-1} + W_ic c_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fc c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t-1} + b_c)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_oc c_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the element-wise sigmoid activation function, ⊙ is element-wise multiplication, the W are weight matrices, and the b are bias vectors.
Since we use a bidirectional LSTM network structure, the context representation of each character i is the concatenation of the forward and backward hidden states, h_i = [h_i^→ ; h_i^←].
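A single LSTM step with the peephole gates above can be sketched in NumPy; the weight shapes, diagonal (vector-valued) peephole terms, and random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One peephole-LSTM step following the gate equations above.

    W is a dict of weights; the previous cell state feeds the
    input/forget/output gates through the peephole terms W_*c,
    applied elementwise (i.e. diagonal peephole matrices).
    """
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + W["ic"] * c_prev + W["bi"])
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + W["fc"] * c_prev + W["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + W["bc"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + W["oc"] * c_prev + W["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d, H = 4, 3                          # input and hidden dimensions (illustrative)
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((H, d)) for k in ("ix", "fx", "cx", "ox")}
W.update({k: rng.standard_normal((H, H)) for k in ("ih", "fh", "ch", "oh")})
W.update({k: rng.standard_normal(H) for k in ("ic", "fc", "oc", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(H), np.zeros(H), W)
```

A bidirectional layer simply runs two such recurrences, one left-to-right and one right-to-left, and concatenates the hidden states per position.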
The output of the BiLSTM is the score matrix P, which serves as the input of the CRF layer for computing s(X, y).
The processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), compute its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: compute the probability of y and maximize the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, output the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
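The score s(X, y) and the argmax decoding of step S33 can be sketched in NumPy. P and T are small hand-made matrices here, and exhaustive enumeration stands in for the Viterbi algorithm normally used in a CRF layer (an illustrative simplification); the start/end boundary transitions are omitted:

```python
import itertools
import numpy as np

def score(P, T, y):
    """s(X, y): emission scores P[i, y_i] from the BiLSTM output
    plus transition scores T[y_i, y_{i+1}] between adjacent labels."""
    s = sum(P[i, yi] for i, yi in enumerate(y))
    s += sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def decode(P, T):
    """Step S33: return y* = argmax over all label sequences (brute force)."""
    n, k = P.shape
    return max(itertools.product(range(k), repeat=n),
               key=lambda y: score(P, T, y))

P = np.array([[2.0, 0.1], [0.1, 2.0], [2.0, 0.1]])   # 3 characters, 2 labels
T = np.zeros((2, 2))                                 # neutral transitions
y_star = decode(P, T)
# with neutral transitions, y* follows the per-character maxima: (0, 1, 0)
```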

Claims (8)

1. A Tibetan word segmentation method based on a deep neural network, characterized by comprising:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
2. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that the step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
3. The Tibetan word segmentation method based on a deep neural network according to claim 1, characterized in that the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
4. The Tibetan word segmentation method based on a deep neural network according to claim 3, characterized in that the processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: computing the probability of y and maximizing the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, outputting the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
5. A Tibetan word segmentation device based on a deep neural network, characterized by comprising a memory, a processor, and a program stored in the memory and executed by the processor, the processor implementing the following steps when executing the program:
Step S1: receiving original Tibetan text, and splitting it at the syllable delimiters therein to obtain a syllable sequence;
Step S2: feeding the obtained syllable sequence into a contracted-word recognition model to obtain the basic unit sequence for segmentation;
Step S3: feeding the basic unit sequence into a Tibetan word segmentation model based on a deep neural network for processing, and finally obtaining a basic unit sequence with sequence labels as the segmentation result.
6. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that the step S2 specifically comprises:
Step S21: feeding the obtained syllable sequence into the contracted-word recognition model to identify the contracted words in the syllable sequence;
Step S22: judging whether each contracted word is a case particle, and if so, treating it as a segmentation mark;
Step S23: segmenting the syllable sequence at each obtained segmentation mark to obtain the basic unit sequence for segmentation.
7. The Tibetan word segmentation device based on a deep neural network according to claim 5, characterized in that the Tibetan word segmentation model based on a deep neural network comprises:
a vector embedding layer, for converting each character in the obtained basic unit sequence into a vector;
a BiLSTM network layer, connected to the vector embedding layer, for outputting a score matrix from the vectorized basic unit sequence;
a CRF layer, connected to the BiLSTM network layer, for obtaining, from the score matrix output by the BiLSTM network layer, a basic unit sequence with sequence labels as the segmentation result.
8. The Tibetan word segmentation device based on a deep neural network according to claim 7, characterized in that the processing of the CRF layer specifically comprises:
Step S31: for a Tibetan sentence X = (x_1, x_2, …, x_n), computing its overall score:
s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where s(X, y) is the overall score; x_1, x_2, …, x_n are the vectors into which the characters of the Tibetan sentence X are converted; T is the transition score matrix and P is the output score matrix of the BiLSTM; T_{y_i, y_{i+1}} is the score of transitioning from label y_i to label y_{i+1}; P_{i, y_i} is the score of assigning the y_i-th label to the i-th character; n is the number of characters in X; and y = (y_1, y_2, …, y_n) is the predicted label sequence of X;
Step S32: computing the probability of y and maximizing the log-probability of the correct label sequence:
p(y | X) = exp(s(X, y)) / Σ_{ỹ∈Y_X} exp(s(X, ỹ))
log p(y | X) = s(X, y) − log Σ_{ỹ∈Y_X} exp(s(X, ỹ))
where p(y | X) is the probability of y, ỹ is one possible label sequence of X, and Y_X is the set of all possible label sequences of X;
Step S33: when decoding, outputting the sequence with the highest score:
y* = argmax_{ỹ∈Y_X} s(X, ỹ)
where y* is the predicted sequence.
CN201811614940.4A 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network Pending CN109960782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811614940.4A CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811614940.4A CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Publications (1)

Publication Number Publication Date
CN109960782A true CN109960782A (en) 2019-07-02

Family

ID=67023408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811614940.4A Pending CN109960782A (en) 2018-12-27 2018-12-27 Tibetan word segmentation method and device based on a deep neural network

Country Status (1)

Country Link
CN (1) CN109960782A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN116245096A (en) * 2022-12-09 2023-06-09 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUIDAN LIU ET AL.: "Tibetan word segmentation as syllable tagging using conditional random field", 25TH PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION *
ZHANG ZIRUI ET AL.: "A Chinese word segmentation method based on the BI-LSTM-CRF model", Journal of Changchun University of Science and Technology (Natural Science Edition) *
LI YACHAO ET AL.: "Research and implementation of a Tibetan automatic word segmentation method based on conditional random fields", Journal of Chinese Information Processing *
CHEN WEI ET AL.: "Automatic keyword extraction based on BiLSTM-CRF", Computer Science *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN116245096A (en) * 2022-12-09 2023-06-09 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list
CN116245096B (en) * 2022-12-09 2024-02-20 西南民族大学 Tibetan word segmentation evaluation set construction method based on local word list

Similar Documents

Publication Publication Date Title
Baniata et al. A Neural Machine Translation Model for Arabic Dialects That Utilizes Multitask Learning (MTL).
CN109003601A A cross-language end-to-end speech recognition method for the low-resource Tujia language
CN107577662A A semantic understanding system and method oriented to Chinese text
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
KR102043353B1 (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN112541356B (en) Method and system for recognizing biomedical named entities
Adel et al. Features for factored language models for code-Switching speech.
CN112784604A (en) Entity linking method based on entity boundary network
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
Sun et al. VCWE: visual character-enhanced word embeddings
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN109960782A Tibetan word segmentation method and device based on a deep neural network
Hung Vietnamese diacritics restoration using deep learning approach
Tachibana et al. Accent estimation of Japanese words from their surfaces and romanizations for building large vocabulary accent dictionaries
Ni et al. Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors.
CN110852063B (en) Word vector generation method and device based on bidirectional LSTM neural network
CN110866404B (en) Word vector generation method and device based on LSTM neural network
dos Santos et al. Training state-of-the-art Portuguese POS taggers without handcrafted features
CN112634878A (en) Speech recognition post-processing method and system and related equipment
Olivo et al. CRFPOST: Part-of-Speech Tagger for Filipino Texts using Conditional Random Fields
Samir et al. Training and evaluation of TreeTagger on Amazigh corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190702