CN109558569A - Lao part-of-speech tagging method based on a BiLSTM+CRF model - Google Patents

Lao part-of-speech tagging method based on a BiLSTM+CRF model

Info

Publication number
CN109558569A
Authority
CN
China
Prior art keywords
lstm
speech
bilstm
word
crf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811531266.3A
Other languages
Chinese (zh)
Inventor
周兰江
王兴金
张建安
周枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201811531266.3A
Publication of CN109558569A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/103: Formatting, i.e. changing of presentation of documents
    • G06F 40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/221: Parsing markup language streams
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The present invention relates to a Lao part-of-speech tagging method based on a BiLSTM+CRF model, and belongs to the field of natural language processing and machine learning. BiLSTM is built on the LSTM structure and can use contextual information for part-of-speech tagging. When a sentence to be tagged is fed into the BiLSTM, the network computes a part-of-speech probability distribution for each word in the sentence; the traditional approach then selects the highest-probability part of speech from each distribution as the tagging result. However, this ignores the interaction between parts of speech, for example a verb cannot directly follow a classifier (measure word). A CRF model is therefore introduced to solve this problem, and the CRF model is connected to the output layer of the BiLSTM. Using the Lao part-of-speech tagging model based on BiLSTM and CRF, Lao text can be tagged effectively, so the present invention has definite research significance.

Description

Lao part-of-speech tagging method based on a BiLSTM+CRF model
Technical field
The present invention relates to a Lao part-of-speech tagging method based on a BiLSTM+CRF model, and belongs to the field of natural language processing and machine learning.
Background technique
Part-of-speech tagging is the process of determining the best part of speech for each word in a sentence. It is one of the preprocessing steps of many natural language processing tasks and prepares the ground for subsequent work such as syntactic analysis and information extraction. Early research relied on hand-written rules, but rule construction is very cumbersome. Statistical methods therefore developed; the models used in early statistical research include the hidden Markov model, the conditional random field (CRF) model and the maximum entropy model. With the rise of deep learning, research has shifted toward deep-learning-based part-of-speech tagging and has achieved good results. However, this technical approach had not previously been studied for Lao part-of-speech tagging, and the model used here was built by the inventors themselves.
Summary of the invention
The object of the present invention is to provide a Lao part-of-speech tagging method based on a BiLSTM+CRF model, which combines the bidirectional long short-term memory recurrent neural network (BiLSTM) technique from deep learning with the traditional statistical conditional random field (CRF) method, applies them to Lao part-of-speech tagging, and achieves good results in experiments.
The technical solution adopted by the present invention is a Lao part-of-speech tagging method based on a BiLSTM+CRF model, comprising the following steps:
Step 1: Building the BiLSTM+CRF model
The Lao part-of-speech tagging model based on BiLSTM and CRF comprises five layers: an input layer, a forward LSTM layer, a backward LSTM layer, a fully connected layer and a CRF layer;
(1) Input layer:
The input layer receives a Lao sentence with n words, W1…Wt…Wn. Before entering the BiLSTM, the words must be converted into numerical form so they can be processed, so the input layer builds a word-vector matrix; each Lao word finds its corresponding word vector in this matrix. The values of the word vector represent the features of the word, and the word vector then represents the word as it is fed into the corresponding LSTM units in the forward LSTM layer and the backward LSTM layer, where the word information is computed;
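For illustration, the word-vector lookup performed by the input layer can be sketched as follows; PyTorch is assumed here, and the vocabulary size, embedding size and word ids are illustrative placeholders rather than values specified by the invention.

```python
import torch
import torch.nn as nn

# Illustrative word-vector matrix: vocabulary size and embedding size are
# placeholder assumptions, not values from the patent.
vocab_size, embedding_dim = 5000, 100
embedding = nn.Embedding(num_embeddings=vocab_size,
                         embedding_dim=embedding_dim,
                         padding_idx=0)

# A Lao sentence W1 ... Wn arrives as integer word ids before entering the BiLSTM.
sentence_ids = torch.tensor([[12, 57, 3]])   # shape (batch=1, n=3); ids are stand-ins
word_vectors = embedding(sentence_ids)       # shape (1, 3, embedding_dim)
```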
(2) Forward LSTM layer:
The forward LSTM layer consists of LSTM units, each of which decides which information to keep, output and discard. The word vector of each word in the Lao sentence from the input layer is fed in order into its corresponding LSTM unit, and the LSTM units are connected in the forward order of the input. Each LSTM unit outputs two pieces of word information, both represented as matrices: forward state information FS and forward output information FH. The forward state information is passed along within the layer and takes part in the computation of the next LSTM unit, while the forward output information is sent to the fully connected layer to compute the part-of-speech probability distribution;
(3) Backward LSTM layer:
The backward LSTM layer also consists of LSTM units. The word vector of each word in the Lao sentence from the input layer is fed in order into its corresponding LSTM unit, but the LSTM units are connected in reverse input order. Each LSTM unit outputs two pieces of word information: backward state information BS and backward output information BH. The backward state information is passed along within the layer and takes part in the computation of the next LSTM unit, while the backward output information is sent to the fully connected layer to compute the part-of-speech probability distribution;
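The forward and backward LSTM layers described above can be pictured as two chains of LSTM units walked in opposite directions over the sentence. The minimal sketch below uses PyTorch LSTM cells; the hidden size and the stand-in word vectors are illustrative assumptions only.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 100, 128               # illustrative sizes
fwd_cell = nn.LSTMCell(embedding_dim, hidden_dim)  # forward LSTM units
bwd_cell = nn.LSTMCell(embedding_dim, hidden_dim)  # backward LSTM units

word_vectors = torch.randn(3, embedding_dim)       # stand-in word vectors of a 3-word sentence

# Forward layer: units connected in reading order; the cell state plays the role
# of FS (handed on to the next unit), the hidden output plays the role of FH.
fh = []
h = c = torch.zeros(1, hidden_dim)
for x in word_vectors:
    h, c = fwd_cell(x.unsqueeze(0), (h, c))
    fh.append(h)                                   # FH_t goes on to the fully connected layer

# Backward layer: the same computation, with the units connected in reverse order.
bh = []
h = c = torch.zeros(1, hidden_dim)
for x in word_vectors.flip(0):
    h, c = bwd_cell(x.unsqueeze(0), (h, c))
    bh.append(h)                                   # BH_t, collected back to front
bh.reverse()                                       # re-align BH_t with word position t
```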
(4) Fully connected layer:
The fully connected layer consists of simple neural network units. Each unit receives the forward output information FH and the backward output information BH produced by the forward and backward LSTM layers; after FH and BH pass through the unit's computation, the part-of-speech probability distribution is obtained;
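A minimal sketch of one such fully connected unit, applied at every word position, is given below; the hidden size and tag-set size are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden_dim, num_tags = 128, 30                 # illustrative sizes (tag set assumed)
fc = nn.Linear(2 * hidden_dim, num_tags)       # one shared unit applied at every word position

fh = torch.randn(3, hidden_dim)                # forward output information FH for a 3-word sentence
bh = torch.randn(3, hidden_dim)                # backward output information BH
scores = fc(torch.cat([fh, bh], dim=-1))       # per-word part-of-speech scores, shape (3, num_tags)
probs = scores.softmax(dim=-1)                 # part-of-speech probability distribution of each word
```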
(5) CRF layer:
After the fully connected layer produces the probability distribution of each word, the CRF model uses these distributions to compute the best part-of-speech tag sequence for the sentence. While ensuring that a higher-probability part of speech is selected from each distribution, the CRF layer also takes into account the mutual influence between parts of speech;
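The CRF layer's search for the best tag sequence can be illustrated with a hand-written Viterbi decoder over per-word emission scores and tag-to-tag transition scores; this is a simplified sketch under assumed tensor shapes, not the exact implementation of the invention.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Best tag sequence under per-word emission scores plus tag-to-tag
    transition scores; the transitions are how the CRF accounts for the
    interaction between adjacent parts of speech."""
    n, num_tags = emissions.shape
    score = emissions[0].clone()                 # best path score ending in each tag
    backptr = []
    for t in range(1, n):
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = total.max(dim=0)            # best previous tag for each current tag
        backptr.append(idx)
    best_last = int(score.argmax())
    path = [best_last]
    for idx in reversed(backptr):
        path.append(int(idx[path[-1]]))
    return list(reversed(path))

# Illustrative use: 3 words, 4 candidate tags (values are stand-ins).
emissions = torch.randn(3, 4)                    # output of the fully connected layer
transitions = torch.randn(4, 4)                  # learned tag-transition scores
best_tags = viterbi_decode(emissions, transitions)
```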
Step 2: Training the BiLSTM+CRF model
The BiLSTM+CRF model is trained on a Lao document-level part-of-speech tagging corpus, that is, a set of Lao articles whose words have been annotated with parts of speech. Training first uses a sentence-level log-likelihood function to measure the gap between the part-of-speech probability distributions produced by the fully connected layer and the true part-of-speech distribution in the corpus, and then uses the Adam algorithm to reduce this gap, thereby training the parameters of the BiLSTM+CRF model until the model becomes stable, i.e. until the gap approaches 0. Once the model is stable, the Lao part-of-speech tagging model based on BiLSTM and CRF is obtained: a sentence to be tagged is fed into the input layer of the model, and the CRF layer outputs the part of speech of each word in the sentence.
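The sentence-level log-likelihood loss and the Adam update can be sketched as follows. The hand-rolled CRF negative log-likelihood below is a simplified stand-in: in the full model the emission scores would come from the BiLSTM and fully connected layers, and the optimizer would hold all model parameters, not free tensors.

```python
import torch

def crf_nll(emissions, transitions, tags):
    """Sentence-level negative log-likelihood: log partition function over
    all tag sequences minus the score of the annotated (gold) sequence."""
    n, num_tags = emissions.shape
    # Score of the gold path.
    gold = emissions[0, tags[0]]
    for t in range(1, n):
        gold = gold + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # Log partition function via the forward algorithm.
    alpha = emissions[0]
    for t in range(1, n):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0) - gold

# Illustrative training step: the "gap" (loss) is reduced with Adam.
num_tags = 4
transitions = torch.randn(num_tags, num_tags, requires_grad=True)
emissions = torch.randn(3, num_tags, requires_grad=True)   # stand-in for BiLSTM + fc output
gold_tags = torch.tensor([1, 0, 2])                        # annotated parts of speech (stand-in)

optimizer = torch.optim.Adam([transitions, emissions], lr=1e-3)
loss = crf_nll(emissions, transitions, gold_tags)
loss.backward()
optimizer.step()
```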
The beneficial effects of the present invention are:
1. The present invention employs the BiLSTM structure from deep learning, which is effective at learning the information before and after each position in a sentence.
2. The present invention uses a CRF model, which takes into account the mutual influence between parts of speech; connected after the last layer of the BiLSTM structure, it is highly effective for selecting parts of speech.
3. Experimental results show that the part-of-speech tagging accuracy of the proposed Lao part-of-speech tagging model is higher than that of all the traditional statistical models.
Description of the drawings
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 shows the BiLSTM+CRF model on an example.
Specific embodiments
To describe the present invention in more detail and to make it easier for those skilled in the art to understand, the invention is further described below with reference to the accompanying drawings and an embodiment. The embodiment in this part is intended only to illustrate the invention and is not to be taken as limiting it.
Embodiment 1: as shown in Figs. 1-2, the specific steps of the Lao part-of-speech tagging method based on the BiLSTM+CRF model are as follows:
Step 1: The BiLSTM+CRF model
As shown in Fig. 2, a 3-word Lao sentence, glossed "the Ministry of Finance says", is used to explain the BiLSTM+CRF model and its workflow.
(1) Input layer:
The input layer receives the 3 words of the sentence; each word looks up its own word vector in the word-vector matrix. The word vectors of the 3 words then enter the corresponding LSTM units in the forward and backward LSTM layers, where the word information is computed;
(2) Forward LSTM layer:
The forward LSTM layer consists of 3 LSTM units (L). The word vector of each word of the example sentence (glossed "the Ministry of Finance says") enters its corresponding L unit, which computes the word information: forward state information (FS) and forward output information (FH). Taking L1 as an example, the word vector of the first word of the sentence (glossed "national treasury") enters L1 and yields forward state information FS1 and forward output information FH1. FS1 is passed along within the layer and takes part in the computation of the next LSTM unit L2, while FH1 is sent to the fully connected layer to compute the part-of-speech probability distribution;
(3) Backward LSTM layer:
The backward LSTM layer consists of 3 LSTM units (R). It works in the same way as the forward LSTM layer, except that the LSTM units are connected in reverse input order;
(4) Fully connected layer:
The fully connected layer consists of 3 simple neural network units (Cells), and each Cell receives the forward and backward output information (FH and BH) as input. Taking Cell2 as an example of this layer's computation and output: Cell2 receives BH2 from the backward LSTM layer and FH2 from the forward LSTM layer as input values, and from them computes the part-of-speech probability distribution of the second word (glossed "declares"), which is then passed to the CRF layer;
(5) CRF layer:
The fully connected layer thus yields the part-of-speech probability distribution of every word in the sentence (glossed "the Ministry of Finance says"); this set of distributions is fed into the CRF layer. While ensuring that a higher-probability part of speech is selected from each distribution, the CRF layer also takes into account the mutual influence between parts of speech;
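Putting the example together, a compact sketch of the input layer, the two LSTM layers and the fully connected layer for a 3-word sentence might look as follows. The class name, the word ids and all sizes are illustrative assumptions; the resulting per-word scores would then be handed to the CRF/Viterbi step sketched earlier.

```python
import torch
import torch.nn as nn

class BiLstmEncoder(nn.Module):
    """Input layer + forward/backward LSTM layers + fully connected layer.
    All sizes and the example word ids are illustrative placeholders."""
    def __init__(self, vocab_size=5000, embedding_dim=100, hidden_dim=128, num_tags=30):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids):
        vectors = self.embedding(word_ids)       # input layer: word-vector lookup
        outputs, _ = self.bilstm(vectors)        # FH and BH, concatenated per word
        return self.fc(outputs)                  # per-word part-of-speech scores

model = BiLstmEncoder()
sentence_ids = torch.tensor([[101, 102, 103]])   # the 3-word example sentence (ids are stand-ins)
emissions = model(sentence_ids)                  # shape (1, 3, num_tags), handed to the CRF layer
```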
Step 2: Training the BiLSTM+CRF model
The BiLSTM+CRF model is trained on a Lao document-level part-of-speech tagging corpus, that is, a set of Lao articles whose words have been annotated with parts of speech. Training first uses a sentence-level log-likelihood function to measure the gap between the part-of-speech probability distributions produced by the fully connected layer and the true part-of-speech distribution in the corpus, and then uses the Adam algorithm to reduce this gap, thereby training the parameters of the BiLSTM+CRF model until the model becomes stable, i.e. until the gap approaches 0. Once the model is stable, the Lao part-of-speech tagging model based on BiLSTM and CRF is obtained: a sentence to be tagged is fed into the input layer of the model, and the CRF layer outputs the part of speech of each word in the sentence.
In this embodiment, the Lao document-level part-of-speech tagging corpus consists of Lao articles annotated with parts of speech; as an example, one sentence taken from an article is glossed "the United States does not announce its advertising expenditure".
BiLSTM is built on the LSTM structure. LSTM is a recurrent neural network over time and is well suited to tasks in which relevant events in a time series are separated by long intervals, such as machine translation, image recognition and part-of-speech tagging. Because the LSTM structure recurses along one direction in time, a unidirectional LSTM used for part-of-speech tagging of a sentence can only exploit information from one direction and cannot make use of the full context. BiLSTM was introduced precisely to solve this problem: it can use contextual information on both sides for part-of-speech tagging. When a sentence to be tagged is fed into the BiLSTM, it computes a part-of-speech probability distribution for each word in the sentence; the traditional approach then selects the highest-probability part of speech from each distribution as the tagging result. However, this ignores the interaction between parts of speech, for example a verb cannot directly follow a classifier. A CRF model is therefore introduced to solve this problem and is connected to the output layer of the BiLSTM. Using the Lao part-of-speech tagging model based on BiLSTM and CRF, Lao text can be tagged effectively, so the present invention has definite research significance.
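The kind of part-of-speech interaction mentioned above, for example that a verb should not directly follow a classifier, can be expressed through the CRF's transition scores. The tag set and the hard penalty below are purely illustrative assumptions, not values from the invention.

```python
import torch

tags = ["noun", "verb", "classifier", "adjective"]   # illustrative tag set
num_tags = len(tags)
transitions = torch.zeros(num_tags, num_tags)

# One constraint the CRF can learn or be given directly:
# a verb should not immediately follow a classifier.
transitions[tags.index("classifier"), tags.index("verb")] = -1e4

# Any Viterbi or forward computation over these transition scores will now
# avoid tag sequences that place a verb right after a classifier.
```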
The embodiment of the present invention has been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiment; various changes may be made within the knowledge of a person skilled in the art without departing from the concept of the present invention.

Claims (1)

1. A Lao part-of-speech tagging method based on a BiLSTM+CRF model, characterized by comprising the following steps:
Step 1: Building the BiLSTM+CRF model
The Lao part-of-speech tagging model based on BiLSTM and CRF comprises five layers: an input layer, a forward LSTM layer, a backward LSTM layer, a fully connected layer and a CRF layer;
(1) Input layer:
The input layer receives a Lao sentence with n words, W1…Wt…Wn. Before entering the BiLSTM, the words must be converted into numerical form so they can be processed, so the input layer builds a word-vector matrix; each Lao word finds its corresponding word vector in this matrix. The values of the word vector represent the features of the word, and the word vector then represents the word as it is fed into the corresponding LSTM units in the forward LSTM layer and the backward LSTM layer, where the word information is computed;
(2) Forward LSTM layer:
The forward LSTM layer consists of LSTM units, each of which decides which information to keep, output and discard. The word vector of each word in the Lao sentence from the input layer is fed in order into its corresponding LSTM unit, and the LSTM units are connected in the forward order of the input. Each LSTM unit outputs two pieces of word information, both represented as matrices: forward state information FS and forward output information FH. The forward state information is passed along within the layer and takes part in the computation of the next LSTM unit, while the forward output information is sent to the fully connected layer to compute the part-of-speech probability distribution;
(3) Backward LSTM layer:
The backward LSTM layer also consists of LSTM units. The word vector of each word in the Lao sentence from the input layer is fed in order into its corresponding LSTM unit, but the LSTM units are connected in reverse input order. Each LSTM unit outputs two pieces of word information: backward state information BS and backward output information BH. The backward state information is passed along within the layer and takes part in the computation of the next LSTM unit, while the backward output information is sent to the fully connected layer to compute the part-of-speech probability distribution;
(4) Fully connected layer:
The fully connected layer consists of simple neural network units. Each unit receives the forward output information FH and the backward output information BH produced by the forward and backward LSTM layers; after FH and BH pass through the unit's computation, the part-of-speech probability distribution is obtained;
(5) CRF layer:
After the fully connected layer produces the probability distribution of each word, the CRF model uses these distributions to compute the best part-of-speech tag sequence for the sentence. While ensuring that a higher-probability part of speech is selected from each distribution, the CRF layer also takes into account the mutual influence between parts of speech;
Step 2: Training the BiLSTM+CRF model
The BiLSTM+CRF model is trained on a Lao document-level part-of-speech tagging corpus, that is, a set of Lao articles whose words have been annotated with parts of speech. Training first uses a sentence-level log-likelihood function to measure the gap between the part-of-speech probability distributions produced by the fully connected layer and the true part-of-speech distribution in the corpus, and then uses the Adam algorithm to reduce this gap, thereby training the parameters of the BiLSTM+CRF model until the model becomes stable, i.e. until the gap approaches 0. Once the model is stable, the Lao part-of-speech tagging model based on BiLSTM and CRF is obtained: a sentence to be tagged is fed into the input layer of the Lao part-of-speech tagging model, and the CRF layer outputs the part of speech of each word in the sentence.
CN201811531266.3A 2018-12-14 2018-12-14 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model Pending CN109558569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811531266.3A CN109558569A (en) 2018-12-14 2018-12-14 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811531266.3A CN109558569A (en) 2018-12-14 2018-12-14 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model

Publications (1)

Publication Number Publication Date
CN109558569A true CN109558569A (en) 2019-04-02

Family

ID=65870089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811531266.3A Pending CN109558569A (en) 2018-12-14 2018-12-14 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model

Country Status (1)

Country Link
CN (1) CN109558569A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
US20180225281A1 (en) * 2017-02-06 2018-08-09 Thomson Reuters Global Resources Unlimited Company Systems and Methods for Automatic Semantic Token Tagging
CN107622050A (en) * 2017-09-14 2018-01-23 武汉烽火普天信息技术有限公司 Text sequence labeling system and method based on Bi LSTM and CRF
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhiheng Huang et al., "Bidirectional LSTM-CRF Models for Sequence Tagging", arXiv:1508.01991v1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN113468890A (en) * 2021-07-20 2021-10-01 南京信息工程大学 Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
CN113468890B (en) * 2021-07-20 2023-05-26 南京信息工程大学 Sedimentology literature mining method based on NLP information extraction and part-of-speech rules

Similar Documents

Publication Publication Date Title
WO2021155699A1 (en) Global encoding method for automatic abstract of chinese long text
CN108763284A (en) A kind of question answering system implementation method based on deep learning and topic model
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN109558569A (en) A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model
CN112183058B (en) Poetry generation method and device based on BERT sentence vector input
CN108491386A (en) natural language understanding method and system
CN110457661B (en) Natural language generation method, device, equipment and storage medium
CN110188175A (en) A kind of question and answer based on BiLSTM-CRF model are to abstracting method, system and storage medium
CN107679225A (en) A kind of reply generation method based on keyword
CN110188348A (en) A kind of Chinese language processing model and method based on deep neural network
Dethlefs Domain transfer for deep natural language generation from abstract meaning representations
CN109508457A (en) A kind of transfer learning method reading series model based on machine
CN112287106A (en) Online comment emotion classification method based on dual-channel hybrid neural network
CN110334196A (en) Neural network Chinese charater problem based on stroke and from attention mechanism generates system
CN114444481B (en) Sentiment analysis and generation method of news comment
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
Zhang et al. Performance comparisons of Bi-LSTM and Bi-GRU networks in Chinese word segmentation
Zhang et al. Keyword-driven image captioning via Context-dependent Bilateral LSTM
Wang The application of intelligent speech recognition technology in the tone correction of college piano teaching
Wen et al. Visual prompt tuning for few-shot text classification
Wang Research on the art value and application of art creation based on the emotion analysis of art
Yulin et al. High school math text similarity studies based on CNN and BiLSTM
Wang et al. Multimodal Feature Fusion and Emotion Recognition Based on Variational Autoencoder
Wu et al. A text emotion analysis method using the dual-channel convolution neural network in social networks
Xia et al. Generating Questions Based on Semi-Automated and End-to-End Neural Network.

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190402