CN105740226A - Method for implementing Chinese segmentation by using tree neural network and bilateral neural network - Google Patents
- Publication number
- CN105740226A CN105740226A CN201610037336.4A CN201610037336A CN105740226A CN 105740226 A CN105740226 A CN 105740226A CN 201610037336 A CN201610037336 A CN 201610037336A CN 105740226 A CN105740226 A CN 105740226A
- Authority
- CN
- China
- Prior art keywords
- sentence
- term memory
- long term
- neural networks
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a method, a system, a device and a computer program for implementing Chinese word segmentation using a tree neural network and a bidirectional neural network. The method comprises: converting each character of an input sentence into a character vector to form a first input sequence; feeding the first input sequence to a three-layer long short-term memory (LSTM) neural network, i.e. the tree neural network, with a sentence vector serving as the initial value of each hidden layer, to generate a second input sequence; feeding the second input sequence to a bidirectional LSTM neural network, again with the sentence vector as the initial value of each hidden layer, to generate a third input sequence; feeding the third input sequence to a logSoftMax layer, i.e. a multi-class classification layer; and finally generating the segmentation label sequence.
Description
Technical field
The invention belongs to the field of natural language processing and relates to a method for implementing Chinese word segmentation using a tree-structured neural network and a bidirectional neural network.
Background technology
Conventional Chinese word segmentation techniques include character-by-character traversal, segmentation methods based on dictionary matching, full-segmentation methods, and methods based on word-frequency statistics, all of which are algorithmic approaches. Traditional approaches also include two well-known model-based methods, the hidden Markov model and the conditional random field model, both of which map an input sequence to a target label sequence; the conditional random field model generally performs better than the hidden Markov model. With the growth of computing power and the maturation of neural network models, a method for implementing Chinese word segmentation using a tree-structured neural network and a bidirectional neural network is proposed here.
Summary of the invention
It is an object of the invention to propose, at least to some extent, a method for implementing Chinese word segmentation based on neural networks, and to illustrate how the segmentation label sequence corresponding to an input sentence is generated.
In order to achieve the above object, the technical solution adopted by the invention is as follows: obtain an input sentence and convert each character of the sentence into a character vector to form the first input sequence; pass the first input sequence to a three-layer long short-term memory (LSTM) neural network, i.e. the tree-structured neural network, to produce the second input sequence, thereby extracting phrase and semantic information; pass the second input sequence to a bidirectional LSTM neural network, whose hidden-layer initial states are initialized in a special way, to produce the third input sequence, thereby extracting each character's contextual information; and pass the third input sequence to a logSoftMax layer, i.e. a multi-class classification layer, to obtain the final segmentation label sequence. In order to capture tree-structured information, each sub-network must first be trained individually before the whole neural network is trained.
The details of some embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the method for implementing Chinese word segmentation with tree-structured and bidirectional neural networks will become apparent from the description, the drawings, and the claims.
Accompanying drawing explanation
Fig. 1 shows the overall neural network structure.
Fig. 2 shows part of the three-layer long short-term memory neural network.
Fig. 3 shows a bidirectional long short-term memory neural network.
Detailed description of the invention
The overall technical solution and the whole neural network are described clearly and completely below with reference to the accompanying drawings of the invention.
The present disclosure provides a technical solution for Chinese word segmentation based on neural networks, comprising four parts: converting the sentence into vectors; training the three-layer long short-term memory network, i.e. the tree-structured part of the network; training the bidirectional long short-term memory network; and training the whole network.
Fig. 1 shows the whole flow from the input sentence to the output of the final segmentation label sequence. The input sentence is an example input to the system that converts a sentence into a character-vector input sequence; the systems, components, and techniques described below can be implemented therein.
Characters are converted into character vectors, which can be obtained in two ways. 1) Treat the character vectors as parameters included in the neural network and learn them while training the whole network; however, vectors obtained this way show no obvious relationship between similar Chinese characters, and possibly no relationship at all. 2) Pre-train a vector vocabulary with a mature neural network, such as word2vec or GloVe; vectors trained by these two algorithms exhibit certain linear relationships, or clear non-linear relationships, between similar characters or words, so the similar words of a given word can be found from its vector. So that the vectors carry more semantics, the invention uses GloVe to train a 300-dimensional vector vocabulary.
Count the number N of characters in the corpus; represent each character with a one-hot vector (a vector of dimension N in which exactly one position is 1 and all others are 0); use the one-hot vector to look up the vector corresponding to the character; and finally convert the sentence into its vector representation.
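The lookup above can be sketched as follows; the toy vocabulary, the dimensions, and the helper names are our illustration, not taken from the patent:

```python
import numpy as np

vocab = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}  # toy corpus alphabet, N = 5
N, d = len(vocab), 300                             # 300-dim GloVe-style vectors
E = np.random.randn(N, d)                          # pretrained vector table

def one_hot(index, size):
    """Vector of dimension `size` with a single 1 at `index`, 0 elsewhere."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def sentence_to_matrix(sentence):
    """Look up each character's vector via one_hot @ E and stack the rows."""
    return np.stack([one_hot(vocab[ch], N) @ E for ch in sentence])

X = sentence_to_matrix("ABCDE")  # shape (5, 300): one row per character
```

Multiplying a one-hot vector by the table E simply selects the matching row, which is exactly the lookup the text describes.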
Fig. 2 shows part of the three-layer long short-term memory neural network. Each LSTM layer consists of 100 standard LSTM (long short-term memory) nodes. A standard LSTM mainly processes variable-length sequences and solves the long-range dependency problem; it includes three gates: an input gate, a forget gate, and an output gate. Using a multi-layer LSTM network is equivalent to defining a tree-structured neural network.
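The standard LSTM cell with its three gates can be sketched as follows; this is a minimal numpy illustration, and the weight shapes and gate ordering in the stacked matrices are our assumptions, not the patent's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) bias; stacked gate order: input, forget, output, candidate."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0*H:1*H])   # input gate: how much new information enters
    f = sigmoid(z[1*H:2*H])   # forget gate: how much old cell state survives
    o = sigmoid(z[2*H:3*H])   # output gate: how much of the cell is exposed
    g = np.tanh(z[3*H:4*H])   # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.standard_normal((4*H, D))
U = rng.standard_normal((4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((6, D)):   # run over a length-6 toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate's multiplicative path through the cell state c is what lets gradients flow over long distances, which is the long-range dependency property the text refers to.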
To give the three-layer long short-term memory network its tree-structured function, the input for training this sub-network is the sentence vector sequence and the target is a sequence representation of the syntactic parse tree of the input sentence. For example: input = { "use tree-like neural network and bidirectional neural network to implement Chinese word segmentation" }, target = { "(ROOT (IP (VP (VP (VV use) (NP (NP (NN tree-like) (NN neural) (NN network)) (CC and) (NP (ADJP (JJ bidirectional)) (NP (NN neural) (NN network))))) (VP (VV implement) (NP (NN Chinese) (NN segmentation))))))" }. When training this sub-network individually, a linear transformation layer and a logSoftMax layer must be added on top of it so that the output of the 100-node standard LSTM layer can be matched to the tree sequence representation, which amounts to encoding and decoding. The initial hidden state of a conventional LSTM network is all zeros or very small random numbers; for the initial state of this three-layer LSTM network, the invention instead uses sentence2vec (a neural network algorithm that converts a sentence into a vector) to generate a sentence vector representing the input sentence, converts the sentence vector to a vector of the same dimension as the hidden layer by multiplying it by a matrix parameter, and obtains the matrix parameter by training the whole network.
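Using the bracketed parse tree as a sequence target requires flattening it into tokens. A minimal sketch of such a tokenisation; the regular expression and function name are our illustration, not specified by the patent:

```python
import re

def linearize(tree):
    """Split a bracketed parse into a flat token sequence usable as a
    seq2seq target: each parenthesis, label, and word becomes one token."""
    return re.findall(r"\(|\)|[^\s()]+", tree)

toks = linearize("(ROOT (VP (VV use) (NP (NN neural) (NN network))))")
# The brackets stay balanced in the token stream, so the tree is recoverable.
```

Because opening and closing brackets survive as tokens, the decoder's output sequence can be parsed back into a tree, which is the "encoding and decoding" correspondence mentioned above.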
Fig. 3 shows the bidirectional long short-term memory (BIDIRECTIONAL-LSTM) neural network. A bidirectional LSTM comprises one recurrent LSTM that processes the sequence front-to-back and one that processes it back-to-front; each recurrent LSTM consists of standard LSTM memory units of a specified length and block count, and the maximum sequence length adopted here is 100. Each unit includes an input gate, a forget gate, and an output gate, i.e. a standard LSTM memory unit. A bidirectional LSTM can capture information on both sides of each character and therefore captures semantics better. The initial state of the hidden layers at both ends of a standard BIDIRECTIONAL-LSTM is all zeros or very small random numbers; as with the three-layer LSTM network above, the invention uses sentence2vec (a neural network algorithm that converts a sentence into a vector) to generate a sentence vector representing the input sentence, converts the sentence vector to a vector of the same dimension as the hidden layer by multiplying it by a matrix parameter, and obtains the matrix parameter by training the whole network.
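The bidirectional pass with sentence-vector initialisation of the hidden states can be sketched as follows; sentence2vec is replaced here by a random vector, and the projection matrix M stands in for the trained matrix parameter (all shapes are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, S, T = 4, 3, 6, 5   # input dim, hidden dim, sentence-vector dim, length

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(xs, h0, W, U, b):
    """Run one direction over xs, starting from the hidden state h0."""
    h, c, out = h0, np.zeros(H), []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        c = f * c + i * np.tanh(z[3*H:])
        h = o * np.tanh(c)
        out.append(h)
    return out

s = rng.standard_normal(S)         # stand-in for the sentence2vec vector
M = rng.standard_normal((H, S))    # stand-in for the trained matrix parameter
h0 = M @ s                         # hidden state initialised from the sentence

xs = list(rng.standard_normal((T, D)))
mk = lambda: (rng.standard_normal((4*H, D)),
              rng.standard_normal((4*H, H)), np.zeros(4*H))
fwd = lstm_pass(xs, h0, *mk())            # front-to-back direction
bwd = lstm_pass(xs[::-1], h0, *mk())[::-1]  # back-to-front, re-aligned
bi = [np.concatenate(p) for p in zip(fwd, bwd)]  # 2H features per character
```

Concatenating the two directions gives each position features from both its left and right context, which is exactly the advantage claimed for the bidirectional network.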
For the output sequence in Fig. 1, a logSoftMax layer, i.e. a multi-class classification layer, is placed on the output sequence. Each output position produces a column vector of dimension 4 representing the BEMS tags, where B (Begin) marks a word-initial character, E (End) a word-final character, M (Middle) a word-internal character, and S a single-character word. The maximum probability value is taken and the tag at the corresponding position is found; this tag is exactly the tag of the character at the corresponding position of the input sentence. The same operation is applied to every output, and the final segmentation label sequence is obtained.
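The per-position decoding can be sketched as follows; the 4-dimensional score columns are toy values, not real network outputs:

```python
import numpy as np

TAGS = ["B", "E", "M", "S"]   # Begin, End, Middle, Single

def log_softmax(z):
    """Numerically stable log-softmax of one score column."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def decode(columns):
    """Take the highest-probability BEMS tag of each position independently."""
    return "".join(TAGS[int(np.argmax(log_softmax(c)))] for c in columns)

columns = np.array([[9, 1, 1, 1],   # strongest score at B
                    [1, 9, 1, 1],   # strongest score at E
                    [1, 1, 1, 9],   # strongest score at S
                    [9, 1, 1, 1],
                    [1, 9, 1, 1]], dtype=float)
print(decode(columns))   # -> "BESBE"
```

Since log-softmax is monotonic, the argmax over log-probabilities equals the argmax over raw scores; the log form matters only for the training loss, not for decoding.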
To train the bidirectional long short-term memory (BIDIRECTIONAL-LSTM) network and the logSoftMax classification layer individually, the input sentence is first passed through the previously trained three-layer long short-term memory network, whose output serves as the input to the bidirectional LSTM network; the target is the segmentation label sequence corresponding to the sentence.
The above is a complete description of the whole neural network structure and its processing. Finally, the whole network must be trained before it can be used. The input is a sentence and the target is a segmentation label sequence, e.g. input = { "use tree-like neural network and bidirectional neural network to implement Chinese word segmentation" } (rendered here in English; the BEMS tags correspond one-to-one to the characters of the original Chinese sentence), target = { "BEBEBEBESBEBEBEBEBEBE" }. At inference time, one need only input a sentence to obtain the output segmentation label sequence.
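The BEMS target of a training pair is derived from the gold segmentation of the sentence. A minimal sketch of that derivation; the Chinese word list below is our reconstruction of the translated example sentence, shown only to illustrate the tag format:

```python
# One tag per character: S for a single-character word, otherwise the word
# is tagged B (begin), zero or more M (middle), then E (end).
def bems_labels(words):
    out = []
    for w in words:
        out.append("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
    return "".join(out)

# Reconstructed segmentation of the example sentence (an assumption):
words = ["使用", "树形", "神经", "网络", "和",
         "双向", "神经", "网络", "实现", "中文", "分词"]
print(bems_labels(words))   # -> "BEBEBEBESBEBEBEBEBEBE"
```

Note the single-character word 和 ("and") produces the lone S in the middle of the tag string, matching the example target above.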
Although this specification contains some implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of particular embodiments. All equivalent changes made according to the idea of the invention shall fall within the protection scope of the invention.
Claims (10)
1. A method for implementing Chinese word segmentation using tree-structured and bidirectional neural networks, comprising the following steps: obtaining an input sentence, the input sentence comprising a plurality of grammatically ordered inputs; converting each character of the sentence into a character vector using a language model to form a first input sequence; passing the first input sequence to a three-layer long short-term memory neural network, i.e. the tree-structured neural network, while generating a sentence vector from the input sentence as the initialization input of each hidden layer of the three-layer long short-term memory network; training the three-layer long short-term memory network to produce a second input sequence; passing the second input sequence to a bidirectional long short-term memory network, while generating a sentence vector from the input sentence as the initialization input of the bidirectional long short-term memory network's hidden layers, to produce a third input sequence; and passing the third input sequence to a logSoftMax layer, i.e. a multi-class classification layer, to produce the segmentation label sequence of the input sentence.
2. The method according to claim 1, wherein the input sentence is a variable-length input sentence not exceeding a specified length.
3. The method according to claim 1, wherein the language model refers to a neural network model that converts characters or words into vectors.
4. The method according to any one of claims 1 to 3, wherein processing the input sentence comprises: replacing unrecognized items in the input sentence with a designated token to produce a modified input sentence.
5. The method according to claim 1, wherein the sentence vector refers to the vector representation obtained by converting the input sentence with a mature neural network model.
6. The method according to claim 1, wherein the initialization input of the hidden layers comprises the front-to-back and back-to-front initial states of the bidirectional long short-term memory network's hidden layers, as well as the initial state of each layer of the three-layer long short-term memory network, all of which adopt the sentence vector of the sentence.
7. The method according to any one of claims 1 to 6, further comprising: training the three-layer long short-term memory network and the bidirectional long short-term memory network using stochastic gradient descent.
8. The method according to any one of claims 1 to 7, wherein the input sentence is grammatically well formed, and the segmentation label sequence is a character string composed of 4 kinds of tags.
9. The method according to claim 8, wherein the 4 kinds of tags refer to BMES, where B (Begin) denotes a word-initial character, E (End) a word-final character, M (Middle) a word-internal character, and S (Single) a single-character word.
10. The method according to claim 1, wherein training the three-layer long short-term memory network refers to adding an extra linear transformation layer and a logSoftMax layer, taking the vector representation of the sentence as input and the sequence representation of the syntactic parse tree of the sentence as target, and training the network parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610037336.4A CN105740226A (en) | 2016-01-15 | 2016-01-15 | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610037336.4A CN105740226A (en) | 2016-01-15 | 2016-01-15 | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105740226A true CN105740226A (en) | 2016-07-06 |
Family
ID=56246271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610037336.4A Pending CN105740226A (en) | 2016-01-15 | 2016-01-15 | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740226A (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372107A (en) * | 2016-08-19 | 2017-02-01 | 中兴通讯股份有限公司 | Generation method and device of natural language sentence library |
CN106919646A (en) * | 2017-01-18 | 2017-07-04 | 南京云思创智信息科技有限公司 | Chinese text summarization generation system and method |
CN107145483A (en) * | 2017-04-24 | 2017-09-08 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
CN107193865A (en) * | 2017-04-06 | 2017-09-22 | 上海奔影网络科技有限公司 | Natural language is intended to understanding method and device in man-machine interaction |
CN107463928A (en) * | 2017-07-28 | 2017-12-12 | 顺丰科技有限公司 | Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM |
CN107480680A (en) * | 2017-07-28 | 2017-12-15 | 顺丰科技有限公司 | Method, system and the equipment of text information in identification image based on OCR and Bi LSTM |
CN107608970A (en) * | 2017-09-29 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | part-of-speech tagging model generating method and device |
CN107797986A (en) * | 2017-10-12 | 2018-03-13 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on LSTM CNN |
CN107844475A (en) * | 2017-10-12 | 2018-03-27 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on LSTM |
CN107894975A (en) * | 2017-10-12 | 2018-04-10 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on Bi LSTM |
CN107894976A (en) * | 2017-10-12 | 2018-04-10 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on Bi LSTM |
CN107943783A (en) * | 2017-10-12 | 2018-04-20 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on LSTM CNN |
CN107967252A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on Bi-LSTM-CNN |
CN107977354A (en) * | 2017-10-12 | 2018-05-01 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on Bi-LSTM-CNN |
CN107992467A (en) * | 2017-10-12 | 2018-05-04 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on LSTM |
CN108090070A (en) * | 2016-11-22 | 2018-05-29 | 北京高地信息技术有限公司 | A kind of Chinese entity attribute abstracting method |
CN108595428A (en) * | 2018-04-25 | 2018-09-28 | 杭州闪捷信息科技股份有限公司 | The method segmented based on bidirectional circulating neural network |
WO2018232699A1 (en) * | 2017-06-22 | 2018-12-27 | 腾讯科技(深圳)有限公司 | Information processing method and related device |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109685137A (en) * | 2018-12-24 | 2019-04-26 | 上海仁静信息技术有限公司 | A kind of topic classification method, device, electronic equipment and storage medium |
CN109781094A (en) * | 2018-12-24 | 2019-05-21 | 上海交通大学 | Earth magnetism positioning system based on Recognition with Recurrent Neural Network |
CN109791731A (en) * | 2017-06-22 | 2019-05-21 | 北京嘀嘀无限科技发展有限公司 | A kind of method and system for estimating arrival time |
CN109800438A (en) * | 2019-02-01 | 2019-05-24 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109871843A (en) * | 2017-12-01 | 2019-06-11 | 北京搜狗科技发展有限公司 | Character identifying method and device, the device for character recognition |
CN109948149A (en) * | 2019-02-28 | 2019-06-28 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
CN110019784A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | A kind of file classification method and device |
CN110473534A (en) * | 2019-07-12 | 2019-11-19 | 南京邮电大学 | A kind of nursing old people conversational system based on deep neural network |
CN110750986A (en) * | 2018-07-04 | 2020-02-04 | 普天信息技术有限公司 | Neural network word segmentation system and training method based on minimum information entropy |
CN111160009A (en) * | 2019-12-30 | 2020-05-15 | 北京理工大学 | Sequence feature extraction method based on tree-shaped grid memory neural network |
US10769522B2 (en) | 2017-02-17 | 2020-09-08 | Wipro Limited | Method and system for determining classification of text |
US11200269B2 (en) | 2017-06-15 | 2021-12-14 | Microsoft Technology Licensing, Llc | Method and system for highlighting answer phrases |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101566998A (en) * | 2009-05-26 | 2009-10-28 | 华中师范大学 | Chinese question-answering system based on neural network |
CN105068998A (en) * | 2015-07-29 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Translation method and translation device based on neural network model |
CN105185374A (en) * | 2015-09-11 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy annotation method and device |
2016
- 2016-01-15 CN CN201610037336.4A patent/CN105740226A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101566998A (en) * | 2009-05-26 | 2009-10-28 | 华中师范大学 | Chinese question-answering system based on neural network |
CN105068998A (en) * | 2015-07-29 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Translation method and translation device based on neural network model |
CN105185374A (en) * | 2015-09-11 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy annotation method and device |
Non-Patent Citations (1)
Title |
---|
XINCHI CHEN等: "Long Short-Term Memory Neural Networks for Chinese Word Segmentation", 《PROCEEDINGS OF THE 2015 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 * |
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372107B (en) * | 2016-08-19 | 2020-01-17 | 中兴通讯股份有限公司 | Method and device for generating natural language sentence library |
CN106372107A (en) * | 2016-08-19 | 2017-02-01 | 中兴通讯股份有限公司 | Generation method and device of natural language sentence library |
CN108090070A (en) * | 2016-11-22 | 2018-05-29 | 北京高地信息技术有限公司 | A kind of Chinese entity attribute abstracting method |
CN108090070B (en) * | 2016-11-22 | 2021-08-24 | 湖南四方天箭信息科技有限公司 | Chinese entity attribute extraction method |
CN106919646A (en) * | 2017-01-18 | 2017-07-04 | 南京云思创智信息科技有限公司 | Chinese text summarization generation system and method |
CN106919646B (en) * | 2017-01-18 | 2020-06-09 | 南京云思创智信息科技有限公司 | Chinese text abstract generating system and method |
US10769522B2 (en) | 2017-02-17 | 2020-09-08 | Wipro Limited | Method and system for determining classification of text |
CN107193865A (en) * | 2017-04-06 | 2017-09-22 | 上海奔影网络科技有限公司 | Natural language is intended to understanding method and device in man-machine interaction |
CN107193865B (en) * | 2017-04-06 | 2020-03-10 | 上海奔影网络科技有限公司 | Natural language intention understanding method and device in man-machine interaction |
CN107145483A (en) * | 2017-04-24 | 2017-09-08 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
CN107145483B (en) * | 2017-04-24 | 2018-09-04 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
US11200269B2 (en) | 2017-06-15 | 2021-12-14 | Microsoft Technology Licensing, Llc | Method and system for highlighting answer phrases |
CN109791731A (en) * | 2017-06-22 | 2019-05-21 | 北京嘀嘀无限科技发展有限公司 | A kind of method and system for estimating arrival time |
WO2018232699A1 (en) * | 2017-06-22 | 2018-12-27 | 腾讯科技(深圳)有限公司 | Information processing method and related device |
US10789415B2 (en) | 2017-06-22 | 2020-09-29 | Tencent Technology (Shenzhen) Company Limited | Information processing method and related device |
US11079244B2 (en) | 2017-06-22 | 2021-08-03 | Beijing Didi Infinity Technology And Development Co., Ltd. | Methods and systems for estimating time of arrival |
CN107480680A (en) * | 2017-07-28 | 2017-12-15 | 顺丰科技有限公司 | Method, system and the equipment of text information in identification image based on OCR and Bi LSTM |
CN107463928A (en) * | 2017-07-28 | 2017-12-12 | 顺丰科技有限公司 | Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM |
CN110019784B (en) * | 2017-09-29 | 2021-10-15 | 北京国双科技有限公司 | Text classification method and device |
CN107608970A (en) * | 2017-09-29 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | part-of-speech tagging model generating method and device |
CN110019784A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | A kind of file classification method and device |
CN107608970B (en) * | 2017-09-29 | 2024-04-26 | 百度在线网络技术(北京)有限公司 | Part-of-speech tagging model generation method and device |
CN107943783A (en) * | 2017-10-12 | 2018-04-20 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on LSTM CNN |
CN107894976A (en) * | 2017-10-12 | 2018-04-10 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on Bi LSTM |
CN107967252A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on Bi-LSTM-CNN |
CN107894975A (en) * | 2017-10-12 | 2018-04-10 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on Bi LSTM |
CN107977354A (en) * | 2017-10-12 | 2018-05-01 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on Bi-LSTM-CNN |
CN107844475A (en) * | 2017-10-12 | 2018-03-27 | 北京知道未来信息技术有限公司 | A kind of segmenting method based on LSTM |
CN107797986B (en) * | 2017-10-12 | 2020-12-11 | 北京知道未来信息技术有限公司 | LSTM-CNN-based mixed corpus word segmentation method |
CN107992467A (en) * | 2017-10-12 | 2018-05-04 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on LSTM |
CN107797986A (en) * | 2017-10-12 | 2018-03-13 | 北京知道未来信息技术有限公司 | A kind of mixing language material segmenting method based on LSTM CNN |
CN109871843A (en) * | 2017-12-01 | 2019-06-11 | 北京搜狗科技发展有限公司 | Character identifying method and device, the device for character recognition |
CN109871843B (en) * | 2017-12-01 | 2022-04-08 | 北京搜狗科技发展有限公司 | Character recognition method and device for character recognition |
CN108595428A (en) * | 2018-04-25 | 2018-09-28 | 杭州闪捷信息科技股份有限公司 | The method segmented based on bidirectional circulating neural network |
CN110750986A (en) * | 2018-07-04 | 2020-02-04 | 普天信息技术有限公司 | Neural network word segmentation system and training method based on minimum information entropy |
CN110750986B (en) * | 2018-07-04 | 2023-10-10 | 普天信息技术有限公司 | Neural network word segmentation system and training method based on minimum information entropy |
CN109388806B (en) * | 2018-10-26 | 2023-06-27 | 北京布本智能科技有限公司 | Chinese word segmentation method based on deep learning and forgetting algorithm |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109781094A (en) * | 2018-12-24 | 2019-05-21 | 上海交通大学 | Earth magnetism positioning system based on Recognition with Recurrent Neural Network |
CN109685137A (en) * | 2018-12-24 | 2019-04-26 | 上海仁静信息技术有限公司 | A kind of topic classification method, device, electronic equipment and storage medium |
CN109800438B (en) * | 2019-02-01 | 2020-03-31 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109800438A (en) * | 2019-02-01 | 2019-05-24 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109948149A (en) * | 2019-02-28 | 2019-06-28 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
CN110473534A (en) * | 2019-07-12 | 2019-11-19 | 南京邮电大学 | A kind of nursing old people conversational system based on deep neural network |
CN111160009A (en) * | 2019-12-30 | 2020-05-15 | 北京理工大学 | Sequence feature extraction method based on tree-shaped grid memory neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105740226A (en) | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network | |
CN107168957A (en) | A kind of Chinese word cutting method | |
CN105930314B (en) | System and method is generated based on coding-decoding deep neural network text snippet | |
CN107516041B (en) | WebShell detection method and system based on deep neural network | |
CN108009154B (en) | Image Chinese description method based on deep learning model | |
CN106502985B (en) | neural network modeling method and device for generating titles | |
CN109101235A (en) | A kind of intelligently parsing method of software program | |
CN107291836B (en) | Chinese text abstract obtaining method based on semantic relevancy model | |
CN104699797B (en) | A kind of web page data structured analysis method and device | |
CN111177394A (en) | Knowledge map relation data classification method based on syntactic attention neural network | |
CN105938485A (en) | Image description method based on convolution cyclic hybrid model | |
CN109213975B (en) | Twitter text representation method based on character level convolution variation self-coding | |
CN104778224B (en) | A kind of destination object social networks recognition methods based on video semanteme | |
CN106547735A (en) | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning | |
CN111538848A (en) | Knowledge representation learning method fusing multi-source information | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN106844327B (en) | Text coding method and system | |
CN109359297A (en) | A kind of Relation extraction method and system | |
CN106919557A (en) | A kind of document vector generation method of combination topic model | |
CN107092594B (en) | Bilingual recurrence self-encoding encoder based on figure | |
CN105956158B (en) | The method that network neologisms based on massive micro-blog text and user information automatically extract | |
JP2018067199A (en) | Abstract generating device, text converting device, and methods and programs therefor | |
CN111625276B (en) | Code abstract generation method and system based on semantic and grammar information fusion | |
CN113487024A (en) | Alternate sequence generation model training method and method for extracting graph from text | |
CN112560456A (en) | Generation type abstract generation method and system based on improved neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
DD01 | Delivery of document by public notice | ||
DD01 | Delivery of document by public notice |
Addressee: Nanjing University Document name: the First Notification of an Office Action |
|
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20160706 |