CN106547737A - Sequence labeling method in natural language processing based on deep learning - Google Patents

Sequence labeling method in natural language processing based on deep learning Download PDF

Info

Publication number
CN106547737A
CN106547737A
Authority
CN
China
Prior art keywords
label
network
vector
training
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610950893.5A
Other languages
Chinese (zh)
Other versions
CN106547737B (en)
Inventor
郑骁庆
陈易
林孟潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201610950893.5A priority Critical patent/CN106547737B/en
Publication of CN106547737A publication Critical patent/CN106547737A/en
Application granted granted Critical
Publication of CN106547737B publication Critical patent/CN106547737B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention belongs to the technical field of computer natural language processing, and specifically relates to a sequence labeling method in natural language processing based on deep learning. The invention can be used for sequence labeling tasks in various natural languages, including Chinese word segmentation, English shallow parsing, Chinese and English part-of-speech tagging, and named entity recognition. Using deep learning, for an input sentence, a computer program outputs the tag type of each constituent unit in the sentence. The key elements of the sequence labeling method include: a fast sequence labeling network structure and learning algorithm based on deep learning, a network structure and acceleration algorithm incorporating preceding-label information, and the way these key techniques are combined. A system implemented on this basis has a small parameter scale and runs fast, making it very suitable for environments with limited computing resources; it can be deployed on mobile computing platforms with relatively limited resources, such as mobile phones, and can significantly improve system response time and user satisfaction.

Description

Sequence labeling method in natural language processing based on deep learning
Technical field
The invention belongs to the technical field of computer natural language processing, and in particular relates to a sequence labeling method in natural language processing.
Background technology
Deep learning is a recent breakthrough in artificial intelligence research. It ended a situation in which artificial intelligence had gone as long as ten years without breakthrough progress, and it is rapidly making an impact in industry. Deep learning differs from narrow artificial intelligence systems (functional simulations oriented toward particular tasks): as a general-purpose artificial intelligence technology, it can cope with a wide variety of situations and problems. It has been applied with great success in fields such as image recognition and speech recognition, and has also achieved notable results in natural language processing (mainly for English). Deep learning is currently the most effective way to realize artificial intelligence, and also the implementation that has achieved the greatest results.
Compared with conventional techniques, systems implemented with deep learning also have the advantages of a small parameter scale and fast running speed, making them very suitable for environments with limited computing resources.
In the field of natural language processing, for sequence labeling problems (including Chinese word segmentation, English shallow parsing, Chinese and English part-of-speech tagging, named entity recognition, and other tasks), existing deep-network methods can already reach performance similar to traditional methods, but their models still contain many parameters, their running time is still long, and labeling performance needs further improvement. To address these problems, the present invention proposes a new fast sequence labeling method based on deep learning, which substantially reduces the time required to train and use the labeling network while incorporating preceding tag types to improve labeling accuracy.
Summary of the invention
It is an object of the present invention to propose a sequence labeling method for natural language processing with high labeling accuracy and short running time.
The sequence labeling method in natural language processing provided by the present invention uses a computer to read an input sentence and, according to the tag set defined for the task, selects the corresponding tag type for each constituent unit in the sentence (a character or a word) in order of appearance. The method can be used for sequence labeling tasks in various natural languages, including Chinese word segmentation, English shallow parsing, Chinese and English part-of-speech tagging, and named entity recognition.
Sequence labeling means that, given the tag set for a task and an input sentence, a computer program outputs the tag type of each constituent unit in the sentence (e.g., a Chinese character or an English word). Taking Chinese word segmentation as an example, four tags, B, I, E and S, are generally used, representing the starting character of a word, a middle character, the ending character, and a single character forming a word by itself. If the input is "我喜欢计算机。" ("I like computers."), the correct labeling result is "S B E B I E S" (punctuation marks are generally also treated as constituent units), segmenting the sentence into "我/喜欢/计算机/。". The characteristics of the sequence labeling method of the present invention are fast labeling speed, low system configuration requirements (suitable for devices with limited computing and storage resources), and high accuracy.
The sequence labeling method in natural language processing provided by the present invention comprises the following steps:
(1) Train a vector representation for each constituent unit of the language concerned (e.g., a Chinese character or an English word). This vector representation can be generated randomly or pre-trained with an unsupervised method (e.g., the word2vec tool [1] can be used for English words, and the method described in reference [2] for Chinese characters). After training, each unit is converted into its corresponding vector representation by looking it up in the vector table.
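As an illustration of this lookup step, the following minimal Python/NumPy sketch builds a randomly initialized vector table and converts units into vectors. The class and all names are hypothetical, not an interface defined by the patent; the random rows stand in for vectors that could instead come from unsupervised pre-training as described above.

```python
import numpy as np

class VectorTable:
    """Hypothetical sketch of the unit-to-vector lookup table."""
    def __init__(self, units, dim=100, seed=0):
        rng = np.random.default_rng(seed)
        self.index = {u: i for i, u in enumerate(units)}
        # Random initialization; in practice these rows could come from
        # unsupervised pre-training (e.g. word2vec [1]) and are further
        # adjusted during supervised training.
        self.vectors = rng.uniform(-0.1, 0.1, size=(len(units), dim))

    def lookup(self, unit):
        # Unseen units fall back to the special padding row 0.
        return self.vectors[self.index.get(unit, 0)]

table = VectorTable(["<PAD>", "我", "喜", "欢", "计", "算", "机", "。"])
v = table.lookup("计")   # the 100-dimensional vector for the character 计
```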
(2) Define the tag sets of the various sequence labeling tasks, determining which tags each sequence labeling task includes. Taking Chinese word segmentation as an example, a tag set containing B, I, E and S can be used, representing the beginning character of a word, a middle character, the ending character, and a single character forming a word by itself.
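For concreteness, a small sketch (the function name is illustrative) showing how the B/I/E/S labels of a segmented Chinese sentence are derived; it reproduces the "S B E B I E S" example used in this document:

```python
def bies_tags(words):
    """Map a segmented sentence (a list of words) to per-character
    B/I/E/S labels as defined above."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # a single character forming a word by itself
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

# "我 / 喜欢 / 计算机 / 。"  ->  ['S', 'B', 'E', 'B', 'I', 'E', 'S']
print(bies_tags(["我", "喜欢", "计算机", "。"]))
```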
(3) Prepare corpora for sequence labeling tasks in natural language processing such as Chinese word segmentation, English shallow parsing, part-of-speech tagging, and named entity recognition.
(4) Train the network using the fast sequence labeling network structure (shown in Fig. 1) or the network structure incorporating preceding-label information (shown in Fig. 2), with the Perceptron-style algorithm or with the Perceptron-style algorithm combined with the Max-margin method.
If the network is trained with the fast sequence labeling network structure and learning algorithm based on deep learning, the fast sequence labeling network structure is as shown in Fig. 1. Each constituent unit in the sentence (e.g., a Chinese character or an English word) is converted into its corresponding vector representation by looking it up in the vector table; the vectors of each constituent unit and its surrounding units are concatenated into a window feature matrix; a one-dimensional convolution converts the window feature matrix into a window feature vector representation; the window feature vector then passes through a nonlinear transformation followed by a linear transformation, producing a vector whose dimension equals the number of task labels, each element representing the likelihood of the corresponding label; finally, combining the label transition probability matrix, the Viterbi decoding algorithm finds the most probable label sequence as the labeling result.
The concrete implementation is as follows. The label of a constituent unit is usually related to its surrounding units, so the network adopts a window model: when estimating the probability that the current unit belongs to a certain label, the unit and its surroundings are taken as input. If the window size is set to 5, the current unit together with the two units on its left and the two on its right forms the input window. If there are not enough units on the left or right to fill the window, special padding symbols are used instead.
Each unit in an input sentence is converted into its corresponding vector representation by looking it up in the vector table. The representation of each unit can be generated randomly or pre-trained with an unsupervised method (e.g., the word2vec tool [1] can be used for English words, and the method described in reference [2] for Chinese characters). The parameters stored in the vector table can also be adjusted continually during training. These vectors are then concatenated into a feature matrix whose number of columns equals the window size, each column being the vector representation of the corresponding unit.
A one-dimensional convolution operation is then applied to the feature matrix. One-dimensional convolution means that each row vector of the feature matrix is dotted with a corresponding parameter vector (convolution kernel), with different rows using different kernels. Under this one-dimensional convolution, the feature matrix is converted into a vector with the same dimension as a unit vector; this vector can be regarded as the semantic feature representation of the current unit under the influence of the surrounding units in the window. Using one-dimensional convolution not only reduces the model's parameters but also shortens the time needed to train and use the model. For example, compared with the methods described in references [3] and [4], the number of parameters required by the model drops from d × w × h to d × w, where d is the dimension of the unit (character or word) vectors, w is the window size, and h is the number of hidden neurons in the middle layer.
After passing through a linear network layer (the middle hidden layer), a nonlinear transformation is applied using the Sigmoid or hardTanh function; finally, another linear layer outputs a vector whose length equals the number of task labels, each element representing the likelihood of the corresponding label.
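The forward computation of the last three paragraphs can be sketched as follows, a minimal NumPy illustration under assumed sizes (all names are hypothetical, and random weights stand in for trained parameters). Note that the one-dimensional convolution uses only the d × w parameters of K, whereas a dense first layer would need d × w × h:

```python
import numpy as np

d, w, h, n_labels = 100, 5, 50, 4   # assumed sizes: vector dim, window, hidden units, labels
rng = np.random.default_rng(0)
K  = rng.uniform(-0.1, 0.1, (d, w))                  # conv kernels, one per row: d*w parameters
W1 = rng.uniform(-0.1, 0.1, (h, d)); b1 = np.zeros(h)                 # hidden linear layer
W2 = rng.uniform(-0.1, 0.1, (n_labels, h)); b2 = np.zeros(n_labels)   # output linear layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_scores(F):
    """F: d x w window feature matrix (each column is one unit's vector).
    Returns one score per label for the window's centre unit."""
    v = (F * K).sum(axis=1)     # 1-D convolution: row i of F dotted with kernel row i
    hid = sigmoid(W1 @ v + b1)  # linear layer followed by a nonlinearity (Sigmoid here)
    return W2 @ hid + b2        # final linear layer: one value per label
```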
Given a sentence, as the window slides from left to right the network outputs a matrix; each element f_θ(t|i) of the matrix represents the estimated probability that the i-th unit of the sentence belongs to label t, where θ denotes the parameters of the network. In sequence labeling tasks there is a strong dependence between consecutive labels, so a matrix A_ij is introduced to represent the probability of jumping from label i to label j (it is also included in the parameter set θ). Given a sentence s[1:n] containing n units, a score can be computed for any label sequence t[1:n] of the same length:
Score(s[1:n], t[1:n], θ) = Σ_{i=1}^{n} ( A_{t[i-1], t[i]} + f_θ(t[i] | i) )    (Formula 1)
Given the network parameters, the Viterbi decoding algorithm can be used to find the highest-scoring label sequence as the labeling result.
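A minimal sketch of this decoding step, assuming the additive scoring of Formula 1 (array names are illustrative):

```python
import numpy as np

def viterbi(scores, A):
    """scores: n x T matrix of f_theta(t|i); A: T x T transition scores,
    A[i, j] for moving from label i to label j. Returns the label sequence
    maximizing Formula 1 (the first position uses only the network score)."""
    n, T = scores.shape
    dp = scores[0].copy()                 # best score of a path ending in each label
    back = np.zeros((n, T), dtype=int)    # argmax pointers for backtracking
    for i in range(1, n):
        cand = dp[:, None] + A + scores[i][None, :]   # prev-label x next-label grid
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    path = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```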
The training method requires that, on the training set, the probability of each sample's correct label sequence is maximized:
max_θ Σ_{(s,t)∈T} log p(t | s, θ),  where  p(t | s, θ) = exp(Score(s, t, θ)) / Σ_{t'} exp(Score(s, t', θ))    (Formula 2)
where (s, t) denotes a sample in the training set. Training uses gradient descent, and all network parameters are updated with the following formula:
θ ← θ + λ · ∂ log p(t | s, θ) / ∂θ    (Formula 3)
where λ denotes the learning step size.
When computing the partial derivatives on the right side of Formula 3, in order to avoid the exponential computations exceeding the range of double-precision numbers and to reduce the computational complexity, a Perceptron-style algorithm is used: only the direction of each parameter adjustment is computed, with its magnitude fixed at 1, which simplifies the parameter adjustment and speeds up training. The concrete procedure is as follows: under the current network parameters, the highest-scoring label sequence is compared with the correct label sequence; where they disagree, the partial derivative at the output position of the incorrect label sequence is set to -1, and that at the corresponding position of the correct label sequence to +1. The same partial-derivative computation also applies to the transfer matrix A_ij. In addition, the Max-margin method is used when training the model parameters: the correct label sequence is required not only to have the highest score, but its score must also exceed the highest score of any incorrect label sequence by a specified threshold.
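One plausible reading of the combined Perceptron-style and Max-margin step is sketched below, reusing the viterbi() sketch above: the margin is imposed by cost-augmented decoding, the +/-1 directions for the network outputs would be backpropagated through the network, and the transition matrix is adjusted directly. The patent's exact formulation may differ:

```python
import numpy as np

def perceptron_margin_grads(scores, A, gold, margin=1.0):
    """scores: n x T network outputs; A: T x T transitions; gold: correct
    label sequence. Returns +/-1 gradient directions for the outputs and
    for A, following the Perceptron-style rule described above."""
    n, T = scores.shape
    aug = scores + margin                 # Max-margin: every incorrect label
    aug[np.arange(n), gold] -= margin     # must lose by at least `margin`
    pred = viterbi(aug, A)                # best (cost-augmented) sequence

    g_out = np.zeros_like(scores)
    g_A = np.zeros_like(A)
    for i in range(n):
        if pred[i] != gold[i]:            # only where the sequences disagree
            g_out[i, gold[i]] += 1.0      # +1 at the correct output position
            g_out[i, pred[i]] -= 1.0      # -1 at the incorrect one
        if i > 0 and (pred[i-1], pred[i]) != (gold[i-1], gold[i]):
            g_A[gold[i-1], gold[i]] += 1.0   # same rule for the transfer matrix
            g_A[pred[i-1], pred[i]] -= 1.0
    return g_out, g_A   # g_out is backpropagated; A is adjusted directly
```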
If the network is trained with the network structure and acceleration algorithm incorporating preceding-label information, the key part of the network structure is as shown in Fig. 2. Each constituent unit in the sentence (e.g., a Chinese character or an English word) and each task label are converted into corresponding vector representations by looking them up in vector tables; the vectors of each constituent unit and its surrounding units are concatenated into a window feature matrix, and each possible label vector is concatenated in turn with the window feature matrix to produce a window feature matrix containing the preceding label; a one-dimensional convolution converts this matrix into a higher-level feature vector representation; this vector then passes through a nonlinear transformation followed by a linear transformation, producing a vector whose dimension equals the number of task labels, each element representing the likelihood of the corresponding label; finally, combining the label transition probability matrix, the Viterbi decoding algorithm finds the most probable label sequence as the labeling result.
The concrete implementation is as follows. Through a vector table, each label is also given a vector representation, and each possible label vector is placed side by side with the feature matrix of the current window; a similar one-dimensional convolution is then applied to produce the corresponding semantic feature representation (under the hypothesis that the preceding unit carries that label). For each possible preceding label of every constituent unit of a sentence, the network outputs a vector whose length equals the number of task labels, each element again representing the likelihood of the corresponding label. Combining the transfer matrix A_ij, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result.
When computing the window semantic feature representations under the different assumed preceding labels, intermediate results can be shared, which speeds up the network computation. The concrete acceleration method is (the actual computation steps are shown by the numbers in Fig. 2): first compute the overlapping part, i.e., the intermediate result obtained without considering the preceding label; then compute the part affected by the different label vectors; finally, add the label-dependent part to the intermediate result to obtain the final result.
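The sharing of intermediate results can be sketched as follows (a hypothetical NumPy illustration): the convolution over the window columns is computed once, and only the contribution of the label vector is recomputed for each of the T possible preceding labels:

```python
import numpy as np

d, w, T = 100, 5, 4                       # assumed sizes
rng = np.random.default_rng(0)
K_win = rng.uniform(-0.1, 0.1, (d, w))    # kernel columns for the window units
k_lab = rng.uniform(-0.1, 0.1, d)         # kernel column for the appended label vector
L = rng.uniform(-0.1, 0.1, (T, d))        # one vector per possible preceding label

def conv_all_preceding_labels(F):
    """F: d x w window feature matrix. Returns a T x d array of window
    feature vectors, one per assumed preceding label."""
    shared = (F * K_win).sum(axis=1)      # step 1: overlapping part, computed once
    label_part = L * k_lab                # step 2: label-dependent part, T x d
    return shared[None, :] + label_part   # step 3: add to obtain the final results
```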
In sequence labeling tasks of natural language processing, the tag type of the current unit is related not only to its context (the surrounding units) but also to its preceding tag type. For example, in the Chinese string "联盟" ("alliance"), if the label of "联" is B (the beginning of a word), the label of "盟" may be E (the end of a word), i.e., "联盟" is a word, or it may be I (the middle of a word), as in "联盟党"; but if the label of "联" is I, the label of "盟" is most probably E, as in "南联盟" (the Federal Republic of Yugoslavia). The same situation arises in English part-of-speech tagging; for example, "work" can be a noun or a verb. In the phrase "that work", if "that" is tagged as a determiner, "work" is likely a noun, whereas if "that" is tagged as a relative pronoun, "work" is likely a verb. Considering the preceding tag type in the model can therefore improve the accuracy of all kinds of sequence labeling tasks.
(5) After new corpora are added or existing ones extended, the parameters can be fine-tuned with the same training algorithm starting from the trained network parameters, or the network can be completely retrained. The detailed training method is as described in step (4).
(6) After network training finishes, given the network parameters, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result.
The above sequence labeling method based on deep learning is characterized by:
(1) In the deep network for natural language sequence labeling, one-dimensional convolution is used to produce the semantic feature representation of a window, reducing the number of parameters of the network model and shortening the training and running time of the network;
(2) A deep network structure and acceleration algorithm incorporating preceding-label information for sequence labeling tasks;
(3) A deep network training algorithm combining the Perceptron-style algorithm with Max-margin, which both improves the training effect and speeds up the computation of parameter adjustments, thereby reducing the time needed to train and re-customize the network;
(4) A suggested network configuration with character or word vector dimensions of 50-300, a window size of 3 or 9, and Sigmoid or hardTanh as the nonlinear layer's activation function.
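For concreteness, a hypothetical configuration sketch reflecting point (4); the keys are illustrative, not an interface defined by the patent:

```python
# Suggested settings from point (4); the chosen values lie in the stated ranges.
suggested_config = {
    "unit_vector_dim": 100,     # suggested range: 50-300
    "window_size": 5,           # suggested 3 or 9; 5 is used in this document's examples
    "nonlinearity": "Sigmoid",  # Sigmoid or hardTanh
}
```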
Effects of the invention
After training on training sets containing representative domain samples, the model in the system developed with the sequence labeling method based on deep learning disclosed by the present invention achieves the performance shown in Table 1 on the test sets:
Table 1. Comparison of model labeling performance
Table 1 also compares the performance of currently typical network models. Conv-S denotes the sequence labeling method based on deep learning disclosed by the present invention (without considering the preceding label), and Conv-J denotes the network results incorporating preceding-label information. English part-of-speech tagging uses the accuracy metric, while the other three tasks use the F1 metric. F1 is computed as 2PR/(P + R), where P is precision and R is recall. Table 2 compares the running speed of each model: for every model other than Conv-S, it lists the required time as a multiple of that of the Conv-S model. From Tables 1 and 2 it can be seen that Conv-J performs best across the various tasks, while Conv-S, with a greatly reduced running time, shows very competitive performance.
Table 2. Comparison of time required for labeling
Explanation of terms
Natural language processing: an important branch of computer science and artificial intelligence that studies theories and methods enabling effective communication between humans and computers in natural language. Natural language processing is usually not the study of natural language itself, but the development of computer systems, especially software systems, that can effectively realize natural language communication;
Sequence labeling: according to a given tag set, for an input sentence, a computer program outputs the tag type of each constituent unit in the sentence (e.g., a Chinese character or an English word). Taking Chinese word segmentation as an example, four tags, B, I, E and S, are generally used, representing the starting character of a word, a middle character, the ending character, and a single character forming a word by itself. If the input is "我喜欢计算机。", the correct labeling result is "S B E B I E S" (punctuation marks are generally also treated as constituent units), segmenting the sentence into "我/喜欢/计算机/。".
Description of the drawings
Fig. 1. The fast sequence labeling deep network structure based on one-dimensional convolution.
Fig. 2. Key local structure of the sequence labeling deep network incorporating preceding-label information.
Specific embodiment
The invention discloses a method for automatically performing sequence labeling in natural language processing with a computer. For an input sentence, according to the tag set defined for the task, the corresponding tag type is selected for each constituent unit in the sentence (a character or a word) in order of appearance; this is called a sequence labeling task in the field of natural language processing. Sequence labeling tasks can serve various natural language processing tasks such as Chinese word segmentation, English shallow parsing, Chinese and English part-of-speech tagging, and named entity recognition. The concrete steps are as follows:
(1) Assign a vector representation to each constituent unit of the language concerned (e.g., a Chinese character or an English word). This vector representation can be generated randomly or pre-trained with an unsupervised method (e.g., the word2vec tool [1] can be used for English words, and the method described in reference [2] for Chinese characters); after training, each unit can be converted into its corresponding vector representation by looking it up in the vector table.
(2) Define the tag sets of the various sequence labeling tasks, determining which tags each sequence labeling task includes. Taking Chinese word segmentation as an example, a tag set containing B, I, E and S can be used, representing the beginning character of a word, a middle character, the ending character, and a single character forming a word by itself.
(3) Prepare corpora for sequence labeling tasks in natural language processing such as Chinese word segmentation, English shallow parsing, part-of-speech tagging, and named entity recognition. Taking Chinese word segmentation as an example, the first column is a character or punctuation mark of the sentence and the second column is its corresponding label; in the whole corpus, adjacent sentences are separated by a blank line:
我 S
喜 B
欢 E
计 B
算 I
机 E
。 S
(4) Train the network using the fast sequence labeling network structure (shown in Fig. 1) or the network structure incorporating preceding-label information (shown in Fig. 2), with the Perceptron-style algorithm or with the Perceptron-style algorithm combined with the Max-margin method.
If the fast sequence labeling network structure and learning algorithm based on deep learning are used, the fast sequence labeling network structure is as shown in Fig. 1. Specifically: each constituent unit in the sentence (e.g., a Chinese character or an English word) is converted into its corresponding vector representation by looking it up in the vector table; the vectors of each constituent unit and its surrounding units are concatenated into a window feature matrix; a one-dimensional convolution converts the window feature matrix into a window feature vector representation; the window feature vector then passes through a nonlinear transformation followed by a linear transformation, producing a vector whose dimension equals the number of task labels, each element representing the likelihood of the corresponding label; finally, combining the label transition probability matrix, the Viterbi decoding algorithm finds the most probable label sequence as the labeling result.
The concrete implementation is as follows. The label of a constituent unit is usually related to its surrounding units, so the network adopts a window model: when estimating the probability that the current unit belongs to a certain label, the unit and its surroundings are taken as input. If the window size is set to 5, the current unit together with the two units on its left and the two on its right forms the input window. If there are not enough units on the left or right to fill the window, special padding symbols are used instead.
Each unit in an input sentence is converted into its corresponding vector representation by looking it up in the vector table; these vectors are then concatenated into a feature matrix whose number of columns equals the window size, each column being the vector representation of the corresponding unit. A one-dimensional convolution operation is then applied to the feature matrix; one-dimensional convolution means that each row vector of the feature matrix is dotted with a corresponding parameter vector (convolution kernel), with different rows using different kernels. Under this one-dimensional convolution, the feature matrix is converted into a vector with the same dimension as a unit vector; this vector can be regarded as the semantic feature representation of the current unit under the influence of the surrounding units in the window.
After passing through a linear network layer (the middle hidden layer), a nonlinear transformation is applied using the Sigmoid or hardTanh function; finally, another linear layer outputs a vector whose length equals the number of task labels, each element representing the likelihood of the corresponding label.
Given a sentence, as the window slides from left to right the network outputs a matrix; each element f_θ(t|i) of the matrix represents the estimated probability that the i-th unit of the sentence belongs to label t, where θ denotes the parameters of the network. In sequence labeling tasks there is a strong dependence between consecutive labels, so a matrix A_ij is introduced to represent the probability of jumping from label i to label j (it is also included in the parameter set θ). Given a sentence s[1:n] containing n units, a score can be computed for any label sequence t[1:n] of the same length:
Score(s[1:n], t[1:n], θ) = Σ_{i=1}^{n} ( A_{t[i-1], t[i]} + f_θ(t[i] | i) )    (Formula 1)
Given the network parameters, the Viterbi decoding algorithm can be used to find the highest-scoring label sequence as the labeling result.
The training method requires that, on the training set, the probability of each sample's correct label sequence is maximized:
max_θ Σ_{(s,t)∈T} log p(t | s, θ),  where  p(t | s, θ) = exp(Score(s, t, θ)) / Σ_{t'} exp(Score(s, t', θ))    (Formula 2)
where (s, t) denotes a sample in the training set. Training uses gradient descent, and all network parameters are updated with the following formula:
θ ← θ + λ · ∂ log p(t | s, θ) / ∂θ    (Formula 3)
where λ denotes the learning step size.
When computing the partial derivatives on the right side of Formula 3, in order to avoid the exponential computations exceeding the range of double-precision numbers and to reduce the computational complexity, a Perceptron-style algorithm is used: only the direction of each parameter adjustment is computed, with its magnitude fixed at 1, which simplifies the parameter adjustment and speeds up training. The concrete procedure is as follows: under the current network parameters, the highest-scoring label sequence is compared with the correct label sequence; where they disagree, the partial derivative at the output position of the incorrect label sequence is set to -1, and that at the corresponding position of the correct label sequence to +1. The same partial-derivative computation also applies to the transfer matrix A_ij. In addition, the Max-margin method is used when training the model parameters: the correct label sequence is required not only to have the highest score, but its score must also exceed the highest score of any incorrect label sequence by a specified threshold.
If the network structure and acceleration algorithm incorporating preceding-label information are used, the key part of the network structure is as shown in Fig. 2. Each constituent unit in the sentence (e.g., a Chinese character or an English word) and each task label are converted into corresponding vector representations by looking them up in vector tables; the vectors of each constituent unit and its surrounding units are concatenated into a window feature matrix, and each possible label vector is concatenated in turn with the window feature matrix to produce a window feature matrix containing the preceding label; a one-dimensional convolution converts this matrix into a higher-level feature vector representation; this vector then passes through a nonlinear transformation followed by a linear transformation, producing a vector whose dimension equals the number of task labels, each element representing the likelihood of the corresponding label; finally, combining the label transition probability matrix, the Viterbi decoding algorithm finds the most probable label sequence as the labeling result.
The concrete implementation is as follows. Through a vector table, each label is also given a vector representation, and each possible label vector is placed side by side with the feature matrix of the current window; a similar one-dimensional convolution is then applied to produce the corresponding semantic feature representation (under the hypothesis that the preceding unit carries that label). For each possible preceding label of every constituent unit of a sentence, the network outputs a vector whose length equals the number of task labels, each element again representing the likelihood of the corresponding label. Combining the transfer matrix A_ij, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result.
When computing the window semantic feature representations under the different assumed preceding labels, intermediate results can be shared, which speeds up the network computation. The concrete acceleration method is (the actual computation steps are shown by the numbers in Fig. 2): first compute the overlapping part, i.e., the intermediate result obtained without considering the preceding label; then compute the part affected by the different label vectors; finally, add the label-dependent part to the intermediate result to obtain the final result.
(5) After new corpora are added or existing ones extended, the parameters can be fine-tuned with the same training algorithm starting from the trained network parameters, or the network can be completely retrained. The detailed training method is as described in step (4).
(6) After network training finishes, given the network parameters, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result.
List of references
[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR abs/1301.3781, 2013.
[2] Xiaoqing Zheng, Jiangtao Feng, Mengxiao Lin, and Wenqiang Zhang. Context-specific and multi-prototype character representations. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), 2016.
[3] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
[4] Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. Deep learning for Chinese word segmentation and POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'13), 2013.
[5] Wenzhe Pei, Tao Ge, and Baobao Chang. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14), 2014.
[6] Pengfei Liu, Shafiq Joty, and Helen Meng. Fine-grained opinion mining with recurrent neural networks and word embeddings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'15), 2015.

Claims (1)

1. A sequence labeling method in natural language processing based on deep learning, in which a computer reads an input sentence and, according to the tag set defined for the task, selects the corresponding tag type for each constituent unit in the sentence, i.e., each character or word, in order of appearance; characterized in that the concrete steps are:
(1) Assign a vector representation to each constituent unit of the language concerned; this vector representation can be generated randomly or pre-trained with an unsupervised method; after training, each unit is converted into its corresponding vector representation by looking it up in the vector table;
(2) Define the tag sets of the various sequence labeling tasks, determining which tags each sequence labeling task includes;
(3) Prepare corpora for sequence labeling tasks in natural language processing such as Chinese word segmentation, English shallow parsing, part-of-speech tagging, and named entity recognition;
(4) Use the fast sequence labeling network structure or the network structure incorporating preceding-label information, and train the network with the Perceptron-style algorithm or with the Perceptron-style algorithm combined with the Max-margin method;
If the network is trained with the fast sequence labeling network structure and learning algorithm based on deep learning: in the fast sequence labeling network structure, the label of a constituent unit is related to its surrounding units, so the network adopts a window model, i.e., when estimating the probability that the current unit belongs to a certain label, the unit and its surroundings are taken as input; if the window size is set to 5, the current unit together with the two units on its left and the two on its right forms the input window; if there are not enough units on the left or right to fill the window, special padding symbols are used instead;
Each unit in an input sentence is converted into its corresponding vector representation by looking it up in the vector table; the representation of each unit is generated randomly or pre-trained with an unsupervised method; the parameters stored in the vector table are also adjusted continually during training; these vectors are then concatenated into a feature matrix whose number of columns equals the window size, each column being the vector representation of the corresponding unit;
A one-dimensional convolution operation is then applied to the feature matrix; one-dimensional convolution means that each row vector of the feature matrix is dotted with a corresponding parameter vector, i.e., a convolution kernel, with different rows using different kernels; under this one-dimensional convolution, the feature matrix is converted into a vector with the same dimension as a unit vector, which can be regarded as the semantic feature representation of the current unit under the influence of the surrounding units;
After passing through a linear network layer, a nonlinear transformation is applied using the Sigmoid or hardTanh function; finally, another linear layer outputs a vector whose length equals the number of task labels, each element representing the likelihood of the corresponding label;
Given a sentence, as the window slides from left to right the network outputs a matrix; each element f_θ(t|i) of the matrix represents the estimated probability that the i-th unit of the sentence belongs to label t, where θ denotes the parameters of the network; in sequence labeling tasks there is a strong dependence between consecutive labels, so a matrix A_ij is introduced to represent the probability of jumping from label i to label j; given a sentence s[1:n] containing n units, a score is computed for a label sequence t[1:n] of the same length:
Score(s[1:n], t[1:n], θ) = Σ_{i=1}^{n} ( A_{t[i-1], t[i]} + f_θ(t[i] | i) )    (Formula 1)
Given the network parameters, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result;
The training method requires that, on the training set, the probability of each sample's correct label sequence is maximized:
max_θ Σ_{(s,t)∈T} log p(t | s, θ),  where  p(t | s, θ) = exp(Score(s, t, θ)) / Σ_{t'} exp(Score(s, t', θ))    (Formula 2)
where (s, t) denotes a sample in the training set; training uses gradient descent, and all network parameters are updated with the following formula:
θ ← θ + λ · ∂ log p(t | s, θ) / ∂θ    (Formula 3)
where λ denotes the learning step size;
When computing the partial derivatives on the right side of Formula 3, the Perceptron-style algorithm is used, i.e., only the direction of each parameter adjustment is computed, with its magnitude fixed at 1; the concrete procedure is as follows: under the current network parameters, the highest-scoring label sequence is compared with the correct label sequence; where they disagree, the partial derivative at the output position of the incorrect label sequence is set to -1 and that at the corresponding position of the correct label sequence to +1; the same partial-derivative computation also applies to the transfer matrix A_ij; in addition, the Max-margin method is used when training the model parameters, i.e., the correct label sequence is required not only to have the highest score, but its score must also exceed the highest score of any incorrect label sequence by a specified threshold;
If the network is trained with the network structure and acceleration algorithm incorporating preceding-label information, the concrete implementation is:
Through a vector table, each label is also given a vector representation, and each possible label vector is placed side by side with the feature matrix of the current window; a similar one-dimensional convolution is then applied to produce the corresponding semantic feature representation; for each possible preceding label of every constituent unit of a sentence, the network outputs a vector whose length equals the number of task labels, each element again representing the likelihood of the corresponding label; combining the transfer matrix A_ij, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result;
When computing the window semantic feature representations under the different assumed preceding labels, intermediate results can be shared, which speeds up the network computation; the concrete acceleration method is: first compute the overlapping part, i.e., the intermediate result obtained without considering the preceding label; then compute the part affected by the different label vectors; finally, add the label-dependent part to the intermediate result to obtain the final result;
(5) After new corpora are added or existing ones extended, the parameters are fine-tuned with the same training algorithm starting from the trained network parameters, or the network is completely retrained; the concrete training method is as described in step (4);
(6) After network training finishes, given the network parameters, the Viterbi decoding algorithm finds the highest-scoring label sequence as the labeling result.
CN201610950893.5A 2016-10-25 2016-10-25 Sequence labeling method in natural language processing based on deep learning Active CN106547737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610950893.5A CN106547737B (en) 2016-10-25 2016-10-25 Sequence labeling method in natural language processing based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610950893.5A CN106547737B (en) 2016-10-25 2016-10-25 Sequence labeling method in natural language processing based on deep learning

Publications (2)

Publication Number Publication Date
CN106547737A true CN106547737A (en) 2017-03-29
CN106547737B CN106547737B (en) 2020-05-12

Family

ID=58392799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610950893.5A Active CN106547737B (en) 2016-10-25 2016-10-25 Sequence labeling method in natural language processing based on deep learning

Country Status (1)

Country Link
CN (1) CN106547737B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN107832302A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107894971A (en) * 2017-10-27 2018-04-10 北京大学 A kind of expansible sequence labelling method based on neutral net
CN108009285A (en) * 2017-12-22 2018-05-08 重庆邮电大学 Forest Ecology man-machine interaction method based on natural language processing
CN108549628A (en) * 2018-03-16 2018-09-18 北京云知声信息技术有限公司 The punctuate device and method of streaming natural language information
CN109635157A (en) * 2018-10-30 2019-04-16 北京奇艺世纪科技有限公司 Model generating method, video searching method, device, terminal and storage medium
CN109976807A (en) * 2019-01-14 2019-07-05 浙江工商大学 A kind of critical packet recognition methods based on software operational network
CN110047463A (en) * 2019-01-31 2019-07-23 北京捷通华声科技股份有限公司 A kind of phoneme synthesizing method, device and electronic equipment
CN110232182A (en) * 2018-04-10 2019-09-13 蔚来汽车有限公司 Method for recognizing semantics, device and speech dialogue system
CN110245353A (en) * 2019-06-20 2019-09-17 腾讯科技(深圳)有限公司 Natural language representation method, device, equipment and storage medium
CN110399614A (en) * 2018-07-26 2019-11-01 北京京东尚科信息技术有限公司 System and method for the identification of true product word
CN110852386A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Data classification method and device, computer equipment and readable storage medium
WO2021017268A1 (en) * 2019-07-30 2021-02-04 平安科技(深圳)有限公司 Double-architecture-based sequence labeling method, device, and computer device
CN112989801A (en) * 2021-05-11 2021-06-18 华南师范大学 Sequence labeling method, device and equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2950306A1 (en) * 2014-05-29 2015-12-02 Samsung Electronics Polska Spolka z organiczona odpowiedzialnoscia A method and system for building a language model
US9298702B1 (en) * 2008-11-18 2016-03-29 Semantic Research Inc. Systems and methods for pairing of a semantic network and a natural language processing information extraction system
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
CN105955953A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Word segmentation system
CN106022239A (en) * 2016-05-13 2016-10-12 电子科技大学 Multi-target tracking method based on recurrent neural network
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9298702B1 (en) * 2008-11-18 2016-03-29 Semantic Research Inc. Systems and methods for pairing of a semantic network and a natural language processing information extraction system
EP2950306A1 (en) * 2014-05-29 2015-12-02 Samsung Electronics Polska Spolka z organiczona odpowiedzialnoscia A method and system for building a language model
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
CN105955953A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Word segmentation system
CN106022239A (en) * 2016-05-13 2016-10-12 电子科技大学 Multi-target tracking method based on recurrent neural network
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN107273355B (en) * 2017-06-12 2020-07-14 大连理工大学 Chinese word vector generation method based on word and phrase joint training
CN107894971B (en) * 2017-10-27 2019-11-26 北京大学 A kind of expansible sequence labelling method neural network based
CN107894971A (en) * 2017-10-27 2018-04-10 北京大学 A kind of expansible sequence labelling method based on neutral net
CN107832302A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107832301B (en) * 2017-11-22 2021-09-17 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer readable storage medium
CN108009285A (en) * 2017-12-22 2018-05-08 重庆邮电大学 Forest Ecology man-machine interaction method based on natural language processing
CN108549628A (en) * 2018-03-16 2018-09-18 北京云知声信息技术有限公司 The punctuate device and method of streaming natural language information
CN110232182B (en) * 2018-04-10 2023-05-16 蔚来控股有限公司 Semantic recognition method and device and voice dialogue system
CN110232182A (en) * 2018-04-10 2019-09-13 蔚来汽车有限公司 Method for recognizing semantics, device and speech dialogue system
CN110399614A (en) * 2018-07-26 2019-11-01 北京京东尚科信息技术有限公司 System and method for the identification of true product word
CN109635157A (en) * 2018-10-30 2019-04-16 北京奇艺世纪科技有限公司 Model generating method, video searching method, device, terminal and storage medium
CN109976807B (en) * 2019-01-14 2022-11-25 深圳游禧科技有限公司 Key package identification method based on software operation network
CN109976807A (en) * 2019-01-14 2019-07-05 浙江工商大学 A kind of critical packet recognition methods based on software operational network
CN110047463B (en) * 2019-01-31 2021-03-02 北京捷通华声科技股份有限公司 Voice synthesis method and device and electronic equipment
CN110047463A (en) * 2019-01-31 2019-07-23 北京捷通华声科技股份有限公司 A kind of phoneme synthesizing method, device and electronic equipment
CN110245353A (en) * 2019-06-20 2019-09-17 腾讯科技(深圳)有限公司 Natural language representation method, device, equipment and storage medium
CN110245353B (en) * 2019-06-20 2022-10-28 腾讯科技(深圳)有限公司 Natural language expression method, device, equipment and storage medium
WO2021017268A1 (en) * 2019-07-30 2021-02-04 平安科技(深圳)有限公司 Double-architecture-based sequence labeling method, device, and computer device
CN110852386A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Data classification method and device, computer equipment and readable storage medium
CN110852386B (en) * 2019-11-13 2023-05-02 北京秒针人工智能科技有限公司 Data classification method, apparatus, computer device and readable storage medium
CN112989801A (en) * 2021-05-11 2021-06-18 华南师范大学 Sequence labeling method, device and equipment
CN112989801B (en) * 2021-05-11 2021-08-13 华南师范大学 Sequence labeling method, device and equipment

Also Published As

Publication number Publication date
CN106547737B (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN111612103B (en) Image description generation method, system and medium combined with abstract semantic representation
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN107563498B (en) Image description method and system based on visual and semantic attention combined strategy
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN111145728B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN107526834B (en) Word2vec improvement method for training correlation factors of united parts of speech and word order
CN111709243B (en) Knowledge extraction method and device based on deep learning
CN110705294A (en) Named entity recognition model training method, named entity recognition method and device
CN109710744B (en) Data matching method, device, equipment and storage medium
JP7291183B2 (en) Methods, apparatus, devices, media, and program products for training models
CN109635124A (en) A kind of remote supervisory Relation extraction method of combination background knowledge
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN106547735A (en) The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN110879938A (en) Text emotion classification method, device, equipment and storage medium
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN113705237A (en) Relation extraction method and device fusing relation phrase knowledge and electronic equipment
CN112489622A (en) Method and system for recognizing voice content of multi-language continuous voice stream
CN114254645A (en) Artificial intelligence auxiliary writing system
CN110245353B (en) Natural language expression method, device, equipment and storage medium
CN113282721A (en) Visual question-answering method based on network structure search
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
CN109670171B (en) Word vector representation learning method based on word pair asymmetric co-occurrence
Fernandes et al. Entropy-guided feature generation for structured learning of Portuguese dependency parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant