CN109086267B - Chinese word segmentation method based on deep learning

Publication number: CN109086267B
Application number: CN201810756452.0A
Other versions: CN109086267A
Authority: CN (China)
Legal status: Active
Other languages: Chinese (zh)
Inventors: 王传栋, 史宇, 李智
Assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN201810756452.0A; granted and published as CN109086267B.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Abstract

The invention discloses a Chinese word segmentation method based on deep learning, which comprises the following steps: mapping Chinese characters to literal (character-level) vectors based on character frequency; refining the literal vectors to extract feature vectors carrying context semantic information and feature vectors carrying literal features; fusing the character-level vectors into distributed word-level representations, feeding the fused candidate word vectors into a deep learning model to compute sentence scores, decoding with a beam search method, and finally selecting a suitable word segmentation result according to the sentence scores. The word segmentation task is thus freed from laborious feature engineering, richer feature information can be extracted to obtain better system performance, and modeling with the complete segmentation history gives the method sequence-level word segmentation capability.

Description

Chinese word segmentation method based on deep learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese word segmentation method based on deep learning.
Background
Under the current big data environment, with the rapid development of Internet of Things data sensing, cloud computing, tri-network convergence and the mobile Internet, the volume of data, especially unstructured text, grows exponentially, and this data is characterized by diverse types, heterogeneity, information fragmentation and low value density. The rapid expansion of data poses great challenges to automatic information processing; how to process massive text efficiently and accurately and extract valuable information has become an important subject of Natural Language Processing (NLP).
In the field of natural language processing, and particularly in Chinese natural language processing, word segmentation is an important benchmark task, and the quality of its results directly affects the final performance of downstream tasks such as machine translation, sentiment analysis, automatic summarization and information retrieval. However, owing to the particularities of Chinese syntax and grammar, directly applying methods designed for English and other languages to Chinese does not achieve the desired effect. Traditional Chinese word segmentation methods fall into two categories, string-matching based and statistics based: string-matching methods scan the sentence according to certain rules and look up candidate words one by one in a dictionary, whereas statistical methods use a statistical language model with unsupervised or semi-supervised learning algorithms to obtain an optimal segmentation. Although such methods have some effect, most of them target specific domains and require heavy manual intervention for feature discovery; this intervention not only creates a complex run-time dependency on dictionaries but also requires researchers to have specialist linguistic knowledge.
Deep learning can automatically learn data representations with deep neural networks, building a unified internal representation with stronger decision-making, insight-discovery and processing capability and forming a unified understanding of the data. It reduces the dimensionality of distributed vectors while preserving semantic information, which greatly shortens training time and improves system performance.
In early deep-learning-based Chinese word segmentation, a simple feed-forward neural network was used to label each character in the training sequence; this approach only captures context information within a fixed window and cannot learn well the association between the current data and what came before it.
A recurrent neural network can automatically learn more complex features by accumulating historical memory and making full use of context, but in practice it suffers from exploding and vanishing gradients, which means it cannot handle long-distance historical memory well.
In view of this, there is a need to provide a method for Chinese word segmentation based on deep learning to solve the above problems.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation method based on deep learning, which has the word segmentation capability of sequence level.
In order to achieve the purpose, the invention adopts the following technical scheme: a Chinese word segmentation method based on deep learning comprises the following steps:
S1, performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on a continuous bag-of-words model and a hierarchical normalization training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into vectors of fixed length, sending the vectors into a deep learning model, and refining and updating the literal vectors in the dictionary V to obtain feature vectors carrying context semantics and vectors containing literal features;
and S3, for each training sentence, when training character by character, segmenting all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation by using a beam search method.
As a further improved technical solution of the present invention, step S1 specifically includes:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector;
S12, constructing over the dictionary V a complete Huffman tree based on character frequency, in which the characters of the dictionary V are all located at leaf nodes of the tree, and establishing a fast index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the concept of hierarchical normalization, and constructing the system objective function ℒ:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω (with j-th bit d_j^ω), x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried by the current branch node when computing with the context literal vector mean x_ω;
S14, defining the traversal of the path nodes as an iteration cycle; within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ and accumulates the semantic influence factor e := e + ∂ℒ/∂x_ω; after an iteration cycle is finished, each context literal vector v(ũ) within the window of the target word ω is updated as:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

where μ represents the learning rate.
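By way of illustration only, the character-vector initialization of steps S11 to S14 corresponds to CBOW training with hierarchical softmax over a frequency-based Huffman tree; a minimal sketch with an off-the-shelf toolkit is given below, where the use of the gensim library, the toy corpus and the parameter values are assumptions and not part of the disclosed method.

```python
# Sketch of step S1 under stated assumptions: characters are treated as tokens and
# trained with the continuous bag-of-words model (sg=0) and hierarchical softmax (hs=1),
# which internally builds the frequency-based Huffman tree described in S12.
from gensim.models import Word2Vec

# Hypothetical corpus D: each sentence is a list of single characters.
corpus_D = [list("南京市长江大桥"), list("今天天气很好")]

model = Word2Vec(
    sentences=corpus_D,
    vector_size=50,   # dimension d; 50 is the value used later in the embodiment
    window=5,         # dynamic context window
    min_count=1,
    sg=0,             # CBOW
    hs=1,             # hierarchical softmax ("hierarchical normalization")
    negative=0,
)

# Dictionary V: character -> initial literal vector, indexed for fast lookup.
V = {ch: model.wv[ch] for ch in model.wv.index_to_key}
```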
As a further improved technical solution of the present invention, step S2 specifically includes:
S21, for the current sequence time t, executing a table lookup from the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, for the window context of the target word ω, taking out according to the index the initial vectors of the context characters within the window of the target word ω, 1 ≤ l ≤ w, where w represents the window width, and expressing the context character vectors within the window as one combined vector c_t by using the threshold combination neural network method;
S23, for the current sequence time t, using the threshold combination neural network method, in the manner of step S22, to compute the selective historical output of the hidden states of the previous times 1 to t−1, and simultaneously computing the selective future output of the hidden states of times t+1 to n;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector c_t as input and feeding them into the deep learning model, generating the historical feature output h_t^fwd and the future feature output h_t^bwd;
S25, at the current sequence time t, linearly combining the historical feature output h_t^fwd and the future feature output h_t^bwd into h_t = [h_t^fwd ; h_t^bwd] and generating the network output o_t with the tanh activation function:

o_t = tanh(W^(o)·h_t + b^(o))

the network output o_t is updated into the dictionary V as the refined literal vector fused with context semantics, wherein W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at each current sequence time;
S26, using the training set β sentence by sentence and character by character, obtaining by table lookup the refined literal vectors fused with context semantics, constructing the sentence matrix expression as the observation state matrix, developing iterative training with the Viterbi algorithm, defining the sentence scoring formula, and determining the optimal labelling sequence:

score(x, y) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

wherein A_{y_i,y_{i+1}} is the state transition matrix and o_{i,y_i} is the network output score of the i-th character for label y_i; the output label sequence uses the {B, I, E, S} labelling rule set, wherein B represents the first character of a word, I represents a middle character of a word, E represents the last character of a word, and S represents a single-character word, and the {B, I, E, S} rule set is combined with the part-of-speech labels to obtain the optimal label transition matrix over the character sequence.
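The sentence scoring of step S26 can be illustrated with a small sketch; the emission/transition decomposition shown below and the function and variable names are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def sentence_score(emissions: np.ndarray, transitions: np.ndarray, tags: list[int]) -> float:
    """Score of one tag sequence: sum of emission scores o_{i,y_i}
    plus transition scores A_{y_i,y_{i+1}} between adjacent tags."""
    score = sum(emissions[i, y] for i, y in enumerate(tags))
    score += sum(transitions[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return float(score)

# Toy example with the 4 tags {B, I, E, S} and a 3-character sentence.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(3, 4))     # observation state matrix from the network outputs
transitions = rng.normal(size=(4, 4))   # state transition matrix A
print(sentence_score(emissions, transitions, [0, 1, 2]))  # tag sequence B I E
```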
As a further improved technical scheme of the invention, the threshold combination neural network method comprises the following steps:
Step 1, defining the w character vectors to be combined as v_1, v_2 … v_w, wherein v_1, v_2 … v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l and computing the probability of combined memory through the reset gate r_l:

r_l = σ(W^(r)·v_l + b^(r))

wherein 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gates r_l to compute the semantic feature ṽ that results from aggregating the reset-gated character vectors r_l ⊙ v_l (1 ≤ l ≤ w) into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each character vector v_1, v_2 … v_w and the semantic feature ṽ, wherein a factor matrix W^(z) ∈ R^{d×d} is used as a shared parameter;
Step 4, using the update gates z_l to subject the character vectors v_1, v_2 … v_w and the semantic feature ṽ to selective mixing and combination processing, aggregating them into a word and obtaining the fixed-length vector v_w, wherein v_w ∈ R^d:

v_w = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l for 1 ≤ l ≤ w and the semantic feature ṽ for l = w+1.
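A minimal NumPy sketch of the threshold combination method is given below, assuming that the semantic feature is formed from the reset-gated sum of the character vectors and that the update gates are normalized over the w+1 candidate vectors; the parameter names and initialization are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(chars: np.ndarray, W_r, b_r, W_l, b_l, W_z) -> np.ndarray:
    """Combine w character vectors (shape (w, d)) into one d-dimensional vector.
    Reset gates weight each character, a tanh layer forms the aggregate semantic
    feature, and dimension-wise normalized update gates mix the w characters and
    the aggregate feature into the final fixed-length vector."""
    w, d = chars.shape
    r = sigmoid(chars @ W_r.T + b_r)                          # reset gates r_l, shape (w, d)
    v_hat = np.tanh((r * chars).sum(axis=0) @ W_l.T + b_l)    # aggregate semantic feature
    candidates = np.vstack([chars, v_hat])                    # v_1..v_w plus the aggregate, (w+1, d)
    scores = np.exp(candidates @ W_z.T)                       # unnormalized update-gate scores
    z = scores / scores.sum(axis=0, keepdims=True)            # normalize over the w+1 candidates
    return (z * candidates).sum(axis=0)                       # fused vector in R^d

d = 8
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=s) for s in [(d, d), (d,), (d, d), (d,), (d, d)]]
word_vec = gated_combine(rng.normal(size=(3, d)), *params)
print(word_vec.shape)  # (8,)
```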
As a further improved technical solution of the present invention, step S22 specifically includes:
S22.1, defining a reset gate r_l and computing the probability of influence on the target word ω:

r_l = σ(W^(r)·v_l^c + b^(r))

wherein 1 ≤ l ≤ w, v_l^c is the initial vector of the l-th context character within the window of the target word ω, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gates r_l to compute the semantic feature ṽ_c of the aggregate influence of the character vectors within the window on the target word ω, wherein the v_l^c are the initial vectors of the context within the window of the target word;
S22.3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each context character vector v_1^c, v_2^c … v_w^c and the semantic feature ṽ_c, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gates z_l to fuse the initial context vectors within the window of the target word ω and the semantic feature ṽ_c through selective mixing and combination processing, aggregating them into the combined vector c_t:

c_t = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l^c for 1 ≤ l ≤ w and ṽ_c for l = w+1.
As a further improved technical solution of the present invention, in step S24, after the input is fed into the deep learning model, the following steps are performed:
Step A1, using the threshold combination neural network method to perform the combined calculation of the character sequence in the window context of the current input v_t, obtaining the combined vector of the window context, denoted c_t;
Step A2, using the threshold combination neural network method to perform the combined calculation of all historical hidden-state outputs before the current sequence time t, obtaining the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time;
Step A4, at each current sequence time t of the sequence traversal, for the current input v_t, defining an update gate z_t that computes the update probability generated by the influence of the window-context combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time;
Step A5, at each current sequence time t of the sequence traversal, for the current input v_t, using the reset gate r_t to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time;
Step A6, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output.
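The exact gate equations of steps A3 to A6 are not reproduced above; the following sketch assumes a GRU-style interpolation in which the reset gate limits how much history enters the candidate ("energy") value and the update gate mixes the old hidden state with that candidate. All parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_step(v_t, c_t, h_prev, P):
    """One step in the spirit of steps A3-A6: a reset gate limits how much history
    enters the candidate value, an update gate interpolates between the old hidden
    state and the candidate. P is a dict of illustrative d x d matrices and biases."""
    r_t = sigmoid(P["Wr_h"] @ h_prev + P["Wr_v"] @ v_t + P["br"])          # step A3
    z_t = sigmoid(P["Wz_h"] @ h_prev + P["Wz_c"] @ c_t + P["bz"])          # step A4
    e_t = np.tanh(P["Wc_h"] @ (r_t * h_prev) + P["Wc_c"] @ c_t
                  + P["Wc_v"] @ v_t + P["bc"])                              # step A5
    h_t = z_t * h_prev + (1.0 - z_t) * e_t                                  # step A6
    return h_t

d = 8
rng = np.random.default_rng(2)
P = {k: rng.normal(scale=0.1, size=(d, d)) for k in
     ["Wr_h", "Wr_v", "Wz_h", "Wz_c", "Wc_h", "Wc_c", "Wc_v"]}
P.update({k: np.zeros(d) for k in ["br", "bz", "bc"]})
h = recurrent_step(rng.normal(size=d), rng.normal(size=d), np.zeros(d), P)
print(h.shape)  # (8,)
```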
As a further improved technical solution of the present invention, step S3 specifically includes:
S31, at each current sequence time t, segmenting all candidate words ending with the current target character according to the preset maximum word length, and for each candidate word, obtaining by table lookup the feature vector of each character in the candidate word and the corresponding label transition vector, which are linearly combined into the character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, using the threshold combination network method to fuse the character vectors v_l contained in the candidate word into one candidate word vector v_w;
S33, computing the inner product of the fused candidate word vector v_w with a shared weight vector parameter u to obtain the word score:

wordScore(w) = u · v_w

S34, sending the candidate word vectors v_w into the deep learning model and obtaining the historical features of the current candidate word by encoding; using a beam search algorithm with a preset beam width k, at each current sequence time t of the forward traversal of the sentence, always recording and storing the k best-scoring historical segmentations, the hidden-state output of the last word of a stored segmentation being h_t;
S35, at each current sequence time t of the sequence traversal, calculating the hidden-state output h_t;
S36, taking the produced h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the prediction value P_{t+1} being:

P_{t+1} = tanh(W^(p)·h_t + b^(p))

wherein W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at each current sequence time;
S37, inputting the candidate word vector v_w into the deep learning model and, based on the output h_t of the deep learning model, computing the prediction P_{t+1} for the next word; the deep learning model can acquire memory information over the whole previous word segmentation history and compute the link score linkScore(y_{t+1}) of a sequence:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}

S38, setting the beam width k, keeping the k highest-scoring results at each step and continuing the operation of new input on the kept segmentations, directly modelling word results with the complete segmentation history; defining the word sequence y[1:m] generated by the prediction of the deep learning model, the score function of the segmented word sequence is constructed as:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39, for a given character sequence x_i whose segmentation is denoted y_i, defining a structured margin loss for the predicted segmented sentences so as to construct the loss function and update the parameters backwards.
As a further improved technical solution of the present invention, step S35 specifically includes:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the currently input candidate word vector v_w, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.2, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, defining an update gate z_t that computes the update probability generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.3, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, using the reset gate r_t to strengthen the energy value generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.4, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the currently input candidate word vector v_w under the influence of the historical hidden-state output.
As a further improved technical solution of the present invention, step S39 specifically includes the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39.2, training with the maximum-margin method: for a given training sentence sequence sen[1:n], the correct segmented word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_t 1{ŷ_t^(i) ≠ y_t^(i)}

wherein μ is an attenuation parameter;
S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backwards through the loss function:

J(θ) = (1/|β|) · Σ_i l_i(θ) + (λ/2)·‖θ‖²

wherein:

l_i(θ) = max(0, score(ŷ^(i), sen^(i)) + Δ(y^(i), ŷ^(i)) − score(y^(i), sen^(i)))
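A minimal sketch of the loss of step S39 is given below, assuming a Hamming-style structured margin (number of wrongly labelled positions scaled by the attenuation parameter μ) and an L2 regularization term; the helper names and the regularization coefficient are illustrative.

```python
import numpy as np

def margin_loss(score_gold: float, score_pred: float,
                gold_tags: list, pred_tags: list, mu: float = 0.2) -> float:
    """Structured margin loss for one sentence: the predicted segmentation must
    trail the gold segmentation by a margin that grows with the number of
    wrongly labelled positions (mu is the attenuation parameter)."""
    delta = mu * sum(p != g for p, g in zip(pred_tags, gold_tags))
    return max(0.0, score_pred + delta - score_gold)

def training_loss(examples, theta_norm_sq: float, lam: float = 1e-4) -> float:
    """Average margin loss over the training set beta plus a 2-norm term."""
    hinge = np.mean([margin_loss(*ex) for ex in examples])
    return hinge + 0.5 * lam * theta_norm_sq

# Toy usage with two sentences (scores and tag sequences are made up):
examples = [(5.0, 4.2, [0, 1, 2], [0, 3, 2]), (3.1, 3.0, [3, 3], [3, 3])]
print(training_loss(examples, theta_norm_sq=10.0))
```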
the beneficial effects of the invention are: the method comprises the steps of initializing a word face vector of each training word, capturing historical characteristics, future characteristics and word characteristics carried by the training words by using a deep learning model, refining a distributed vector based on a threshold combination neural network method to represent corresponding candidate words, reformulating Chinese word segmentation into a direct segmentation learning task, directly evaluating the relative possibility of different word segmentation sentences, and then searching a word segmentation sequence with the highest score, thereby obtaining the word segmentation capability of more sequence levels.
Drawings
FIG. 1 is a flow chart of the deep learning-based Chinese word segmentation method of the present invention.
Fig. 2 is a schematic diagram of a threshold combining neural network method according to the present invention.
FIG. 3 is a schematic diagram of the improved LSTM model according to the present invention.
FIG. 4 is a diagram of the radial refinement architecture of a CRF layer in accordance with the present invention.
Fig. 5 is a schematic diagram of the beam search algorithm based on dynamic programming according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and detailed description.
As shown in fig. 1, a method for Chinese word segmentation based on deep learning includes the following steps:
S1, performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words (CBOW) model and the hierarchical normalization (HS) training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into vectors of fixed length and sending them into a deep learning model, wherein an improved LSTM model is selected as the deep learning model, and refining and updating the literal vectors in the dictionary V to obtain feature vectors carrying context semantics and vectors containing literal features;
and S3, for each training sentence, when training character by character, segmenting all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation by using a beam search method.
Wherein, step S1 specifically includes the following steps:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector;
S12, constructing over the dictionary V a complete Huffman tree based on character frequency, in which the characters of the dictionary V are all located at leaf nodes of the tree, and establishing a fast index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the idea of hierarchical normalization, and constructing the system objective function ℒ:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω (with j-th bit d_j^ω), x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried by the current branch node when computing with the context literal vector mean x_ω;
S14, defining the traversal of the path nodes as an iteration cycle; within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ and accumulates the semantic influence factor e := e + ∂ℒ/∂x_ω; after an iteration cycle is finished, each context literal vector v(ũ) within the window of the target word ω is updated as:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

wherein μ represents the learning rate, which is adjusted appropriately according to the rate of gradient change during training.
Wherein, step S13 specifically includes: inputting the corpora in the corpus set D sentence by sentence, traversing the training sentence S in order with a dynamic variable window, taking the center of the window as the target word ω and the other characters in the window as its context Context(ω); for each training sample (ω | Context(ω)), pre-computing in the Huffman tree the path l_ω from the root node to the target word ω and the code d_ω; training takes one traversal of the path nodes as an iteration cycle, takes the mean x_ω of the context literal vectors within the window of the target word ω as input, uses the sigmoid activation function

σ(x_ω·θ) = 1 / (1 + exp(−x_ω·θ))

overlaps the parameter θ by the gradient descent method, and computes the semantic influence of Context(ω) on the target word ω.
In the traversal, each branch node on the path is regarded as an implicit binary classifier; by judging whether each component of the Huffman code d_j^ω is 1 (left subtree node) or 0 (right subtree node), the influence factor on the semantics of the target word ω is computed for the context vector mean x_ω under the action of the parameter value θ_{j−1}^ω carried by the current branch node, and the system objective function ℒ given above is obtained, wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω, and x_ω is the mean of the context literal vectors within the window of the target word ω.
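The per-sample update of steps S13 and S14 can be sketched as the standard hierarchical-softmax CBOW update that the notation above appears to describe; the function below is an illustrative NumPy rendering under that assumption, not the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_update(context_vecs, path_params, path_code, mu=0.025):
    """One iteration cycle of steps S13/S14: walk the Huffman path of the target
    word, update each branch-node parameter theta, accumulate the semantic
    influence factor e, then add e to every context literal vector in the window.
    path_code holds the Huffman bits d_j; mu is the learning rate."""
    x = context_vecs.mean(axis=0)          # mean context vector x_omega
    e = np.zeros_like(x)                   # accumulated semantic influence factor
    for theta, d in zip(path_params, path_code):
        q = sigmoid(x @ theta)             # binary decision at the branch node
        g = mu * (1.0 - d - q)             # gradient scale
        e += g * theta                     # accumulate influence on x_omega
        theta += g * x                     # overlap (update) the node parameter
    context_vecs += e                      # update every context literal vector
    return context_vecs

# Toy usage: 4 context characters with d = 50 and a Huffman path of length 3.
ctx = np.random.default_rng(0).normal(size=(4, 50))
params = [np.zeros(50) for _ in range(3)]
cbow_hs_update(ctx, params, path_code=[1, 0, 1])
```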
Step S2 specifically includes the following steps:
S21, for the current sequence time t, executing a table lookup from the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, for the window context of the target word ω, taking out according to the index the initial vectors of the context characters within the window of the target word ω, 1 ≤ l ≤ w, where w represents the window width, and expressing the context character vectors within the window as one combined vector c_t by using the threshold combination neural network method;
S23, for the current sequence time t, using the threshold combination neural network method, in the manner of step S22, to compute the selective historical output of the hidden states of the previous times 1 to t−1 as the input of the forward improved LSTM, and simultaneously computing the selective future output of the hidden states of times t+1 to n as the input of the backward improved LSTM;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector c_t as input and sending them into the Bi-LSTM architecture constructed from the improved LSTM model for training, generating the historical feature output h_t^fwd and the future feature output h_t^bwd;
S25, at the current sequence time t, linearly combining the historical feature output h_t^fwd and the future feature output h_t^bwd into h_t = [h_t^fwd ; h_t^bwd] and generating the network output o_t with the tanh activation function:

o_t = tanh(W^(o)·h_t + b^(o))

the network output o_t is updated into the dictionary V as the refined literal vector fused with context semantics, wherein W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at each current sequence time;
S26, using the training set β sentence by sentence and character by character, obtaining by table lookup the refined literal vectors fused with context semantics and sending them into a CRF layer to construct the sentence matrix expression as the observation state matrix, developing iterative training with the Viterbi algorithm, defining the sentence scoring formula, and determining the optimal labelling sequence:

score(x, y) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

wherein A_{y_i,y_{i+1}} is the state transition matrix;
the per-position label selection in the output label sequence uses the {B, I, E, S} labelling rule set, where B denotes the first character of a word, I a middle character of a word, E the last character of a word, and S a single-character word; the {B, I, E, S} rule set is combined with part-of-speech tags (e.g. S-V denotes a single-character verb) to obtain the optimal label transition matrix over the character sequence, and the optimal label selection of the characters in the training sentence is marked with the optimal label transition matrix as the character-feature basis for the subsequent word segmentation.
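For illustration, a {B, I, E, S} label sequence determines the segmentation directly; the small helper below (with an invented example sentence) shows how such labels are read back into words.

```python
def tags_to_words(chars: str, tags: list) -> list:
    """Recover a segmentation from a {B, I, E, S} label sequence:
    B starts a word, I continues it, E closes it, S is a single-character word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "I":
            buf += ch
        else:  # "E"
            words.append(buf + ch)
            buf = ""
    return words

print(tags_to_words("南京市长江大桥", ["B", "I", "E", "B", "I", "I", "E"]))
# ['南京市', '长江大桥']
```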
Wherein, the combined vector c_t in step S22 is obtained through the following specific steps:
S22.1, defining a reset gate r_l and computing the probability of influence on the target word ω:

r_l = σ(W^(r)·v_l^c + b^(r))

wherein 1 ≤ l ≤ w, v_l^c is the initial vector of the l-th context character within the window of the target word ω, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gates r_l to compute the semantic feature ṽ_c of the aggregate influence of the character vectors within the window on the target word ω, wherein the v_l^c are the initial vectors of the context within the window of the target word ω;
S22.3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each context character vector v_1^c, v_2^c … v_w^c and the semantic feature ṽ_c, wherein the v_l^c are the initial vectors of the context within the window of the target word and the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gates z_l to fuse the initial context vectors within the window of the target word ω and the semantic feature ṽ_c through selective mixing and combination processing, aggregating them into the combined vector c_t:

c_t = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l^c for 1 ≤ l ≤ w and ṽ_c for l = w+1.
As shown in fig. 2, the threshold combination neural network method mentioned in steps S22 and S23 includes the following steps:
Step 1, defining the w character vectors to be combined in a character group as v_1, v_2 … v_w, wherein v_1, v_2 … v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l and computing the probability of combined memory through the reset gate r_l:

r_l = σ(W^(r)·v_l + b^(r))

wherein 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gates r_l to compute the semantic feature ṽ that results from aggregating the reset-gated character vectors r_l ⊙ v_l (1 ≤ l ≤ w) into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each character vector v_1, v_2 … v_w and the semantic feature ṽ, wherein a factor matrix W^(z) ∈ R^{d×d} is used as a shared parameter;
Step 4, using the update gates z_l to subject the character vectors v_1, v_2 … v_w and the semantic feature ṽ to selective mixing and combination, obtaining the fixed-length vector v_w, wherein v_w ∈ R^d:

v_w = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l for 1 ≤ l ≤ w and ṽ for l = w+1.
As shown in fig. 3, the improved LSTM model in step S24 is used for capturing historical memory information in the sequential traversal of the sequence labelling problem, and includes the following steps:
Step A1, using the threshold combination neural network method to perform the combined calculation of the character sequence in the window context of the current input v_t, obtaining the combined vector of the window context, denoted c_t;
Step A2, using the threshold combination neural network method to perform the combined calculation of all historical hidden-state outputs before the current sequence time t, obtaining the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time;
Step A4, at each current sequence time t of the sequence traversal, for the current input v_t, defining an update gate z_t that computes the update probability generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time;
Step A5, at each current sequence time t of the sequence traversal, for the current input v_t, using the reset gate r_t to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time;
Step A6, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output.
In addition, in step S24, in the forward improved LSTM process, first a reset gate r_t^fwd is defined to compute the memory probability generated by the historical hidden-state output for the current input, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time.
Then an update gate z_t^fwd is defined to compute the update probability generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time.
Then, for the currently input target word, the reset gate r_t^fwd is used to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time.
Finally, the update gate z_t^fwd mainly determines how much information needs to be forgotten and how much retained, whereby the hidden-state output h_t^fwd of the currently input target word under the influence of the historical hidden state and the window context is computed.
The backward improved LSTM process is similar to the forward process. First a reset gate r_t^bwd is defined to compute the prediction probability generated by the future hidden-state output for the current input, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time.
Then an update gate z_t^bwd is defined to compute the update probability generated by the influence of the combined vector c_t under the action of the incoming future hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time.
Then, for the currently input target word, the reset gate r_t^bwd is used to strengthen the energy value generated by the influence of the combined vector c_t under the action of the future prediction, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time.
Finally, the update gate z_t^bwd mainly determines how much information needs to be forgotten and how much retained, whereby the hidden-state output h_t^bwd of the currently input target word under the influence of the future prediction and the window context is computed.
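A short sketch of how the two directions are merged in step S25 is given below, where "linearly combined" is read as concatenation of the forward and backward outputs followed by a tanh projection with W^(o) ∈ R^{d×2d}; the names and random values are illustrative.

```python
import numpy as np

def refined_vector(h_fwd, h_bwd, W_o, b_o):
    """Concatenate the forward (historical) and backward (future) hidden outputs
    and map them back to a d-dimensional refined literal vector with tanh,
    as in step S25: W_o has shape (d, 2d)."""
    h_t = np.concatenate([h_fwd, h_bwd])      # combination of both directions
    return np.tanh(W_o @ h_t + b_o)

d = 8
rng = np.random.default_rng(3)
o_t = refined_vector(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d))
print(o_t.shape)  # (8,)
```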
Step S3 specifically includes:
S31, at each current sequence time t, segmenting all candidate words ending with the current target character according to the preset maximum word length, and for each candidate word, obtaining by table lookup the context-semantic feature vector of each character in the candidate word and the corresponding label transition vector, which are linearly combined into the character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, using the threshold combination network method to fuse the character vectors v_l contained in the candidate word into one candidate word vector v_w;
S33, computing the inner product of the fused candidate word vector v_w with a shared weight vector parameter u to obtain the word score:

wordScore(w) = u · v_w

S34, sending the candidate word vector v_w into the improved LSTM model and obtaining the historical features of the current candidate word by encoding; using a beam search algorithm with a preset beam width k, at each current sequence time t of the forward traversal of the sentence, always recording and storing the k best-scoring historical segmentations, the hidden-state output of the last word of a stored segmentation being h_t;
S35, at each current sequence time t of the sequence traversal, calculating the hidden-state output h_t;
S36, taking the produced h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the prediction value P_{t+1} being:

P_{t+1} = tanh(W^(p)·h_t + b^(p))

wherein W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at each current sequence time;
S37, inputting the candidate word vector v_w into the improved LSTM model and, based on the output h_t of the improved LSTM model, computing the prediction P_{t+1} for the next word; the improved LSTM model can acquire memory information over the whole previous word segmentation history and compute the link score linkScore(y_{t+1}) of a sequence:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}

S38, setting the beam width k, keeping the k highest-scoring results at each step and continuing the operation of new input on the kept segmentations, directly modelling word results with the complete segmentation history; defining the word sequence y[1:m] predicted and generated by the improved LSTM model, the score function of the segmented word sequence is constructed as:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39, for a given character sequence x_i whose segmentation is denoted y_i, defining a structured margin loss for the predicted segmented sentences.
Wherein, step S35 specifically includes:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the currently input candidate word vector v_w, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.2, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, defining an update gate z_t that computes the update probability generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.3, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, using the reset gate r_t to strengthen the energy value generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.4, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the currently input candidate word vector v_w under the influence of the historical hidden-state output.
Step S39 specifically includes the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39.2, training with the maximum-margin method: for a given training sentence sequence sen[1:n], the correct segmented word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss, with the attenuation parameter μ used to adjust the loss function value, is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_t 1{ŷ_t^(i) ≠ y_t^(i)}

S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backwards through the loss function:

J(θ) = (1/|β|) · Σ_i l_i(θ) + (λ/2)·‖θ‖²

wherein:

l_i(θ) = max(0, score(ŷ^(i), sen^(i)) + Δ(y^(i), ŷ^(i)) − score(y^(i), sen^(i)))
example 1
A Chinese word segmentation method based on deep learning comprises the following steps:
Step 1: performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a basic distributed literal vector based on the CBOW model and the HS training method, and storing the obtained literal vectors into a dictionary V according to indexes.
Step 2: converting the training corpus sentence by sentence into vectors of fixed length and sending them into the improved bidirectional LSTM model; by training the parameters in the bidirectional LSTM model, the character-level literal vectors in the dictionary V are refined and updated to obtain the feature vectors carrying context semantics and the vectors containing literal features.
Step 3: when each training sentence is trained character by character, segmenting within the maximum word length all candidate words ending with the current character using the full-segmentation idea, fusing the refined character-level feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation with the beam search method.
Specifically, in the first step the basic features of each Chinese character are extracted, and a dictionary V is constructed for the characters in the corpus set D by traversing the large-scale corpus set D; the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector representation. A complete Huffman tree is built over the dictionary V based on character frequency, the characters in the dictionary V are all located at leaf nodes of the tree, and a fast indexing and lookup mechanism is established through an auxiliary hash table. The literal vectors are trained using the CBOW model and the HS training method, and the system objective function is constructed as shown below:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

The training corpus in the corpus set D is input sentence by sentence and the training sentence S is traversed in order with a dynamic variable window, the center of the window being the target word ω. Within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ while accumulating the semantic influence factor e := e + ∂ℒ/∂x_ω. After an iteration cycle ends, the literal vector v(ũ) of each context character within the window of the target word ω is updated by:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

wherein μ represents the learning rate, which is adjusted appropriately according to the rate of gradient change during training.
In the second step, each input training sentence is trained character by character; the corresponding initial vector is taken out of the dictionary V according to the index, the vector dimension d being taken as 50 in training; the literal vectors in the corresponding context window are taken out according to the index and combined into one combined vector c_t with the threshold combination neural network, which expresses the effect of the window context on the target word.
The literal vector of the target character and the combined vector c_t of the window context are input into the improved Bi-LSTM model: the input of the forward LSTM is the sequence traversed from left to right, and the input of the backward LSTM is the sequence traversed from right to left. Finally, the outputs of the two hidden-layer units are concatenated as the output of the hidden layer of the whole network.
A label T is output, and the per-position label selection uses the {B, I, E, S} labelling rule set, where B represents the first character of a word, I represents a middle character of a word, E represents the last character of a word, and S represents a single-character word; S is combined with part-of-speech tags, for example S-V represents a single-character verb; there are 13 part-of-speech tags, giving a combined label set of 52 labels. The hidden-layer output is mapped and a nonlinear transformation is applied as the output:
o_i = tanh(w_o · [h_i1, h_i2] + b_o)
The sentence score formula is defined as:

score(sen, y; θ) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

where θ is the set of model parameters, A_{y_i,y_{i+1}} is the transition state matrix and N is the number of characters in the training sentence.
A dynamic programming algorithm is used in the decoding process, and the labelling sequence finally selected is the sequence with the highest computed score:

y* = argmax_{y ∈ Y_X} score(sen, y; θ)

wherein Y_X is the set of all possible labelling sequences. All labels are normalized with softmax and the score is computed as the conditional label-path probability; taking its logarithm gives the conditional log-likelihood of the valid path, and the parameters and the literal vectors are updated by backward training. The network trains the parameters by maximizing the likelihood function of the sentence labels.
The model output is sent to a CRF layer, the optimal labelling sequence is determined with the Viterbi algorithm, the sentence label score is formed from the transition state matrix and the network output, and the parameters and the literal vectors are updated backwards by maximizing the likelihood function of the sentence labels. The trained literal vector is the feature vector carrying context information that needs to be extracted. In the state transition matrix, each row represents the possibility that each possible label of the previous character transfers to a given label of the current character; the transition vector corresponding to the optimal label sequence is extracted as the feature vector carrying literal features.
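The dynamic-programming decoding over the transition matrix and the network outputs can be illustrated with a standard Viterbi routine; the sketch below uses illustrative names and random scores and is not taken from the patent text.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Find the highest-scoring tag sequence for one sentence.
    emissions: (N, T) network outputs o_i; transitions: (T, T) matrix A."""
    N, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((N, T), dtype=int)
    for i in range(1, N):
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)     # best previous tag for each current tag
        score = total.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(N - 1, 0, -1):          # backtrack
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

rng = np.random.default_rng(4)
print(viterbi_decode(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```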
The third step is the word-formation operation. The training sentence set labeled in the second step is traversed character by character, and for each character in a training sentence its vector is extracted by a lookup-table operation. According to a preset maximum word length (for example, 4), all candidate words ending with the current character within that length are enumerated following the idea of forward full segmentation, and for each candidate word the character-level vectors are fused into a word-level distributed representation by the threshold combination neural network (GCNN), where L is the number of characters in the current candidate word.
The fused word-level vector is then taken in inner product with a shared weight-vector parameter u to calculate the score of the current candidate word (wordScore).
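For example, the enumeration of candidate words ending at the current character and their scoring against the shared vector u can be sketched as follows in Python; fuse_gcnn is a simple stand-in (a mean) for the threshold combination network so that the example runs, and all names are illustrative assumptions.

import numpy as np

def fuse_gcnn(char_vectors):
    # stand-in for the threshold combination network: a simple mean,
    # used here only so that the candidate-word scoring below is runnable
    return np.mean(char_vectors, axis=0)

def candidate_word_scores(char_vectors, t, u, max_word_len=4):
    # all candidate words ending at character position t within the maximum
    # word length (forward full segmentation), each scored by u . v_word
    scores = {}
    for length in range(1, min(max_word_len, t + 1) + 1):
        start = t - length + 1
        word_vec = fuse_gcnn(char_vectors[start:t + 1])
        scores[(start, t)] = float(np.dot(u, word_vec))
    return scores

chars = np.random.randn(6, 50)     # six 50-dimensional refined character vectors
u = np.random.randn(50)            # shared weight-vector parameter
print(candidate_word_scores(chars, t=3, u=u))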
The word-level vector is then input into the improved LSTM model, and the historical feature representation of the current candidate word is obtained through encoding.
Because the model performs full segmentation up to the maximum word length using dynamic programming, the number of possible segmentation results grows exponentially with the length of the character sequence, so the scores of all possible results cannot be computed. Moreover, because the score of the current candidate word introduces the segmentation history of that candidate, the traditional Viterbi algorithm is no longer the best choice. To perform better in practical applications, the model instead uses the cluster (beam) search algorithm to find, for each candidate word, the matching hidden-layer output h_{j-1} of the previous step, and decodes from there. The detailed procedure of solving the optimization with the cluster search algorithm is shown schematically in FIG. 5.
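As a concrete illustration, here is a minimal sketch of such a cluster (beam) search over candidate words in Python; word_score and link_score are assumed stand-ins for the word score and connection score defined below, and the update of the LSTM history state is deliberately abstracted away, so this is not the claimed algorithm itself.

def beam_search_segment(sentence, word_score, link_score, k=8, max_word_len=4):
    # keep the k best partial segmentations while scanning the sentence;
    # each beam item is (score, list of words, history state)
    beams_at = {0: [(0.0, [], None)]}
    n = len(sentence)
    for end in range(1, n + 1):
        candidates = []
        for length in range(1, min(max_word_len, end) + 1):
            start = end - length
            word = sentence[start:end]
            for score, seg, state in beams_at.get(start, []):
                new_score = score + word_score(word) + link_score(state, word)
                candidates.append((new_score, seg + [word], state))   # state update abstracted away
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams_at[end] = candidates[:k]                                # prune to beam width k
    return beams_at[n][0][1] if beams_at.get(n) else []

print(beam_search_segment("南京市长江大桥",
                          word_score=lambda w: -abs(len(w) - 2),     # toy scorer favouring 2-character words
                          link_score=lambda s, w: 0.0,
                          k=4))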
In this algorithm, every candidate word generated at each time t while traversing the training sentence is integrated with its matching segmentation-history memory through the improved LSTM model, which provides a prediction for the candidate words that may be generated at time t+1:

P_{t+1} = tanh(w_p · h_t + b_p)

from which the connection score at time t+1 is computed:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}
With a preset beam (bundling) width k, the k highest-scoring partial segmentations are always kept as the search moves forward one step at a time. In this process the complete segmentation history is used effectively and possible future segmentations are also predicted, giving the model sentence-level discrimination ability. Assuming that for a given training sentence sequence sen[1:n] the model predicts a word sequence y[1:m], the score function of the segmented word sequence is:

score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t))
Training uses the max-margin criterion. For a given training sentence sequence sen[1:n], the correct word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i). The structured margin loss is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_{t} 1{ŷ_t^(i) ≠ y_t^(i)}

that is, μ times the number of predicted words that do not match the correct segmentation.
where μ is an attenuation parameter that adjusts the magnitude of the loss and plays a smoothing role. Given a training set β, a 2-norm regularization term is added to the loss function, and the parameters are updated backward through this loss:

J(θ) = (1/|β|) Σ_{i∈β} l_i(θ) + (λ/2) · ||θ||²

where:

l_i(θ) = max_{ŷ} ( score(ŷ; θ) + Δ(y^(i), ŷ) − score(y^(i); θ) )
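A minimal Python sketch of such a regularized max-margin objective is given below; the way the margin counts mismatched words, and the names margin_loss and regularized_objective, are assumptions made for the example.

import numpy as np

def margin_loss(gold_words, pred_words, gold_score, pred_score, mu=0.2):
    # hinge loss with a structured margin that grows with the number of
    # predicted words that are not in the gold segmentation
    delta = mu * sum(1 for w in pred_words if w not in gold_words)
    return max(0.0, pred_score + delta - gold_score)

def regularized_objective(losses, params, lam=1e-4):
    # average structured loss over the training set plus an L2 (2-norm) term
    l2 = 0.5 * lam * sum(float(np.sum(p ** 2)) for p in params)
    return float(np.mean(losses)) + l2

print(margin_loss(["南京", "市长"], ["南京市", "长"], gold_score=3.1, pred_score=2.8))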
the Adagard algorithm is adopted for model optimization, and a dropout regularization technology is used for preventing overfitting.
In the present invention, FIG. 4 is a character-level Bi-LSTM-CRF vector refinement architecture diagram; FIG. 5 is a schematic diagram of the cluster (beam) search algorithm based on dynamic programming.
In summary, the invention has the following advantages:
(1) more abstract, essential features of the data can be captured without manual feature engineering or excessive prior knowledge, and the text is represented by distributed vectors;
(2) the improved Bi-LSTM model is used during literal-vector refinement, so that each character-level vector obtains information from both the preceding and the following text, and the attention-based method acquires long-range context information more effectively;
(3) lexeme (label) features of each character are acquired during literal-vector refinement, and using this character-level information to assist segmentation improves its accuracy;
(4) semantic information of the characters in the window around the target character is added while training the refined literal vector, which strengthens the extraction of the target character's vector features;
(5) in the word-vector fusion method based on the gated combination neural network (GCNN), a shared parameter matrix is used during computation, which reduces the dimensionality of the parameter matrices to be trained and improves computational efficiency, while the gating mechanism fuses the character vectors more reasonably;
(6) the cluster (beam) search algorithm keeps the k highest-scoring results at each step and lets newly segmented candidate words continue to operate on the retained segmentations; compared with the Viterbi algorithm this saves considerable decoding time and allows the segmentation result to be modeled directly with the complete segmentation history.
The above embodiments are intended only to illustrate the invention, not to limit the technical solutions it describes. Although this specification has described the invention in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions may still be made to it, and all technical solutions and their modifications that do not depart from the spirit and scope of the invention should be covered by the claims of the invention.

Claims (8)

1. A Chinese word segmentation method based on deep learning is characterized by comprising the following steps:
S1, performing word frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on a continuous bag-of-words model and a hierarchical normalization training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into fixed-length vectors and sending them into a deep learning model, whereby the literal vectors in the dictionary V are refined and updated, and a feature vector carrying context semantics and a vector containing lexeme features are obtained;
S3, for each training sentence, during character-by-character training, dividing out all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into the word vector of each candidate word, connecting the candidate words incrementally with the previous segmentation history, and performing dynamic word segmentation using the cluster searching method;
wherein, step S1 specifically includes:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for the training corpus, the character face, the character frequency and the corresponding character embedding vector;
S12, constructing from the dictionary V a complete Huffman tree based on character frequency, wherein the characters in the dictionary V are all located at the leaf nodes of the tree, and establishing a quick index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the idea of hierarchical normalization, and constructing the system objective function

L = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω) · log σ(x_ω · θ_{j−1}^ω) + d_j^ω · log σ(−x_ω · θ_{j−1}^ω) ]

wherein the target word ω is the center of the window, l_ω is the length of the path from the root node to the target word ω, d_j^ω is the code of the j-th node on the path from the root node to the target word ω, x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried at the current branch point when computing with the mean context literal vector x_ω;
S14, defining the traversal of the path nodes as one iteration cycle; within the cycle, training proceeds with the gradient

g = 1 − d_j^ω − σ(x_ω · θ_{j−1}^ω)

by which the parameter θ_{j−1}^ω is updated in superposition, θ_{j−1}^ω := θ_{j−1}^ω + μ · g · x_ω, and the semantic influence factor e is accumulated, e := e + g · θ_{j−1}^ω; after an iteration cycle is finished, each context literal vector v(c) within the window of the target word ω is updated as:

v(c) := v(c) + μ · e

where μ represents the learning rate.
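For illustration, one plausible reading of steps S13–S14 as a single CBOW / hierarchical-softmax training step is sketched below in Python; the function name cbow_hs_step and all variable names are assumptions, and the update rule follows the standard continuous bag-of-words formulation rather than being a verbatim transcription of the claim.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(context_vecs, path_nodes, path_codes, node_params, lr=0.025):
    # one CBOW / hierarchical-softmax step for a single target character:
    # context_vecs - d-dimensional literal vectors of the window context
    # path_nodes   - indices of inner Huffman-tree nodes from root to target
    # path_codes   - code bits d_j along that path
    # node_params  - matrix of inner-node parameter vectors theta
    x = np.mean(context_vecs, axis=0)            # mean of the window context vectors
    e = np.zeros_like(x)                         # accumulated semantic influence factor
    for node, code in zip(path_nodes, path_codes):
        theta = node_params[node]
        g = lr * (1 - code - sigmoid(np.dot(x, theta)))
        e += g * theta                           # accumulate influence on the context
        node_params[node] = theta + g * x        # update the branch-node parameter
    for v in context_vecs:                       # after the cycle, update window vectors
        v += e
    return e

d = 50
context = [np.random.randn(d) for _ in range(4)]
params = np.random.randn(10, d)
cbow_hs_step(context, path_nodes=[0, 3, 7], path_codes=[1, 0, 1], node_params=params)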
2. The method for Chinese word segmentation based on deep learning of claim 1, wherein the step S2 specifically comprises:
S21, for the current sequence time t, performing a lookup-table operation on the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, extracting according to the index the initial vectors of the context within the window of the target word ω, where w represents the window width, and expressing the environment word vectors within the window as one combined vector using the threshold combined neural network method;
S23, for the current sequence time t, using the threshold combined neural network method of step S22 to compute the selective historical output of the hidden states at the previous times 1 to t−1, and simultaneously computing the selective future output of the hidden states at times t+1 to n;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector as input and feeding them into the deep learning model to generate the historical feature output and the future feature output, respectively;
S25, at the current sequence time t, linearly combining the historical feature output and the future feature output to form h_t, generating the network output with the tanh activation function, and updating this network output back into the dictionary V to obtain the refined literal vector fused with context semantics, wherein the weight matrix W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at every current sequence time;
S26, using the training set β, obtaining the refined literal vectors fused with context semantics by sentence-by-sentence, character-by-character table lookup to construct the sentence matrix expression as the state matrix, carrying out iterative training with the Viterbi algorithm, and defining the sentence scoring formula

s(X, y; θ) = Σ_{i=1}^{N} (A_{y_i, y_{i+1}} + o_{i, y_i})

to determine the optimal labeling sequence, wherein A_{y_i, y_{i+1}} is the state transition matrix; the output label sequence selection uses the {BIES} labeling rule set, in which B represents the first character of a word, I represents a character in the middle of a word, E represents the last character of a word, and S represents a single-character word; the {BIES} rule set is combined with the part-of-speech tags to obtain the optimal label transition matrix over the character sequence.
3. The deep learning-based Chinese word segmentation method of claim 2, wherein the threshold-combined neural network method comprises the steps of:
Step 1, defining the w character vectors to be combined as v_1, v_2, ..., v_w, where v_1, v_2, ..., v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l, which calculates the probability of combined memory as:
r_l = σ(W^(r) · v_l + b^(r))
where 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gate r_l to calculate the semantic feature that results from aggregating the character vectors v_1, v_2, ..., v_w into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining an update gate z_l (1 ≤ l ≤ w+1) as a d-dimensional normalized vector that expresses the update probability for fusing each character vector v_1, v_2, ..., v_w and the semantic feature, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
Step 4, using the update gate z_l to selectively mix and combine the character vectors v_1, v_2, ..., v_w and the semantic feature, obtaining the fixed-length vector v_w, where v_w ∈ R^d and 1 ≤ l ≤ w+1.
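A plausible NumPy sketch of the reset-gate / update-gate combination described in claim 3 follows; the elementwise softmax normalization of the w+1 update gates and the exact form of the semantic feature are assumptions consistent with the stated dimensions, not formulas taken from the claim.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(char_vecs, W_r, b_r, W_l, b_l, W_z):
    # combine w character vectors (each in R^d) into one d-dimensional vector
    resets = [sigmoid(W_r @ v + b_r) for v in char_vecs]          # reset gates (step 1)
    semantic = np.tanh(W_l @ sum(r * v for r, v in zip(resets, char_vecs)) + b_l)  # step 2 (assumed form)
    inputs = char_vecs + [semantic]                               # w vectors plus the semantic feature
    logits = np.stack([W_z @ u for u in inputs])                  # (w + 1, d)
    gates = np.exp(logits) / np.exp(logits).sum(axis=0)           # elementwise normalization (step 3, assumed)
    return sum(z * u for z, u in zip(gates, inputs))              # fused fixed-length vector (step 4)

d = 50
chars = [np.random.randn(d) for _ in range(3)]
rand = lambda: np.random.randn(d, d) * 0.1
print(gated_combine(chars, rand(), np.zeros(d), rand(), np.zeros(d), rand()).shape)   # (50,)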
4. The deep learning-based Chinese word segmentation method of claim 3, wherein the step S22 specifically comprises:
S22.1, defining a reset gate r_l and calculating the influence probability on the target word ω:
r_l = σ(W^(r) · v_l + b^(r))
wherein 1 ≤ l ≤ w, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gate r_l to calculate the semantic feature of the aggregate influence of the character vectors within the window on the target word ω, wherein the inputs are the initial vectors of the context within the window of the target word;
S22.3, defining an update gate z_l (1 ≤ l ≤ w+1) as a d-dimensional normalized vector that expresses the update probability for fusing each character vector v_1, v_2, ..., v_w and the semantic feature, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gate z_l to selectively mix and combine the initial vectors of the context within the window of the target word ω and the semantic feature, obtaining the combined vector by aggregation, wherein 1 ≤ l ≤ w.
5. The method for Chinese word segmentation based on deep learning of claim 2, wherein feeding the inputs into the deep learning model in step S24 comprises the following steps:
Step A1, using the threshold combination neural network method, performing the combination calculation over the character sequence in the window-context environment of the current input v_t, to obtain the combined vector of the window context;
Step A2, using the threshold combination neural network method, performing the combination calculation over all historical hidden-state outputs before the current sequence time t, to obtain the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that calculates the memory probability of the historical hidden-state output with respect to the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A4, at each current sequence time t of the sequence traversal, defining for the current input v_t an update gate z_t that calculates the update probability produced by the combined vector of the window context under the influence of the historical hidden-state output, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A5, at each current sequence time t of the sequence traversal, using the reset gate r_t to strengthen the energy value produced by the combined vector for the current input v_t under the influence of the historical hidden-state output, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A6, at each current sequence time t of the sequence traversal, calculating through the update gate z_t the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output, where z_t is the update gate.
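Steps A3–A6 describe a gated recurrent update; one possible GRU-style reading is sketched below in Python. The exact operands of each gate are not fully specified by the claim text, so the formulas here are illustrative assumptions only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(v_t, c_t, h_hist, params):
    # one update at sequence time t:
    # v_t    - current input vector
    # c_t    - combined vector of the window context (step A1)
    # h_hist - aggregated historical hidden-state output (step A2)
    W_r, W_z, W_h, b = params
    r_t = sigmoid(W_r @ v_t + b)                       # memory probability of the history (A3, assumed form)
    z_t = sigmoid(W_z @ v_t + b)                       # update probability (A4, assumed form)
    cand = np.tanh(W_h @ (r_t * h_hist + c_t) + b)     # candidate "energy" value (A5, assumed form)
    return z_t * h_hist + (1.0 - z_t) * cand           # hidden-state output h_t (A6, assumed form)

d = 50
params = tuple(np.random.randn(d, d) * 0.1 for _ in range(3)) + (np.zeros(d),)
print(gated_step(np.random.randn(d), np.random.randn(d), np.zeros(d), params).shape)   # (50,)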
6. The method for Chinese word segmentation based on deep learning of claim 5, wherein the step S3 specifically comprises:
S31, at each current sequence time t, segmenting out all candidate words ending with the current target character according to the preset maximum word length; for each candidate word, obtaining by table lookup the feature vector of each character in the candidate word and the corresponding label-transfer vector, and linearly combining them into a character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, fusing the character vectors v_l contained in the candidate word into one candidate word vector using the threshold combination network method;
S33, taking the inner product of the fused candidate word vector with the shared weight-vector parameter u to obtain the word score wordScore;
S34, sending the candidate word vector into the deep learning model and obtaining the historical feature of the current candidate word through encoding; using the cluster searching algorithm with a preset beam (bundling) width k, at each current sequence time t of the forward traversal of the sentence, always recording and keeping the k historical segmentations with the better scores, the hidden-state output of the final word of a kept segmentation being h_t;
S35, calculating the hidden-state output h_t at each current sequence time t of the sequence traversal;
S36, taking the generated h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the predicted value P_{t+1} being:
P_{t+1} = tanh(W^(p) · h_t + b^(p))
wherein the weight matrix W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at every current sequence time;
S37, inputting the candidate word vector into the deep learning model and, based on the output h_t of the deep learning model, which carries the memory of the entire preceding segmentation history, calculating the prediction P_{t+1} of the next word and the connection score of the sequence:
linkScore(y_{t+1}) = P_{t+1} · y_{t+1};
S38, setting the beam (bundle) width k, keeping the k highest-scoring results at each step and letting each new input continue to operate on the kept segmentations, so that the word result is modeled directly with the complete segmentation history; the word sequence y[1:m] generated by the prediction of the deep learning model is defined, and the score function of the segmented word sequence is constructed as:
score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t));
S39, for a given character sequence x^(i), its correct segmentation is denoted y^(i); a structured margin loss over the predicted segmented sentences is defined to construct the loss function, and the parameters are updated in the backward pass.
7. The method for Chinese word segmentation based on deep learning of claim 6, wherein the step S35 specifically comprises:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t and calculating the memory probability produced by the historical hidden-state output for the currently input candidate word vector, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.2, at each current sequence time t of the sequence traversal, defining for the currently input candidate word vector an update gate z_t and calculating the update probability produced under the influence of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.3, at each current sequence time t of the sequence traversal, using the reset gate r_t to strengthen the energy value produced by the currently input candidate word vector under the influence of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.4, at each current sequence time t of the sequence traversal, calculating through the update gate z_t the hidden-state output h_t of the currently input candidate word vector under the influence of the received historical hidden-state output.
8. The method for Chinese word segmentation based on deep learning of claim 6, wherein the step S39 specifically comprises the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:
score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t));
S39.2, training with the max-margin method: for a given training sentence sequence sen[1:n], the correct word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss is defined as:
Δ(y^(i), ŷ^(i)) = μ · Σ_{t} 1{ŷ_t^(i) ≠ y_t^(i)}
wherein μ is an attenuation parameter;
S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backward through the loss function:
J(θ) = (1/|β|) Σ_{i∈β} l_i(θ) + (λ/2) · ||θ||²
wherein:
l_i(θ) = max_{ŷ} ( score(ŷ; θ) + Δ(y^(i), ŷ) − score(y^(i); θ) ).
CN201810756452.0A 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning Active CN109086267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810756452.0A CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810756452.0A CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN109086267A CN109086267A (en) 2018-12-25
CN109086267B true CN109086267B (en) 2022-07-26

Family

ID=64837409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810756452.0A Active CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN109086267B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284358B (en) * 2018-09-05 2020-08-28 普信恒业科技发展(北京)有限公司 Chinese address noun hierarchical method and device
CN109543764B (en) * 2018-11-28 2023-06-16 安徽省公共气象服务中心 Early warning information validity detection method and detection system based on intelligent semantic perception
CN110059188B (en) * 2019-04-11 2022-06-21 四川黑马数码科技有限公司 Chinese emotion analysis method based on bidirectional time convolution network
CN110222329B (en) * 2019-04-22 2023-11-24 平安科技(深圳)有限公司 Chinese word segmentation method and device based on deep learning
CN110334338B (en) * 2019-04-29 2023-09-19 北京小米移动软件有限公司 Word segmentation method, device and equipment
CN110263320B (en) * 2019-05-05 2020-12-11 清华大学 Unsupervised Chinese word segmentation method based on special corpus word vectors
CN110287961B (en) * 2019-05-06 2024-04-09 平安科技(深圳)有限公司 Chinese word segmentation method, electronic device and readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110287180B (en) * 2019-06-25 2021-12-28 上海诚数信息科技有限公司 Wind control modeling method based on deep learning
CN110516229B (en) * 2019-07-10 2020-05-05 杭州电子科技大学 Domain-adaptive Chinese word segmentation method based on deep learning
CN110502746B (en) * 2019-07-18 2021-04-09 北京捷通华声科技股份有限公司 Online domain updating decoding method and device, electronic equipment and storage medium
CN110489555B (en) * 2019-08-21 2022-03-08 创新工场(广州)人工智能研究有限公司 Language model pre-training method combined with similar word information
CN111126037B (en) * 2019-12-18 2021-10-29 昆明理工大学 Thai sentence segmentation method based on twin cyclic neural network
CN111160009B (en) * 2019-12-30 2020-12-08 北京理工大学 Sequence feature extraction method based on tree-shaped grid memory neural network
CN111274801A (en) * 2020-02-25 2020-06-12 苏州跃盟信息科技有限公司 Word segmentation method and device
CN111816169B (en) * 2020-07-23 2022-05-13 思必驰科技股份有限公司 Method and device for training Chinese and English hybrid speech recognition model
CN112036183B (en) * 2020-08-31 2024-02-02 湖南星汉数智科技有限公司 Word segmentation method, device, computer device and computer storage medium based on BiLSTM network model and CRF model
CN112150251A (en) * 2020-10-09 2020-12-29 北京明朝万达科技股份有限公司 Article name management method and device
CN113781139A (en) * 2020-10-19 2021-12-10 北京沃东天骏信息技术有限公司 Item recommendation method, item recommendation device, equipment and medium
CN112416931A (en) * 2020-11-18 2021-02-26 脸萌有限公司 Information generation method and device and electronic equipment
CN112559729B (en) * 2020-12-08 2022-06-24 申德周 Document abstract calculation method based on hierarchical multi-dimensional transformer model
CN112860889A (en) * 2021-01-29 2021-05-28 太原理工大学 BERT-based multi-label classification method
CN112905591B (en) * 2021-02-04 2022-08-26 成都信息工程大学 Data table connection sequence selection method based on machine learning
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN114638222B (en) * 2022-05-17 2022-08-16 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof
CN115455987B (en) * 2022-11-14 2023-05-05 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268444B (en) * 2018-01-10 2021-11-02 南京邮电大学 Chinese word segmentation method based on bidirectional LSTM, CNN and CRF

Also Published As

Publication number Publication date
CN109086267A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
CN109086267B (en) Chinese word segmentation method based on deep learning
CN109344391B (en) Multi-feature fusion Chinese news text abstract generation method based on neural network
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN107844469B (en) Text simplification method based on word vector query model
CN109800437B (en) Named entity recognition method based on feature fusion
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
CN111046179B (en) Text classification method for open network question in specific field
CN106569998A (en) Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN107818084B (en) Emotion analysis method fused with comment matching diagram
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN110046223B (en) Film evaluation emotion analysis method based on improved convolutional neural network model
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN111581970B (en) Text recognition method, device and storage medium for network context
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN112163089A (en) Military high-technology text classification method and system fusing named entity recognition
Chen et al. Chinese Weibo sentiment analysis based on character embedding with dual-channel convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant