CN109086267B - Chinese word segmentation method based on deep learning

Publication number: CN109086267B
Application number: CN201810756452.0A
Other versions: CN109086267A
Authority: CN (China)
Legal status: Active
Other languages: Chinese (zh)
Inventors: 王传栋, 史宇, 李智
Assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN201810756452.0A; granted and published as CN109086267B.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Abstract

The invention discloses a Chinese word segmentation method based on deep learning, which comprises the following steps: mapping Chinese characters to literal (character-level) vectors based on character frequency; refining the literal vectors to extract feature vectors carrying context semantic information and feature vectors carrying literal features; fusing the character-level vectors into distributed word-level representations, feeding the fused candidate word vectors into a deep learning model to compute sentence scores, decoding with a beam search method, and finally selecting a suitable word segmentation result according to the sentence scores. The word segmentation task is thus freed from laborious feature engineering, richer feature information can be extracted to obtain better system performance, and modeling with the complete segmentation history gives the method sequence-level word segmentation capability.

Description

Chinese word segmentation method based on deep learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese word segmentation method based on deep learning.
Background
Under the current big data environment, with the rapid development of Internet of Things data sensing, cloud computing, tri-network convergence and the mobile Internet, the volume of data, especially unstructured text, grows exponentially, and this data is characterized by diverse types, heterogeneity, information fragmentation and low value density. The rapid expansion of data poses great challenges to automatic information processing; how to process massive text efficiently and accurately and extract valuable information has become an important subject of Natural Language Processing (NLP).
In the field of natural language processing, and particularly in Chinese natural language processing, word segmentation is an important benchmark task, and the quality of its results directly affects the final performance of downstream tasks such as machine translation, sentiment analysis, automatic summarization and information retrieval. However, owing to the particularities of Chinese syntax and grammar, directly applying methods designed for English and other languages to Chinese does not achieve the desired effect. Traditional Chinese word segmentation methods fall into two categories, string-matching based and statistics based: string-matching methods scan the sentence according to certain rules and look up candidate words one by one in a dictionary, whereas statistical methods use a statistical language model with unsupervised or semi-supervised learning algorithms to obtain an optimal segmentation. Although such methods have some effect, most of them target specific domains and require heavy manual intervention for feature discovery; this intervention not only creates a complex run-time dependency on dictionaries but also requires researchers to have specialist linguistic knowledge.
Deep learning can automatically learn data representations with deep neural networks, building a unified internal representation with stronger decision-making, insight-discovery and processing capability and forming a unified understanding of the data. It reduces the dimensionality of distributed vectors while preserving semantic information, which greatly shortens training time and improves system performance.
In early deep-learning-based Chinese word segmentation, a simple feed-forward neural network was used to label each character in the training sequence; this approach only captures context information within a fixed window and cannot learn well the association between the current data and what came before it.
A recurrent neural network can automatically learn more complex features by accumulating historical memory and making full use of context, but in practice it suffers from exploding and vanishing gradients, which means it cannot handle long-distance historical memory well.
In view of this, there is a need to provide a method for Chinese word segmentation based on deep learning to solve the above problems.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation method based on deep learning, which has the word segmentation capability of sequence level.
In order to achieve the purpose, the invention adopts the following technical scheme: a Chinese word segmentation method based on deep learning comprises the following steps:
S1, performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on a continuous bag-of-words model and a hierarchical normalization training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into vectors of fixed length, sending the vectors into a deep learning model, and refining and updating the literal vectors in the dictionary V to obtain feature vectors carrying context semantics and vectors containing literal features;
and S3, for each training sentence, when training character by character, segmenting all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation by using a beam search method.
As a further improved technical solution of the present invention, step S1 specifically includes:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector;
S12, constructing over the dictionary V a complete Huffman tree based on character frequency, in which the characters of the dictionary V are all located at leaf nodes of the tree, and establishing a fast index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the concept of hierarchical normalization, and constructing the system objective function ℒ:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω (with j-th bit d_j^ω), x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried by the current branch node when computing with the context literal vector mean x_ω;
S14, defining the traversal of the path nodes as an iteration cycle; within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ and accumulates the semantic influence factor e := e + ∂ℒ/∂x_ω; after an iteration cycle is finished, each context literal vector v(ũ) within the window of the target word ω is updated as:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

where μ represents the learning rate.
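By way of illustration only, the character-vector initialization of steps S11 to S14 corresponds to CBOW training with hierarchical softmax over a frequency-based Huffman tree; a minimal sketch with an off-the-shelf toolkit is given below, where the use of the gensim library, the toy corpus and the parameter values are assumptions and not part of the disclosed method.

```python
# Sketch of step S1 under stated assumptions: characters are treated as tokens and
# trained with the continuous bag-of-words model (sg=0) and hierarchical softmax (hs=1),
# which internally builds the frequency-based Huffman tree described in S12.
from gensim.models import Word2Vec

# Hypothetical corpus D: each sentence is a list of single characters.
corpus_D = [list("南京市长江大桥"), list("今天天气很好")]

model = Word2Vec(
    sentences=corpus_D,
    vector_size=50,   # dimension d; 50 is the value used later in the embodiment
    window=5,         # dynamic context window
    min_count=1,
    sg=0,             # CBOW
    hs=1,             # hierarchical softmax ("hierarchical normalization")
    negative=0,
)

# Dictionary V: character -> initial literal vector, indexed for fast lookup.
V = {ch: model.wv[ch] for ch in model.wv.index_to_key}
```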
As a further improved technical solution of the present invention, step S2 specifically includes:
S21, for the current sequence time t, executing a table lookup from the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, for the window context of the target word ω, taking out according to the index the initial vectors of the context characters within the window of the target word ω, 1 ≤ l ≤ w, where w represents the window width, and expressing the context character vectors within the window as one combined vector c_t by using the threshold combination neural network method;
S23, for the current sequence time t, using the threshold combination neural network method, in the manner of step S22, to compute the selective historical output of the hidden states of the previous times 1 to t−1, and simultaneously computing the selective future output of the hidden states of times t+1 to n;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector c_t as input and feeding them into the deep learning model, generating the historical feature output h_t^fwd and the future feature output h_t^bwd;
S25, at the current sequence time t, linearly combining the historical feature output h_t^fwd and the future feature output h_t^bwd into h_t = [h_t^fwd ; h_t^bwd] and generating the network output o_t with the tanh activation function:

o_t = tanh(W^(o)·h_t + b^(o))

the network output o_t is updated into the dictionary V as the refined literal vector fused with context semantics, wherein W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at each current sequence time;
S26, using the training set β sentence by sentence and character by character, obtaining by table lookup the refined literal vectors fused with context semantics, constructing the sentence matrix expression as the observation state matrix, developing iterative training with the Viterbi algorithm, defining the sentence scoring formula, and determining the optimal labelling sequence:

score(x, y) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

wherein A_{y_i,y_{i+1}} is the state transition matrix and o_{i,y_i} is the network output score of the i-th character for label y_i; the output label sequence uses the {B, I, E, S} labelling rule set, wherein B represents the first character of a word, I represents a middle character of a word, E represents the last character of a word, and S represents a single-character word, and the {B, I, E, S} rule set is combined with the part-of-speech labels to obtain the optimal label transition matrix over the character sequence.
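The sentence scoring of step S26 can be illustrated with a small sketch; the emission/transition decomposition shown below and the function and variable names are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def sentence_score(emissions: np.ndarray, transitions: np.ndarray, tags: list[int]) -> float:
    """Score of one tag sequence: sum of emission scores o_{i,y_i}
    plus transition scores A_{y_i,y_{i+1}} between adjacent tags."""
    score = sum(emissions[i, y] for i, y in enumerate(tags))
    score += sum(transitions[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return float(score)

# Toy example with the 4 tags {B, I, E, S} and a 3-character sentence.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(3, 4))     # observation state matrix from the network outputs
transitions = rng.normal(size=(4, 4))   # state transition matrix A
print(sentence_score(emissions, transitions, [0, 1, 2]))  # tag sequence B I E
```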
As a further improved technical scheme of the invention, the threshold combination neural network method comprises the following steps:
Step 1, defining the w character vectors to be combined as v_1, v_2 … v_w, wherein v_1, v_2 … v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l and computing the probability of combined memory through the reset gate r_l:

r_l = σ(W^(r)·v_l + b^(r))

wherein 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gates r_l to compute the semantic feature ṽ that results from aggregating the reset-gated character vectors r_l ⊙ v_l (1 ≤ l ≤ w) into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each character vector v_1, v_2 … v_w and the semantic feature ṽ, wherein a factor matrix W^(z) ∈ R^{d×d} is used as a shared parameter;
Step 4, using the update gates z_l to subject the character vectors v_1, v_2 … v_w and the semantic feature ṽ to selective mixing and combination processing, aggregating them into a word and obtaining the fixed-length vector v_w, wherein v_w ∈ R^d:

v_w = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l for 1 ≤ l ≤ w and the semantic feature ṽ for l = w+1.
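A minimal NumPy sketch of the threshold combination method is given below, assuming that the semantic feature is formed from the reset-gated sum of the character vectors and that the update gates are normalized over the w+1 candidate vectors; the parameter names and initialization are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(chars: np.ndarray, W_r, b_r, W_l, b_l, W_z) -> np.ndarray:
    """Combine w character vectors (shape (w, d)) into one d-dimensional vector.
    Reset gates weight each character, a tanh layer forms the aggregate semantic
    feature, and dimension-wise normalized update gates mix the w characters and
    the aggregate feature into the final fixed-length vector."""
    w, d = chars.shape
    r = sigmoid(chars @ W_r.T + b_r)                          # reset gates r_l, shape (w, d)
    v_hat = np.tanh((r * chars).sum(axis=0) @ W_l.T + b_l)    # aggregate semantic feature
    candidates = np.vstack([chars, v_hat])                    # v_1..v_w plus the aggregate, (w+1, d)
    scores = np.exp(candidates @ W_z.T)                       # unnormalized update-gate scores
    z = scores / scores.sum(axis=0, keepdims=True)            # normalize over the w+1 candidates
    return (z * candidates).sum(axis=0)                       # fused vector in R^d

d = 8
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=s) for s in [(d, d), (d,), (d, d), (d,), (d, d)]]
word_vec = gated_combine(rng.normal(size=(3, d)), *params)
print(word_vec.shape)  # (8,)
```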
As a further improved technical solution of the present invention, step S22 specifically includes:
S22.1, defining a reset gate r_l and computing the probability of influence on the target word ω:

r_l = σ(W^(r)·v_l^c + b^(r))

wherein 1 ≤ l ≤ w, v_l^c is the initial vector of the l-th context character within the window of the target word ω, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gates r_l to compute the semantic feature ṽ_c of the aggregate influence of the character vectors within the window on the target word ω, wherein the v_l^c are the initial vectors of the context within the window of the target word;
S22.3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each context character vector v_1^c, v_2^c … v_w^c and the semantic feature ṽ_c, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gates z_l to fuse the initial context vectors within the window of the target word ω and the semantic feature ṽ_c through selective mixing and combination processing, aggregating them into the combined vector c_t:

c_t = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l^c for 1 ≤ l ≤ w and ṽ_c for l = w+1.
As a further improved technical solution of the present invention, in step S24, after the input is fed into the deep learning model, the following steps are performed:
Step A1, using the threshold combination neural network method to perform the combined calculation of the character sequence in the window context of the current input v_t, obtaining the combined vector of the window context, denoted c_t;
Step A2, using the threshold combination neural network method to perform the combined calculation of all historical hidden-state outputs before the current sequence time t, obtaining the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time;
Step A4, at each current sequence time t of the sequence traversal, for the current input v_t, defining an update gate z_t that computes the update probability generated by the influence of the window-context combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time;
Step A5, at each current sequence time t of the sequence traversal, for the current input v_t, using the reset gate r_t to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time;
Step A6, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output.
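The exact gate equations of steps A3 to A6 are not reproduced above; the following sketch assumes a GRU-style interpolation in which the reset gate limits how much history enters the candidate ("energy") value and the update gate mixes the old hidden state with that candidate. All parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_step(v_t, c_t, h_prev, P):
    """One step in the spirit of steps A3-A6: a reset gate limits how much history
    enters the candidate value, an update gate interpolates between the old hidden
    state and the candidate. P is a dict of illustrative d x d matrices and biases."""
    r_t = sigmoid(P["Wr_h"] @ h_prev + P["Wr_v"] @ v_t + P["br"])          # step A3
    z_t = sigmoid(P["Wz_h"] @ h_prev + P["Wz_c"] @ c_t + P["bz"])          # step A4
    e_t = np.tanh(P["Wc_h"] @ (r_t * h_prev) + P["Wc_c"] @ c_t
                  + P["Wc_v"] @ v_t + P["bc"])                              # step A5
    h_t = z_t * h_prev + (1.0 - z_t) * e_t                                  # step A6
    return h_t

d = 8
rng = np.random.default_rng(2)
P = {k: rng.normal(scale=0.1, size=(d, d)) for k in
     ["Wr_h", "Wr_v", "Wz_h", "Wz_c", "Wc_h", "Wc_c", "Wc_v"]}
P.update({k: np.zeros(d) for k in ["br", "bz", "bc"]})
h = recurrent_step(rng.normal(size=d), rng.normal(size=d), np.zeros(d), P)
print(h.shape)  # (8,)
```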
As a further improved technical solution of the present invention, step S3 specifically includes:
S31, at each current sequence time t, segmenting all candidate words ending with the current target character according to the preset maximum word length, and for each candidate word, obtaining by table lookup the feature vector of each character in the candidate word and the corresponding label transition vector, which are linearly combined into the character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, using the threshold combination network method to fuse the character vectors v_l contained in the candidate word into one candidate word vector v_w;
S33, computing the inner product of the fused candidate word vector v_w with a shared weight vector parameter u to obtain the word score:

wordScore(w) = u · v_w

S34, sending the candidate word vectors v_w into the deep learning model and obtaining the historical features of the current candidate word by encoding; using a beam search algorithm with a preset beam width k, at each current sequence time t of the forward traversal of the sentence, always recording and storing the k best-scoring historical segmentations, the hidden-state output of the last word of a stored segmentation being h_t;
S35, at each current sequence time t of the sequence traversal, calculating the hidden-state output h_t;
S36, taking the produced h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the prediction value P_{t+1} being:

P_{t+1} = tanh(W^(p)·h_t + b^(p))

wherein W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at each current sequence time;
S37, inputting the candidate word vector v_w into the deep learning model and, based on the output h_t of the deep learning model, computing the prediction P_{t+1} for the next word; the deep learning model can acquire memory information over the whole previous word segmentation history and compute the link score linkScore(y_{t+1}) of a sequence:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}

S38, setting the beam width k, keeping the k highest-scoring results at each step and continuing the operation of new input on the kept segmentations, directly modelling word results with the complete segmentation history; defining the word sequence y[1:m] generated by the prediction of the deep learning model, the score function of the segmented word sequence is constructed as:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39, for a given character sequence x_i whose segmentation is denoted y_i, defining a structured margin loss for the predicted segmented sentences so as to construct the loss function and update the parameters backwards.
As a further improved technical solution of the present invention, step S35 specifically includes:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the currently input candidate word vector v_w, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.2, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, defining an update gate z_t that computes the update probability generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.3, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, using the reset gate r_t to strengthen the energy value generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.4, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the currently input candidate word vector v_w under the influence of the historical hidden-state output.
As a further improved technical solution of the present invention, step S39 specifically includes the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39.2, training with the maximum-margin method: for a given training sentence sequence sen[1:n], the correct segmented word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_t 1{ŷ_t^(i) ≠ y_t^(i)}

wherein μ is an attenuation parameter;
S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backwards through the loss function:

J(θ) = (1/|β|) · Σ_i l_i(θ) + (λ/2)·‖θ‖²

wherein:

l_i(θ) = max(0, score(ŷ^(i), sen^(i)) + Δ(y^(i), ŷ^(i)) − score(y^(i), sen^(i)))
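A minimal sketch of the loss of step S39 is given below, assuming a Hamming-style structured margin (number of wrongly labelled positions scaled by the attenuation parameter μ) and an L2 regularization term; the helper names and the regularization coefficient are illustrative.

```python
import numpy as np

def margin_loss(score_gold: float, score_pred: float,
                gold_tags: list, pred_tags: list, mu: float = 0.2) -> float:
    """Structured margin loss for one sentence: the predicted segmentation must
    trail the gold segmentation by a margin that grows with the number of
    wrongly labelled positions (mu is the attenuation parameter)."""
    delta = mu * sum(p != g for p, g in zip(pred_tags, gold_tags))
    return max(0.0, score_pred + delta - score_gold)

def training_loss(examples, theta_norm_sq: float, lam: float = 1e-4) -> float:
    """Average margin loss over the training set beta plus a 2-norm term."""
    hinge = np.mean([margin_loss(*ex) for ex in examples])
    return hinge + 0.5 * lam * theta_norm_sq

# Toy usage with two sentences (scores and tag sequences are made up):
examples = [(5.0, 4.2, [0, 1, 2], [0, 3, 2]), (3.1, 3.0, [3, 3], [3, 3])]
print(training_loss(examples, theta_norm_sq=10.0))
```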
the beneficial effects of the invention are: the method comprises the steps of initializing a word face vector of each training word, capturing historical characteristics, future characteristics and word characteristics carried by the training words by using a deep learning model, refining a distributed vector based on a threshold combination neural network method to represent corresponding candidate words, reformulating Chinese word segmentation into a direct segmentation learning task, directly evaluating the relative possibility of different word segmentation sentences, and then searching a word segmentation sequence with the highest score, thereby obtaining the word segmentation capability of more sequence levels.
Drawings
FIG. 1 is a flow chart of the deep learning-based Chinese word segmentation method of the present invention.
Fig. 2 is a schematic diagram of a threshold combining neural network method according to the present invention.
FIG. 3 is a schematic diagram of the improved LSTM model according to the present invention.
FIG. 4 is a diagram of the radial refinement architecture of a CRF layer in accordance with the present invention.
Fig. 5 is a schematic diagram of the beam search algorithm based on dynamic programming according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and detailed description.
As shown in fig. 1, a method for Chinese word segmentation based on deep learning includes the following steps:
S1, performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words (CBOW) model and the hierarchical normalization (HS) training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into vectors of fixed length and sending them into a deep learning model, wherein an improved LSTM model is selected as the deep learning model, and refining and updating the literal vectors in the dictionary V to obtain feature vectors carrying context semantics and vectors containing literal features;
and S3, for each training sentence, when training character by character, segmenting all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation by using a beam search method.
Wherein, step S1 specifically includes the following steps:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector;
S12, constructing over the dictionary V a complete Huffman tree based on character frequency, in which the characters of the dictionary V are all located at leaf nodes of the tree, and establishing a fast index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the idea of hierarchical normalization, and constructing the system objective function ℒ:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω (with j-th bit d_j^ω), x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried by the current branch node when computing with the context literal vector mean x_ω;
S14, defining the traversal of the path nodes as an iteration cycle; within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ and accumulates the semantic influence factor e := e + ∂ℒ/∂x_ω; after an iteration cycle is finished, each context literal vector v(ũ) within the window of the target word ω is updated as:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

wherein μ represents the learning rate, which is adjusted appropriately according to the rate of gradient change during training.
Wherein, step S13 specifically includes: inputting the corpora in the corpus set D sentence by sentence, traversing the training sentence S in order with a dynamic variable window, taking the center of the window as the target word ω and the other characters in the window as its context Context(ω); for each training sample (ω | Context(ω)), pre-computing in the Huffman tree the path l_ω from the root node to the target word ω and the code d_ω; training takes one traversal of the path nodes as an iteration cycle, takes the mean x_ω of the context literal vectors within the window of the target word ω as input, uses the sigmoid activation function

σ(x_ω·θ) = 1 / (1 + exp(−x_ω·θ))

overlaps the parameter θ by the gradient descent method, and computes the semantic influence of Context(ω) on the target word ω.
In the traversal, each branch node on the path is regarded as an implicit binary classifier; by judging whether each component of the Huffman code d_j^ω is 1 (left subtree node) or 0 (right subtree node), the influence factor on the semantics of the target word ω is computed for the context vector mean x_ω under the action of the parameter value θ_{j−1}^ω carried by the current branch node, and the system objective function ℒ given above is obtained, wherein the target word ω is the center of the window, l_ω is the path from the root node to the target word ω, d_ω is the code from the root node to the target word ω, and x_ω is the mean of the context literal vectors within the window of the target word ω.
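The per-sample update of steps S13 and S14 can be sketched as the standard hierarchical-softmax CBOW update that the notation above appears to describe; the function below is an illustrative NumPy rendering under that assumption, not the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_update(context_vecs, path_params, path_code, mu=0.025):
    """One iteration cycle of steps S13/S14: walk the Huffman path of the target
    word, update each branch-node parameter theta, accumulate the semantic
    influence factor e, then add e to every context literal vector in the window.
    path_code holds the Huffman bits d_j; mu is the learning rate."""
    x = context_vecs.mean(axis=0)          # mean context vector x_omega
    e = np.zeros_like(x)                   # accumulated semantic influence factor
    for theta, d in zip(path_params, path_code):
        q = sigmoid(x @ theta)             # binary decision at the branch node
        g = mu * (1.0 - d - q)             # gradient scale
        e += g * theta                     # accumulate influence on x_omega
        theta += g * x                     # overlap (update) the node parameter
    context_vecs += e                      # update every context literal vector
    return context_vecs

# Toy usage: 4 context characters with d = 50 and a Huffman path of length 3.
ctx = np.random.default_rng(0).normal(size=(4, 50))
params = [np.zeros(50) for _ in range(3)]
cbow_hs_update(ctx, params, path_code=[1, 0, 1])
```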
Step S2 specifically includes the following steps:
S21, for the current sequence time t, executing a table lookup from the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, for the window context of the target word ω, taking out according to the index the initial vectors of the context characters within the window of the target word ω, 1 ≤ l ≤ w, where w represents the window width, and expressing the context character vectors within the window as one combined vector c_t by using the threshold combination neural network method;
S23, for the current sequence time t, using the threshold combination neural network method, in the manner of step S22, to compute the selective historical output of the hidden states of the previous times 1 to t−1 as the input of the forward improved LSTM, and simultaneously computing the selective future output of the hidden states of times t+1 to n as the input of the backward improved LSTM;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector c_t as input and sending them into the Bi-LSTM architecture constructed from the improved LSTM model for training, generating the historical feature output h_t^fwd and the future feature output h_t^bwd;
S25, at the current sequence time t, linearly combining the historical feature output h_t^fwd and the future feature output h_t^bwd into h_t = [h_t^fwd ; h_t^bwd] and generating the network output o_t with the tanh activation function:

o_t = tanh(W^(o)·h_t + b^(o))

the network output o_t is updated into the dictionary V as the refined literal vector fused with context semantics, wherein W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at each current sequence time;
S26, using the training set β sentence by sentence and character by character, obtaining by table lookup the refined literal vectors fused with context semantics and sending them into a CRF layer to construct the sentence matrix expression as the observation state matrix, developing iterative training with the Viterbi algorithm, defining the sentence scoring formula, and determining the optimal labelling sequence:

score(x, y) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

wherein A_{y_i,y_{i+1}} is the state transition matrix;
the per-position label selection in the output label sequence uses the {B, I, E, S} labelling rule set, where B denotes the first character of a word, I a middle character of a word, E the last character of a word, and S a single-character word; the {B, I, E, S} rule set is combined with part-of-speech tags (e.g. S-V denotes a single-character verb) to obtain the optimal label transition matrix over the character sequence, and the optimal label selection of the characters in the training sentence is marked with the optimal label transition matrix as the character-feature basis for the subsequent word segmentation.
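For illustration, a {B, I, E, S} label sequence determines the segmentation directly; the small helper below (with an invented example sentence) shows how such labels are read back into words.

```python
def tags_to_words(chars: str, tags: list) -> list:
    """Recover a segmentation from a {B, I, E, S} label sequence:
    B starts a word, I continues it, E closes it, S is a single-character word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "I":
            buf += ch
        else:  # "E"
            words.append(buf + ch)
            buf = ""
    return words

print(tags_to_words("南京市长江大桥", ["B", "I", "E", "B", "I", "I", "E"]))
# ['南京市', '长江大桥']
```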
Wherein, the combined vector c_t in step S22 is obtained through the following specific steps:
S22.1, defining a reset gate r_l and computing the probability of influence on the target word ω:

r_l = σ(W^(r)·v_l^c + b^(r))

wherein 1 ≤ l ≤ w, v_l^c is the initial vector of the l-th context character within the window of the target word ω, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gates r_l to compute the semantic feature ṽ_c of the aggregate influence of the character vectors within the window on the target word ω, wherein the v_l^c are the initial vectors of the context within the window of the target word ω;
S22.3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each context character vector v_1^c, v_2^c … v_w^c and the semantic feature ṽ_c, wherein the v_l^c are the initial vectors of the context within the window of the target word and the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gates z_l to fuse the initial context vectors within the window of the target word ω and the semantic feature ṽ_c through selective mixing and combination processing, aggregating them into the combined vector c_t:

c_t = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l^c for 1 ≤ l ≤ w and ṽ_c for l = w+1.
As shown in fig. 2, the threshold combination neural network method mentioned in steps S22 and S23 includes the following steps:
Step 1, defining the w character vectors to be combined in a character group as v_1, v_2 … v_w, wherein v_1, v_2 … v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l and computing the probability of combined memory through the reset gate r_l:

r_l = σ(W^(r)·v_l + b^(r))

wherein 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gates r_l to compute the semantic feature ṽ that results from aggregating the reset-gated character vectors r_l ⊙ v_l (1 ≤ l ≤ w) into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining update gates z_l (1 ≤ l ≤ w+1) as d-dimensional normalized vectors expressing the update probability for fusing each character vector v_1, v_2 … v_w and the semantic feature ṽ, wherein a factor matrix W^(z) ∈ R^{d×d} is used as a shared parameter;
Step 4, using the update gates z_l to subject the character vectors v_1, v_2 … v_w and the semantic feature ṽ to selective mixing and combination, obtaining the fixed-length vector v_w, wherein v_w ∈ R^d:

v_w = Σ_{l=1}^{w+1} z_l ⊙ u_l

wherein u_l denotes v_l for 1 ≤ l ≤ w and ṽ for l = w+1.
As shown in fig. 3, the improved LSTM model in step S24 is used for capturing historical memory information in the sequential traversal of the sequence labelling problem, and includes the following steps:
Step A1, using the threshold combination neural network method to perform the combined calculation of the character sequence in the window context of the current input v_t, obtaining the combined vector of the window context, denoted c_t;
Step A2, using the threshold combination neural network method to perform the combined calculation of all historical hidden-state outputs before the current sequence time t, obtaining the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time;
Step A4, at each current sequence time t of the sequence traversal, for the current input v_t, defining an update gate z_t that computes the update probability generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time;
Step A5, at each current sequence time t of the sequence traversal, for the current input v_t, using the reset gate r_t to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time;
Step A6, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output.
In addition, in step S24, in the forward improved LSTM process, first a reset gate r_t^fwd is defined to compute the memory probability generated by the historical hidden-state output for the current input, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time.
Then an update gate z_t^fwd is defined to compute the update probability generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time.
Then, for the currently input target word, the reset gate r_t^fwd is used to strengthen the energy value generated by the influence of the combined vector c_t under the action of the historical hidden-state output, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time.
Finally, the update gate z_t^fwd mainly determines how much information needs to be forgotten and how much retained, whereby the hidden-state output h_t^fwd of the currently input target word under the influence of the historical hidden state and the window context is computed.
The backward improved LSTM process is similar to the forward process. First a reset gate r_t^bwd is defined to compute the prediction probability generated by the future hidden-state output for the current input, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at each current sequence time.
Then an update gate z_t^bwd is defined to compute the update probability generated by the influence of the combined vector c_t under the action of the incoming future hidden-state output, wherein the weight matrix W^(z) ∈ R^{d×d} and the offset vector b^(z) ∈ R^d are parameters shared at each current sequence time.
Then, for the currently input target word, the reset gate r_t^bwd is used to strengthen the energy value generated by the influence of the combined vector c_t under the action of the future prediction, wherein the weight matrix W^(c) ∈ R^{d×d} and the offset vector b^(c) ∈ R^d are parameters shared at each current sequence time.
Finally, the update gate z_t^bwd mainly determines how much information needs to be forgotten and how much retained, whereby the hidden-state output h_t^bwd of the currently input target word under the influence of the future prediction and the window context is computed.
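A short sketch of how the two directions are merged in step S25 is given below, where "linearly combined" is read as concatenation of the forward and backward outputs followed by a tanh projection with W^(o) ∈ R^{d×2d}; the names and random values are illustrative.

```python
import numpy as np

def refined_vector(h_fwd, h_bwd, W_o, b_o):
    """Concatenate the forward (historical) and backward (future) hidden outputs
    and map them back to a d-dimensional refined literal vector with tanh,
    as in step S25: W_o has shape (d, 2d)."""
    h_t = np.concatenate([h_fwd, h_bwd])      # combination of both directions
    return np.tanh(W_o @ h_t + b_o)

d = 8
rng = np.random.default_rng(3)
o_t = refined_vector(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d))
print(o_t.shape)  # (8,)
```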
Step S3 specifically includes:
S31, at each current sequence time t, segmenting all candidate words ending with the current target character according to the preset maximum word length, and for each candidate word, obtaining by table lookup the context-semantic feature vector of each character in the candidate word and the corresponding label transition vector, which are linearly combined into the character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, using the threshold combination network method to fuse the character vectors v_l contained in the candidate word into one candidate word vector v_w;
S33, computing the inner product of the fused candidate word vector v_w with a shared weight vector parameter u to obtain the word score:

wordScore(w) = u · v_w

S34, sending the candidate word vector v_w into the improved LSTM model and obtaining the historical features of the current candidate word by encoding; using a beam search algorithm with a preset beam width k, at each current sequence time t of the forward traversal of the sentence, always recording and storing the k best-scoring historical segmentations, the hidden-state output of the last word of a stored segmentation being h_t;
S35, at each current sequence time t of the sequence traversal, calculating the hidden-state output h_t;
S36, taking the produced h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the prediction value P_{t+1} being:

P_{t+1} = tanh(W^(p)·h_t + b^(p))

wherein W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at each current sequence time;
S37, inputting the candidate word vector v_w into the improved LSTM model and, based on the output h_t of the improved LSTM model, computing the prediction P_{t+1} for the next word; the improved LSTM model can acquire memory information over the whole previous word segmentation history and compute the link score linkScore(y_{t+1}) of a sequence:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}

S38, setting the beam width k, keeping the k highest-scoring results at each step and continuing the operation of new input on the kept segmentations, directly modelling word results with the complete segmentation history; defining the word sequence y[1:m] predicted and generated by the improved LSTM model, the score function of the segmented word sequence is constructed as:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39, for a given character sequence x_i whose segmentation is denoted y_i, defining a structured margin loss for the predicted segmented sentences.
Wherein, step S35 specifically includes:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t that computes the memory probability generated by the historical hidden-state output for the currently input candidate word vector v_w, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.2, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, defining an update gate z_t that computes the update probability generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.3, at each current sequence time t of the sequence traversal, for the currently input candidate word vector v_w, using the reset gate r_t to strengthen the energy value generated under the action of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at each current sequence time;
S35.4, at each current sequence time t of the sequence traversal, using the update gate z_t to compute the hidden-state output h_t of the currently input candidate word vector v_w under the influence of the historical hidden-state output.
Step S39 specifically includes the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:

score(y[1:m], sen[1:n]) = Σ_{j=1}^{m} (wordScore(y_j) + linkScore(y_j))

S39.2, training with the maximum-margin method: for a given training sentence sequence sen[1:n], the correct segmented word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss, with the attenuation parameter μ used to adjust the loss function value, is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_t 1{ŷ_t^(i) ≠ y_t^(i)}

S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backwards through the loss function:

J(θ) = (1/|β|) · Σ_i l_i(θ) + (λ/2)·‖θ‖²

wherein:

l_i(θ) = max(0, score(ŷ^(i), sen^(i)) + Δ(y^(i), ŷ^(i)) − score(y^(i), sen^(i)))
example 1
A Chinese word segmentation method based on deep learning comprises the following steps:
Step 1: performing character frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a basic distributed literal vector based on the CBOW model and the HS training method, and storing the obtained literal vectors into a dictionary V according to indexes.
Step 2: converting the training corpus sentence by sentence into vectors of fixed length and sending them into the improved bidirectional LSTM model; by training the parameters in the bidirectional LSTM model, the character-level literal vectors in the dictionary V are refined and updated to obtain the feature vectors carrying context semantics and the vectors containing literal features.
Step 3: when each training sentence is trained character by character, segmenting within the maximum word length all candidate words ending with the current character using the full-segmentation idea, fusing the refined character-level feature vectors into a word vector for each candidate word, incrementally connecting the candidate words with the previous word segmentation history, and performing dynamic word segmentation with the beam search method.
Specifically, in the first step the basic features of each Chinese character are extracted, and a dictionary V is constructed for the characters in the corpus set D by traversing the large-scale corpus set D; the dictionary V records, for each character of the training corpus, its literal form, its frequency and the corresponding character embedding vector representation. A complete Huffman tree is built over the dictionary V based on character frequency, the characters in the dictionary V are all located at leaf nodes of the tree, and a fast indexing and lookup mechanism is established through an auxiliary hash table. The literal vectors are trained using the CBOW model and the HS training method, and the system objective function is constructed as shown below:

ℒ = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω)·log σ(x_ω·θ_{j−1}^ω) + d_j^ω·log(1 − σ(x_ω·θ_{j−1}^ω)) ]

The training corpus in the corpus set D is input sentence by sentence and the training sentence S is traversed in order with a dynamic variable window, the center of the window being the target word ω. Within one iteration cycle the training overlaps the parameter θ with the gradient ∂ℒ/∂θ while accumulating the semantic influence factor e := e + ∂ℒ/∂x_ω. After an iteration cycle ends, the literal vector v(ũ) of each context character within the window of the target word ω is updated by:

v(ũ) := v(ũ) + μ·e, ũ ∈ Context(ω)

wherein μ represents the learning rate, which is adjusted appropriately according to the rate of gradient change during training.
In the second step, each input training sentence is trained character by character; the corresponding initial vector is taken out of the dictionary V according to the index, the vector dimension d being taken as 50 in training; the literal vectors in the corresponding context window are taken out according to the index and combined into one combined vector c_t with the threshold combination neural network, which expresses the effect of the window context on the target word.
The literal vector of the target character and the combined vector c_t of the window context are input into the improved Bi-LSTM model: the input of the forward LSTM is the sequence traversed from left to right, and the input of the backward LSTM is the sequence traversed from right to left. Finally, the outputs of the two hidden-layer units are concatenated as the output of the hidden layer of the whole network.
A label T is output, and the per-position label selection uses the {B, I, E, S} labelling rule set, where B represents the first character of a word, I represents a middle character of a word, E represents the last character of a word, and S represents a single-character word; S is combined with part-of-speech tags, for example S-V represents a single-character verb; there are 13 part-of-speech tags, giving a combined label set of 52 labels. The hidden-layer output is mapped and a nonlinear transformation is applied as the output:
o_i = tanh(w_o · [h_i1, h_i2] + b_o)
The sentence score formula is defined as:

score(sen, y; θ) = Σ_{i=1}^{N} (A_{y_i,y_{i+1}} + o_{i,y_i})

where θ is the set of model parameters, A_{y_i,y_{i+1}} is the transition state matrix and N is the number of characters in the training sentence.
A dynamic programming algorithm is used in the decoding process, and the labelling sequence finally selected is the sequence with the highest computed score:

y* = argmax_{y ∈ Y_X} score(sen, y; θ)

wherein Y_X is the set of all possible labelling sequences. All labels are normalized with softmax and the score is computed as the conditional label-path probability; taking its logarithm gives the conditional log-likelihood of the valid path, and the parameters and the literal vectors are updated by backward training. The network trains the parameters by maximizing the likelihood function of the sentence labels.
The model output is sent to a CRF layer, the optimal labelling sequence is determined with the Viterbi algorithm, the sentence label score is formed from the transition state matrix and the network output, and the parameters and the literal vectors are updated backwards by maximizing the likelihood function of the sentence labels. The trained literal vector is the feature vector carrying context information that needs to be extracted. In the state transition matrix, each row represents the possibility that each possible label of the previous character transfers to a given label of the current character; the transition vector corresponding to the optimal label sequence is extracted as the feature vector carrying literal features.
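The dynamic-programming decoding over the transition matrix and the network outputs can be illustrated with a standard Viterbi routine; the sketch below uses illustrative names and random scores and is not taken from the patent text.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Find the highest-scoring tag sequence for one sentence.
    emissions: (N, T) network outputs o_i; transitions: (T, T) matrix A."""
    N, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((N, T), dtype=int)
    for i in range(1, N):
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)     # best previous tag for each current tag
        score = total.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(N - 1, 0, -1):          # backtrack
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

rng = np.random.default_rng(4)
print(viterbi_decode(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```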
The third step is the word-formation operation. The training sentence set labeled in the second step is traversed character by character, and for each character in a training sentence its vector is extracted by a lookup-table operation. According to a preset maximum word length (for example, 4), all candidate words ending with the current character within that length are enumerated following the idea of forward full segmentation, and for each candidate word the character-level vectors are fused into a word-level distributed representation by the threshold combination neural network (GCNN), where L is the number of characters in the current candidate word.
The fused word-level vector is then taken in inner product with a shared weight-vector parameter u to calculate the score of the current candidate word (wordScore).
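For example, the enumeration of candidate words ending at the current character and their scoring against the shared vector u can be sketched as follows in Python; fuse_gcnn is a simple stand-in (a mean) for the threshold combination network so that the example runs, and all names are illustrative assumptions.

import numpy as np

def fuse_gcnn(char_vectors):
    # stand-in for the threshold combination network: a simple mean,
    # used here only so that the candidate-word scoring below is runnable
    return np.mean(char_vectors, axis=0)

def candidate_word_scores(char_vectors, t, u, max_word_len=4):
    # all candidate words ending at character position t within the maximum
    # word length (forward full segmentation), each scored by u . v_word
    scores = {}
    for length in range(1, min(max_word_len, t + 1) + 1):
        start = t - length + 1
        word_vec = fuse_gcnn(char_vectors[start:t + 1])
        scores[(start, t)] = float(np.dot(u, word_vec))
    return scores

chars = np.random.randn(6, 50)     # six 50-dimensional refined character vectors
u = np.random.randn(50)            # shared weight-vector parameter
print(candidate_word_scores(chars, t=3, u=u))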
The word-level vector is then input into the improved LSTM model, and the historical feature representation of the current candidate word is obtained through encoding.
Because the model performs full segmentation up to the maximum word length using dynamic programming, the number of possible segmentation results grows exponentially with the length of the character sequence, so the scores of all possible results cannot be computed. Moreover, because the score of the current candidate word introduces the segmentation history of that candidate, the traditional Viterbi algorithm is no longer the best choice. To perform better in practical applications, the model instead uses the cluster (beam) search algorithm to find, for each candidate word, the matching hidden-layer output h_{j-1} of the previous step, and decodes from there. The detailed procedure of solving the optimization with the cluster search algorithm is shown schematically in FIG. 5.
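As a concrete illustration, here is a minimal sketch of such a cluster (beam) search over candidate words in Python; word_score and link_score are assumed stand-ins for the word score and connection score defined below, and the update of the LSTM history state is deliberately abstracted away, so this is not the claimed algorithm itself.

def beam_search_segment(sentence, word_score, link_score, k=8, max_word_len=4):
    # keep the k best partial segmentations while scanning the sentence;
    # each beam item is (score, list of words, history state)
    beams_at = {0: [(0.0, [], None)]}
    n = len(sentence)
    for end in range(1, n + 1):
        candidates = []
        for length in range(1, min(max_word_len, end) + 1):
            start = end - length
            word = sentence[start:end]
            for score, seg, state in beams_at.get(start, []):
                new_score = score + word_score(word) + link_score(state, word)
                candidates.append((new_score, seg + [word], state))   # state update abstracted away
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams_at[end] = candidates[:k]                                # prune to beam width k
    return beams_at[n][0][1] if beams_at.get(n) else []

print(beam_search_segment("南京市长江大桥",
                          word_score=lambda w: -abs(len(w) - 2),     # toy scorer favouring 2-character words
                          link_score=lambda s, w: 0.0,
                          k=4))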
In this algorithm, every candidate word generated at each time t while traversing the training sentence is integrated with its matching segmentation-history memory through the improved LSTM model, which provides a prediction for the candidate words that may be generated at time t+1:

P_{t+1} = tanh(w_p · h_t + b_p)

from which the connection score at time t+1 is computed:

linkScore(y_{t+1}) = P_{t+1} · y_{t+1}
With a preset beam (bundling) width k, the k highest-scoring partial segmentations are always kept as the search moves forward one step at a time. In this process the complete segmentation history is used effectively and possible future segmentations are also predicted, giving the model sentence-level discrimination ability. Assuming that for a given training sentence sequence sen[1:n] the model predicts a word sequence y[1:m], the score function of the segmented word sequence is:

score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t))
Training uses the max-margin criterion. For a given training sentence sequence sen[1:n], the correct word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i). The structured margin loss is defined as:

Δ(y^(i), ŷ^(i)) = μ · Σ_{t} 1{ŷ_t^(i) ≠ y_t^(i)}

that is, μ times the number of predicted words that do not match the correct segmentation.
where μ is an attenuation parameter that adjusts the magnitude of the loss and plays a smoothing role. Given a training set β, a 2-norm regularization term is added to the loss function, and the parameters are updated backward through this loss:

J(θ) = (1/|β|) Σ_{i∈β} l_i(θ) + (λ/2) · ||θ||²

where:

l_i(θ) = max_{ŷ} ( score(ŷ; θ) + Δ(y^(i), ŷ) − score(y^(i); θ) )
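A minimal Python sketch of such a regularized max-margin objective is given below; the way the margin counts mismatched words, and the names margin_loss and regularized_objective, are assumptions made for the example.

import numpy as np

def margin_loss(gold_words, pred_words, gold_score, pred_score, mu=0.2):
    # hinge loss with a structured margin that grows with the number of
    # predicted words that are not in the gold segmentation
    delta = mu * sum(1 for w in pred_words if w not in gold_words)
    return max(0.0, pred_score + delta - gold_score)

def regularized_objective(losses, params, lam=1e-4):
    # average structured loss over the training set plus an L2 (2-norm) term
    l2 = 0.5 * lam * sum(float(np.sum(p ** 2)) for p in params)
    return float(np.mean(losses)) + l2

print(margin_loss(["南京", "市长"], ["南京市", "长"], gold_score=3.1, pred_score=2.8))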
the Adagard algorithm is adopted for model optimization, and a dropout regularization technology is used for preventing overfitting.
In the present invention, FIG. 4 is a character-level Bi-LSTM-CRF vector refinement architecture diagram; FIG. 5 is a schematic diagram of the cluster (beam) search algorithm based on dynamic programming.
In summary, the invention has the following advantages:
(1) more abstract, essential features of the data can be captured without manual feature engineering or excessive prior knowledge, and the text is represented by distributed vectors;
(2) the improved Bi-LSTM model is used during literal-vector refinement, so that each character-level vector obtains information from both the preceding and the following text, and the attention-based method acquires long-range context information more effectively;
(3) lexeme (label) features of each character are acquired during literal-vector refinement, and using this character-level information to assist segmentation improves its accuracy;
(4) semantic information of the characters in the window around the target character is added while training the refined literal vector, which strengthens the extraction of the target character's vector features;
(5) in the word-vector fusion method based on the gated combination neural network (GCNN), a shared parameter matrix is used during computation, which reduces the dimensionality of the parameter matrices to be trained and improves computational efficiency, while the gating mechanism fuses the character vectors more reasonably;
(6) the cluster (beam) search algorithm keeps the k highest-scoring results at each step and lets newly segmented candidate words continue to operate on the retained segmentations; compared with the Viterbi algorithm this saves considerable decoding time and allows the segmentation result to be modeled directly with the complete segmentation history.
The above embodiments are intended only to illustrate the invention, not to limit the technical solutions it describes. Although this specification has described the invention in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions may still be made to it, and all technical solutions and their modifications that do not depart from the spirit and scope of the invention should be covered by the claims of the invention.

Claims (8)

1. A Chinese word segmentation method based on deep learning is characterized by comprising the following steps:
S1, performing word frequency statistics on the large-scale corpus D, initializing each character in the corpus D into a literal vector based on a continuous bag-of-words model and a hierarchical normalization training method, and storing the obtained literal vectors into a dictionary V according to indexes;
S2, converting the training corpus sentence by sentence into fixed-length vectors and sending them into a deep learning model, whereby the literal vectors in the dictionary V are refined and updated, and a feature vector carrying context semantics and a vector containing lexeme features are obtained;
S3, for each training sentence, during character-by-character training, dividing out all candidate words ending with the current target character according to the preset maximum word length, fusing the refined feature vectors into the word vector of each candidate word, connecting the candidate words incrementally with the previous segmentation history, and performing dynamic word segmentation using the cluster searching method;
wherein, step S1 specifically includes:
S11, extracting the basic features of each Chinese character and constructing a dictionary V by traversing the corpus set D, wherein the dictionary V records, for the training corpus, the character face, the character frequency and the corresponding character embedding vector;
S12, constructing from the dictionary V a complete Huffman tree based on character frequency, wherein the characters in the dictionary V are all located at the leaf nodes of the tree, and establishing a quick index and lookup mechanism through an auxiliary hash table;
S13, initializing each character in the corpus D into a literal vector based on the continuous bag-of-words model and the idea of hierarchical normalization, and constructing the system objective function

L = Σ_{ω∈D} Σ_{j=2}^{l_ω} [ (1 − d_j^ω) · log σ(x_ω · θ_{j−1}^ω) + d_j^ω · log σ(−x_ω · θ_{j−1}^ω) ]

wherein the target word ω is the center of the window, l_ω is the length of the path from the root node to the target word ω, d_j^ω is the code of the j-th node on the path from the root node to the target word ω, x_ω is the mean of the context literal vectors within the window of the target word ω, and θ_{j−1}^ω is the parameter value carried at the current branch point when computing with the mean context literal vector x_ω;
S14, defining the traversal of the path nodes as one iteration cycle; within the cycle, training proceeds with the gradient

g = 1 − d_j^ω − σ(x_ω · θ_{j−1}^ω)

by which the parameter θ_{j−1}^ω is updated in superposition, θ_{j−1}^ω := θ_{j−1}^ω + μ · g · x_ω, and the semantic influence factor e is accumulated, e := e + g · θ_{j−1}^ω; after an iteration cycle is finished, each context literal vector v(c) within the window of the target word ω is updated as:

v(c) := v(c) + μ · e

where μ represents the learning rate.
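For illustration, one plausible reading of steps S13–S14 as a single CBOW / hierarchical-softmax training step is sketched below in Python; the function name cbow_hs_step and all variable names are assumptions, and the update rule follows the standard continuous bag-of-words formulation rather than being a verbatim transcription of the claim.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(context_vecs, path_nodes, path_codes, node_params, lr=0.025):
    # one CBOW / hierarchical-softmax step for a single target character:
    # context_vecs - d-dimensional literal vectors of the window context
    # path_nodes   - indices of inner Huffman-tree nodes from root to target
    # path_codes   - code bits d_j along that path
    # node_params  - matrix of inner-node parameter vectors theta
    x = np.mean(context_vecs, axis=0)            # mean of the window context vectors
    e = np.zeros_like(x)                         # accumulated semantic influence factor
    for node, code in zip(path_nodes, path_codes):
        theta = node_params[node]
        g = lr * (1 - code - sigmoid(np.dot(x, theta)))
        e += g * theta                           # accumulate influence on the context
        node_params[node] = theta + g * x        # update the branch-node parameter
    for v in context_vecs:                       # after the cycle, update window vectors
        v += e
    return e

d = 50
context = [np.random.randn(d) for _ in range(4)]
params = np.random.randn(10, d)
cbow_hs_step(context, path_nodes=[0, 3, 7], path_codes=[1, 0, 1], node_params=params)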
2. The method for Chinese word segmentation based on deep learning of claim 1, wherein the step S2 specifically comprises:
S21, for the current sequence time t, performing a lookup-table operation on the dictionary V according to the index to obtain the initial vector of the target word ω;
S22, extracting according to the index the initial vectors of the context within the window of the target word ω, where w represents the window width, and expressing the environment word vectors within the window as one combined vector using the threshold combined neural network method;
S23, for the current sequence time t, using the threshold combined neural network method of step S22 to compute the selective historical output of the hidden states at the previous times 1 to t−1, and simultaneously computing the selective future output of the hidden states at times t+1 to n;
S24, at the current sequence time t, taking the initial vector of the target word ω and the combined vector as input and feeding them into the deep learning model to generate the historical feature output and the future feature output, respectively;
S25, at the current sequence time t, linearly combining the historical feature output and the future feature output to form h_t, generating the network output with the tanh activation function, and updating this network output back into the dictionary V to obtain the refined literal vector fused with context semantics, wherein the weight matrix W^(o) ∈ R^{d×2d} and the offset vector b^(o) ∈ R^d are parameters shared at every current sequence time;
S26, using the training set β, obtaining the refined literal vectors fused with context semantics by sentence-by-sentence, character-by-character table lookup to construct the sentence matrix expression as the state matrix, carrying out iterative training with the Viterbi algorithm, and defining the sentence scoring formula

s(X, y; θ) = Σ_{i=1}^{N} (A_{y_i, y_{i+1}} + o_{i, y_i})

to determine the optimal labeling sequence, wherein A_{y_i, y_{i+1}} is the state transition matrix; the output label sequence selection uses the {BIES} labeling rule set, in which B represents the first character of a word, I represents a character in the middle of a word, E represents the last character of a word, and S represents a single-character word; the {BIES} rule set is combined with the part-of-speech tags to obtain the optimal label transition matrix over the character sequence.
3. The deep learning-based Chinese word segmentation method of claim 2, wherein the threshold-combined neural network method comprises the steps of:
Step 1, defining the w character vectors to be combined as v_1, v_2, ..., v_w, where v_1, v_2, ..., v_w ∈ R^d; defining a weight matrix W^(r) ∈ R^{d×d} and an offset vector b^(r) ∈ R^d as shared parameters; defining a reset gate r_l, which calculates the probability of combined memory as:
r_l = σ(W^(r) · v_l + b^(r))
where 1 ≤ l ≤ w;
Step 2, in the character combination, using the reset gate r_l to calculate the semantic feature that results from aggregating the character vectors v_1, v_2, ..., v_w into the target word, wherein the weight matrix W^(l) ∈ R^{d×d} and the offset vector b^(l) ∈ R^d are shared parameters;
Step 3, defining an update gate z_l (1 ≤ l ≤ w+1) as a d-dimensional normalized vector that expresses the update probability for fusing each character vector v_1, v_2, ..., v_w and the semantic feature, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
Step 4, using the update gate z_l to selectively mix and combine the character vectors v_1, v_2, ..., v_w and the semantic feature, obtaining the fixed-length vector v_w, where v_w ∈ R^d and 1 ≤ l ≤ w+1.
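A plausible NumPy sketch of the reset-gate / update-gate combination described in claim 3 follows; the elementwise softmax normalization of the w+1 update gates and the exact form of the semantic feature are assumptions consistent with the stated dimensions, not formulas taken from the claim.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(char_vecs, W_r, b_r, W_l, b_l, W_z):
    # combine w character vectors (each in R^d) into one d-dimensional vector
    resets = [sigmoid(W_r @ v + b_r) for v in char_vecs]          # reset gates (step 1)
    semantic = np.tanh(W_l @ sum(r * v for r, v in zip(resets, char_vecs)) + b_l)  # step 2 (assumed form)
    inputs = char_vecs + [semantic]                               # w vectors plus the semantic feature
    logits = np.stack([W_z @ u for u in inputs])                  # (w + 1, d)
    gates = np.exp(logits) / np.exp(logits).sum(axis=0)           # elementwise normalization (step 3, assumed)
    return sum(z * u for z, u in zip(gates, inputs))              # fused fixed-length vector (step 4)

d = 50
chars = [np.random.randn(d) for _ in range(3)]
rand = lambda: np.random.randn(d, d) * 0.1
print(gated_combine(chars, rand(), np.zeros(d), rand(), np.zeros(d), rand()).shape)   # (50,)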
4. The deep learning-based Chinese word segmentation method of claim 3, wherein the step S22 specifically comprises:
S22.1, defining a reset gate r_l and calculating the influence probability on the target word ω:
r_l = σ(W^(r) · v_l + b^(r))
wherein 1 ≤ l ≤ w, and the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared among the character vectors;
S22.2, using the reset gate r_l to calculate the semantic feature of the aggregate influence of the character vectors within the window on the target word ω, wherein the inputs are the initial vectors of the context within the window of the target word;
S22.3, defining an update gate z_l (1 ≤ l ≤ w+1) as a d-dimensional normalized vector that expresses the update probability for fusing each character vector v_1, v_2, ..., v_w and the semantic feature, wherein the factor matrix W^(z) ∈ R^{d×d} is a shared parameter;
S22.4, using the update gate z_l to selectively mix and combine the initial vectors of the context within the window of the target word ω and the semantic feature, obtaining the combined vector by aggregation, wherein 1 ≤ l ≤ w.
5. The method for Chinese word segmentation based on deep learning of claim 2, wherein feeding the inputs into the deep learning model in step S24 comprises the following steps:
Step A1, using the threshold combination neural network method, performing the combination calculation over the character sequence in the window-context environment of the current input v_t, to obtain the combined vector of the window context;
Step A2, using the threshold combination neural network method, performing the combination calculation over all historical hidden-state outputs before the current sequence time t, to obtain the historical hidden-state output;
Step A3, at each current sequence time t of the sequence traversal, defining a reset gate r_t that calculates the memory probability of the historical hidden-state output with respect to the current input v_t, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A4, at each current sequence time t of the sequence traversal, defining for the current input v_t an update gate z_t that calculates the update probability produced by the combined vector of the window context under the influence of the historical hidden-state output, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A5, at each current sequence time t of the sequence traversal, using the reset gate r_t to strengthen the energy value produced by the combined vector for the current input v_t under the influence of the historical hidden-state output, wherein the weight matrix W^(r) ∈ R^{d×d} and the offset vector b^(r) ∈ R^d are parameters shared at every current sequence time;
Step A6, at each current sequence time t of the sequence traversal, calculating through the update gate z_t the hidden-state output h_t of the current input v_t under the influence of the historical hidden-state output, where z_t is the update gate.
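Steps A3–A6 describe a gated recurrent update; one possible GRU-style reading is sketched below in Python. The exact operands of each gate are not fully specified by the claim text, so the formulas here are illustrative assumptions only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(v_t, c_t, h_hist, params):
    # one update at sequence time t:
    # v_t    - current input vector
    # c_t    - combined vector of the window context (step A1)
    # h_hist - aggregated historical hidden-state output (step A2)
    W_r, W_z, W_h, b = params
    r_t = sigmoid(W_r @ v_t + b)                       # memory probability of the history (A3, assumed form)
    z_t = sigmoid(W_z @ v_t + b)                       # update probability (A4, assumed form)
    cand = np.tanh(W_h @ (r_t * h_hist + c_t) + b)     # candidate "energy" value (A5, assumed form)
    return z_t * h_hist + (1.0 - z_t) * cand           # hidden-state output h_t (A6, assumed form)

d = 50
params = tuple(np.random.randn(d, d) * 0.1 for _ in range(3)) + (np.zeros(d),)
print(gated_step(np.random.randn(d), np.random.randn(d), np.zeros(d), params).shape)   # (50,)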
6. The method for Chinese word segmentation based on deep learning of claim 5, wherein the step S3 specifically comprises:
S31, at each current sequence time t, segmenting out all candidate words ending with the current target character according to the preset maximum word length; for each candidate word, obtaining by table lookup the feature vector of each character in the candidate word and the corresponding label-transfer vector, and linearly combining them into a character vector v_l, wherein 1 ≤ l ≤ L and L is the number of characters contained in the current candidate word;
S32, fusing the character vectors v_l contained in the candidate word into one candidate word vector using the threshold combination network method;
S33, taking the inner product of the fused candidate word vector with the shared weight-vector parameter u to obtain the word score wordScore;
S34, sending the candidate word vector into the deep learning model and obtaining the historical feature of the current candidate word through encoding; using the cluster searching algorithm with a preset beam (bundling) width k, at each current sequence time t of the forward traversal of the sentence, always recording and keeping the k historical segmentations with the better scores, the hidden-state output of the final word of a kept segmentation being h_t;
S35, calculating the hidden-state output h_t at each current sequence time t of the sequence traversal;
S36, taking the generated h_t as input and using the tanh activation function to provide a prediction for the candidate words that may be generated at time t+1, the predicted value P_{t+1} being:
P_{t+1} = tanh(W^(p) · h_t + b^(p))
wherein the weight matrix W^(p) ∈ R^{d×d} and the offset vector b^(p) ∈ R^d are parameters shared at every current sequence time;
S37, inputting the candidate word vector into the deep learning model and, based on the output h_t of the deep learning model, which carries the memory of the entire preceding segmentation history, calculating the prediction P_{t+1} of the next word and the connection score of the sequence:
linkScore(y_{t+1}) = P_{t+1} · y_{t+1};
S38, setting the beam (bundle) width k, keeping the k highest-scoring results at each step and letting each new input continue to operate on the kept segmentations, so that the word result is modeled directly with the complete segmentation history; the word sequence y[1:m] generated by the prediction of the deep learning model is defined, and the score function of the segmented word sequence is constructed as:
score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t));
S39, for a given character sequence x^(i), its correct segmentation is denoted y^(i); a structured margin loss over the predicted segmented sentences is defined to construct the loss function, and the parameters are updated in the backward pass.
7. The method for Chinese word segmentation based on deep learning of claim 6, wherein the step S35 specifically comprises:
S35.1, at each current sequence time t of the sequence traversal, defining a reset gate r_t and calculating the memory probability produced by the historical hidden-state output for the currently input candidate word vector, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.2, at each current sequence time t of the sequence traversal, defining for the currently input candidate word vector an update gate z_t and calculating the update probability produced under the influence of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.3, at each current sequence time t of the sequence traversal, using the reset gate r_t to strengthen the energy value produced by the currently input candidate word vector under the influence of the historical hidden-state output, wherein the weight matrix and the offset vector are parameters shared at every current sequence time;
S35.4, at each current sequence time t of the sequence traversal, calculating through the update gate z_t the hidden-state output h_t of the currently input candidate word vector under the influence of the received historical hidden-state output.
8. The method for Chinese word segmentation based on deep learning of claim 6, wherein the step S39 specifically comprises the following steps:
S39.1, for a given training sentence sequence sen[1:n], a word sequence y[1:m] is generated through model prediction, and the score function of the segmented word sequence is:
score(y[1:m]) = Σ_{t=1}^{m} (wordScore(y_t) + linkScore(y_t));
S39.2, training with the max-margin method: for a given training sentence sequence sen[1:n], the correct word sequence is denoted y^(i) and the word sequence predicted by the model is denoted ŷ^(i); the structured margin loss is defined as:
Δ(y^(i), ŷ^(i)) = μ · Σ_{t} 1{ŷ_t^(i) ≠ y_t^(i)}
wherein μ is an attenuation parameter;
S39.3, given a training set β, a loss function with a 2-norm term is added, and the parameters are updated backward through the loss function:
J(θ) = (1/|β|) Σ_{i∈β} l_i(θ) + (λ/2) · ||θ||²
wherein:
l_i(θ) = max_{ŷ} ( score(ŷ; θ) + Δ(y^(i), ŷ) − score(y^(i); θ) ).
CN201810756452.0A 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning Active CN109086267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810756452.0A CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810756452.0A CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN109086267A CN109086267A (en) 2018-12-25
CN109086267B true CN109086267B (en) 2022-07-26

Family

ID=64837409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810756452.0A Active CN109086267B (en) 2018-07-11 2018-07-11 Chinese word segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN109086267B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284358B (en) * 2018-09-05 2020-08-28 普信恒业科技发展(北京)有限公司 Chinese address noun hierarchical method and device
CN109543764B (en) * 2018-11-28 2023-06-16 安徽省公共气象服务中心 Early warning information validity detection method and detection system based on intelligent semantic perception
CN110059188B (en) * 2019-04-11 2022-06-21 四川黑马数码科技有限公司 Chinese emotion analysis method based on bidirectional time convolution network
CN110222329B (en) * 2019-04-22 2023-11-24 平安科技(深圳)有限公司 Chinese word segmentation method and device based on deep learning
CN110334338B (en) * 2019-04-29 2023-09-19 北京小米移动软件有限公司 Word segmentation method, device and equipment
CN110263320B (en) * 2019-05-05 2020-12-11 清华大学 Unsupervised Chinese word segmentation method based on special corpus word vectors
CN110287961B (en) * 2019-05-06 2024-04-09 平安科技(深圳)有限公司 Chinese word segmentation method, electronic device and readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110287180B (en) * 2019-06-25 2021-12-28 上海诚数信息科技有限公司 Wind control modeling method based on deep learning
CN110516229B (en) * 2019-07-10 2020-05-05 杭州电子科技大学 Domain-adaptive Chinese word segmentation method based on deep learning
CN110502746B (en) * 2019-07-18 2021-04-09 北京捷通华声科技股份有限公司 Online domain updating decoding method and device, electronic equipment and storage medium
CN110489555B (en) * 2019-08-21 2022-03-08 创新工场(广州)人工智能研究有限公司 Language model pre-training method combined with similar word information
CN111126037B (en) * 2019-12-18 2021-10-29 昆明理工大学 Thai sentence segmentation method based on twin cyclic neural network
CN111160009B (en) * 2019-12-30 2020-12-08 北京理工大学 Sequence feature extraction method based on tree-shaped grid memory neural network
CN111274801A (en) * 2020-02-25 2020-06-12 苏州跃盟信息科技有限公司 Word segmentation method and device
CN111816169B (en) * 2020-07-23 2022-05-13 思必驰科技股份有限公司 Method and device for training Chinese and English hybrid speech recognition model
CN112036183B (en) * 2020-08-31 2024-02-02 湖南星汉数智科技有限公司 Word segmentation method, device, computer device and computer storage medium based on BiLSTM network model and CRF model
CN112150251A (en) * 2020-10-09 2020-12-29 北京明朝万达科技股份有限公司 Article name management method and device
CN113781139A (en) * 2020-10-19 2021-12-10 北京沃东天骏信息技术有限公司 Item recommendation method, item recommendation device, equipment and medium
CN112416931A (en) * 2020-11-18 2021-02-26 脸萌有限公司 Information generation method and device and electronic equipment
CN112559729B (en) * 2020-12-08 2022-06-24 申德周 Document abstract calculation method based on hierarchical multi-dimensional transformer model
CN112860889A (en) * 2021-01-29 2021-05-28 太原理工大学 BERT-based multi-label classification method
CN112905591B (en) * 2021-02-04 2022-08-26 成都信息工程大学 Data table connection sequence selection method based on machine learning
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN114638222B (en) * 2022-05-17 2022-08-16 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof
CN115455987B (en) * 2022-11-14 2023-05-05 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268444B (en) * 2018-01-10 2021-11-02 南京邮电大学 Chinese word segmentation method based on bidirectional LSTM, CNN and CRF

Also Published As

Publication number Publication date
CN109086267A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
CN109086267B (en) Chinese word segmentation method based on deep learning
CN109344391B (en) Multi-feature fusion Chinese news text abstract generation method based on neural network
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN107844469B (en) Text simplification method based on word vector query model
CN109800437B (en) Named entity recognition method based on feature fusion
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
CN111046179B (en) Text classification method for open network question in specific field
CN106569998A (en) Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN107818084B (en) Emotion analysis method fused with comment matching diagram
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN110046223B (en) Film evaluation emotion analysis method based on improved convolutional neural network model
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN111581970B (en) Text recognition method, device and storage medium for network context
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN112163089A (en) Military high-technology text classification method and system fusing named entity recognition
Chen et al. Chinese Weibo sentiment analysis based on character embedding with dual-channel convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant