CN109766523A - Part-of-speech tagging method and labeling system - Google Patents
Abstract
The present invention provides a part-of-speech tagging method and labeling system. The method comprises: step A-1: splitting the text to be tagged into sentences and segmenting it into words, forming the first input text; step A-2: detecting whether the first input text contains rare words; if so, replacing the rare words of the first input text with a preset character to form the second input text; if not, letting the second input text equal the first input text; step A-3: converting the first input text into word vectors V1 and the second input text into word vectors V2; step A-4: feeding V1 into a CNN model, which outputs the word feature vector V1'; step A-5: feeding V2 into a BGRU model, which outputs the word feature vector V2'; step A-6: concatenating V1, V1' and V2' to obtain V3, feeding V3 into a BLSTM model, and feeding the output of the BLSTM model into a CRF model, which outputs the part-of-speech label of every word of the text to be tagged. The part-of-speech tagging method provided by the invention improves tagging accuracy for both normal words and rare words.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a part-of-speech tagging method and labeling system.
Background
Part-of-speech tagging (POS tagging) assigns a part of speech to each word in a given sentence. It is a cornerstone of deeper natural language processing and provides the foundation for higher-level tasks such as machine translation, speech recognition, and information retrieval.
With the development of neural network techniques, new models are constantly being proposed, and the introduction of neural networks has further improved tagging accuracy. Among others, Yoav Goldberg made progress on the part-of-speech tagging of rare and out-of-vocabulary words based on the BLSTM (bidirectional long short-term memory) model. Today, the model most widely used in the field is the CNN (convolutional neural network) + BLSTM + CRF (conditional random field) model.
However, the CNN+BLSTM+CRF model has low tagging accuracy for rare and out-of-vocabulary words, where a rare word is a word that occurs with low frequency in the corpus.
The CNN+BLSTM+CRF model reads the features of normal words and rare words together without distinction, and the parts of speech of rare words tend to cluster in a few classes such as nouns, which hurts tagging accuracy for both rare words and normal words.
Summary of the invention
The present invention provides a part-of-speech tagging method and labeling system that improve tagging accuracy for both normal words and rare words.
The part-of-speech tagging method of the invention uses a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model, and a conditional random field (CRF) model, and comprises the following steps:
Step A-1: split the text to be tagged into sentences and segment it into words, forming the first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the CNN model; the CNN model outputs the word feature vector V1';
Step A-5: feed V2 into the BGRU model; the BGRU model outputs the word feature vector V2';
Step A-6: concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the output of the BLSTM model into the CRF model, and the CRF model outputs the part-of-speech label of every word of the text to be tagged.
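Steps A-1 and A-2 above can be sketched in a few lines of Python. This is a minimal illustration, not the invention's actual implementation: the period/whitespace tokenizer, the `<RARE>` marker name, and the frequency threshold are hypothetical placeholders.

```python
from collections import Counter

RARE_CHAR = "<RARE>"  # stand-in for the "preset character" (hypothetical name)

def preprocess(text, ref_counts, rare_threshold=3):
    """Steps A-1..A-2: split/segment, then build the second input text
    by replacing rare words with the preset character."""
    # Step A-1: naive sentence split + whitespace segmentation (placeholder
    # for a real Chinese sentence splitter and word segmenter)
    sentences = [s.split() for s in text.split(".") if s.strip()]
    first_input = [w for s in sentences for w in s]
    # Step A-2: a word counts as rare if it occurs fewer than
    # `rare_threshold` times in the reference corpus
    second_input = [w if ref_counts[w] >= rare_threshold else RARE_CHAR
                    for w in first_input]
    return first_input, second_input

ref = Counter({"the": 100, "cat": 50, "sat": 40, "on": 90, "mat": 2})
v1_words, v2_words = preprocess("the cat sat on the mat", ref)
print(v1_words)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(v2_words)  # ['the', 'cat', 'sat', 'on', 'the', '<RARE>']
```

Step A-3 would then map both word lists through the same embedding table to obtain V1 and V2.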
The invention also includes a rare-word part-of-speech feature separation method, comprising:
Step A-1: split the text to be separated into sentences and segment it into words, forming the first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the convolutional neural network (CNN) model; the CNN model outputs the word feature vector V1';
Step A-5: feed V2 into the bidirectional gated recurrent unit (BGRU) model; the BGRU model outputs the word feature vector V2';
Step B: concatenate V1, V1' and V2' to obtain V3; the vector units of V3 that contain the preset character are rare-word feature vector units, and the vector units that do not contain the preset character are normal-word feature vector units.
The invention also includes a training method for a part-of-speech tagging model, where the tagging model comprises a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model and a conditional random field (CRF) model.
The training method comprises:
Step C-1: convert the sample data of the training corpus into the first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the CNN model; the CNN model outputs the word feature vector V1';
Step A-5: feed V2 into the BGRU model; the BGRU model outputs the word feature vector V2';
Step A-6: concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the output of the BLSTM model into the CRF model, and the CRF model outputs the part-of-speech label of every word of the training sample;
Step C-2: compute the error between the part-of-speech labels output by the CRF model and the part-of-speech labels of the training sample, and update the CNN, BGRU, BLSTM and CRF models according to the error.
The invention also includes a part-of-speech tagging system, comprising:
a text preprocessing module, which splits the text to be tagged into sentences and segments it into words, forming the first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with a preset character to form the second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs the word feature vector V1';
a BGRU model, which takes V2 as input and outputs the word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a BLSTM model, which takes V3 as input and feeds its output into the CRF model;
a CRF model, which outputs the part-of-speech label of every word of the text to be tagged.
The invention also includes a rare-word part-of-speech feature separation system, comprising:
a text preprocessing module, which splits the text to be separated into sentences and segments it into words, forming the first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with a preset character to form the second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs the word feature vector V1';
a BGRU model, which takes V2 as input and outputs the word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a feature separation module: the vector units of V3 that contain the preset character are rare-word feature vector units, and the vector units that do not contain the preset character are normal-word feature vector units.
The invention also includes a training system for a part-of-speech tagging model, comprising:
a text conversion module, which converts the sample data of the training corpus into the first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with a preset character to form the second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs the word feature vector V1';
a BGRU model, which takes V2 as input and outputs the word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a BLSTM model, which takes V3 as input and feeds its output into the CRF model;
a CRF model, which outputs the part-of-speech label of every word of the training sample;
an update module, which computes the error between the part-of-speech labels output by the CRF model and the part-of-speech labels of the training sample, and updates the CNN, BGRU, BLSTM and CRF models according to the error.
The part-of-speech tagging method of the invention adds a BGRU model on top of the CNN+BLSTM+CRF model. Compared with the CNN-only arrangement mentioned in the background, the added BGRU improves the extraction accuracy of the part-of-speech features of normal words, and the input of BLSTM+CRF now contains the outputs of both CNN and BGRU. Because the BGRU output carries the marker feature of rare words (the preset character), BLSTM+CRF can separate rare words from normal words, which further improves the learning and recognition of both rare words and normal words.
Brief description of the drawings
Fig. 1 is the structure of a single LSTM network;
Fig. 2 is the structure of a GRU network;
Fig. 3 is the computation framework of the GRU neuron state;
Fig. 4 is the flow chart of the part-of-speech tagging method of the present invention;
Fig. 5 is the neural network of the CNN+BLSTM+CRF model;
Fig. 6 is the neural network of the CNN+BGRU+BLSTM+CRF model of the present invention;
Fig. 7 is the flow chart of the rare-word part-of-speech feature separation method of the present invention;
Fig. 8 is the flow chart of the training method of the part-of-speech tagging model of the present invention;
Fig. 9 is the structure of the part-of-speech tagging system of the present invention;
Fig. 10 is the structure of the rare-word part-of-speech feature separation system of the present invention;
Fig. 11 is the structure of the training system of the part-of-speech tagging model of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
It should be noted that the terms "first", "second", etc. in the specification, claims and drawings are used to distinguish similar objects and do not describe a specific order or precedence. It should be understood that data so designated are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein.
In the field of part-of-speech tagging, an artificial neural network maps a text input to a part-of-speech recognition result. The network learns a mapping between input patterns and output patterns and outputs a learning result that represents that mapping. Based on the learning result, the network generates outputs for input patterns it is given.
The part-of-speech tagging method of the invention comprises a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model and a conditional random field (CRF) model. The four models are introduced below.
The CNN model is usually used for feature extraction; its conventional parts mainly include the input layer, convolutional layers, pooling layers and the output layer.
The input layer can take raw data or feature maps. A convolutional layer contains learnable convolution kernels and an activation function: the input is convolved with the kernels, the convolution results are passed through the activation function, and a feature map is output, so this layer is also the feature extraction layer. A pooling layer divides the input signal into non-overlapping regions and applies a pooling operation to each region; the common pooling operations are max pooling and mean pooling, and they help eliminate offset and distortion in the signal. A CNN model usually adopts a deep structure of alternating convolutional and pooling layers. The fully connected layer of the CNN combines the feature groups produced by the stacked convolution and pooling operations into one signal and derives a label probability distribution from the input, thereby extracting the internal information of words and phrases and generating a character-based representation of each word.
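The non-overlapping pooling step described above can be illustrated with a small NumPy sketch; the region size of 2 and the sample signal are arbitrary choices for demonstration, not parameters of the patent:

```python
import numpy as np

def max_pool_1d(signal, region=2):
    """Max pooling over non-overlapping regions of a 1-D signal, as
    described for the CNN pooling layer (any incomplete trailing
    region is truncated)."""
    n = len(signal) // region
    return signal[:n * region].reshape(n, region).max(axis=1)

x = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])
print(max_pool_1d(x))  # [3. 5. 4.]
```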
The bidirectional long short-term memory (BLSTM) model differs from an LSTM network in that it has two parallel LSTM layers running in opposite directions; the two layers have the same structure and differ only in the order in which they read the text. The structure of a single LSTM network is shown in Fig. 1.
The memory cell of the BLSTM model mainly contains three kinds of gates. The sigmoid input gate decides whether an input value may be added to the current state. The state cell has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate.
The update formulas are summarized as:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c_t = f_t * c_{t-1} + i_t * tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
h_t = o_t * tanh(c_t)
where σ denotes the sigmoid activation function, x_t is the input vector at time t, h_t is the hidden state, U_i, U_f, U_c, U_o are the weight matrices applied to x_t, W_i, W_f, W_c, W_o are the weight matrices applied to h_{t-1}, b_i, b_f, b_c, b_o are the biases of the gates, and i_t, f_t, c_t, o_t denote the input gate, forget gate, memory cell and output gate respectively.
The output of the BLSTM concatenates the forward and backward hidden states:
y_t = [hf_t, hb_t]
The final fully connected layer of the BLSTM model is the output layer.
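The update formulas above can be checked numerically. The following is a sketch of one LSTM time step, assuming arbitrary dimensions and random weights; it illustrates the equations only and is not the patent's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following the update formulas above.
    W, U, b are dicts keyed by gate name ('i', 'f', 'c', 'o')."""
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ h_prev + U['c'] @ x_t + b['c'])
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d = 4  # hidden size (illustrative)
W = {g: rng.normal(size=(d, d)) for g in 'ifco'}
U = {g: rng.normal(size=(d, d)) for g in 'ifco'}
b = {g: np.zeros(d) for g in 'ifco'}
h, c = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d), W, U, b)
print(h.shape)  # (4,)
```

A BLSTM would run this recurrence once left-to-right and once right-to-left and concatenate the two hidden states per position, as in the output formula above.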
The CRF serves as the output layer for the part-of-speech labels. Let x = {x_1, ..., x_n} denote the input sequence, where x_i is the vector of the i-th word, and let y = {y_1, ..., y_n} denote the output sequence of part-of-speech labels, with Y(x) the set of label sequences of x. The CRF defines a conditional probability p(y | z; W, b), where ψ is the potential function and W and b are the weight and bias vectors. When the model is trained, the CRF network is optimized by minimizing its negative log-likelihood.
The negative log-likelihood of the CRF is used as the loss function of the model:
L_CRF(W, b) = -Σ_i log p(y | z; W, b), summed over the training samples i.
The likelihood function expresses the probability of observing the data under different parameter vectors. Under a Gaussian-noise assumption, minimizing the negative log-likelihood is equivalent to minimizing the sum-of-squares error, i.e. minimizing the difference between the model predictions and the actual values.
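To make the negative log-likelihood concrete, the following toy linear-chain CRF computes it by brute-force enumeration of all label sequences. The emission and transition scores are made up for illustration, and enumeration is only feasible at toy sizes; a real implementation would use the forward algorithm instead:

```python
import itertools
import numpy as np

def crf_neg_log_likelihood(emit, trans, labels):
    """Negative log-likelihood of one label sequence under a tiny
    linear-chain CRF. `emit` is a (T, K) matrix of per-position
    emission scores, `trans` a (K, K) matrix of transition scores;
    the partition function is computed by enumerating all K**T
    label sequences."""
    T, K = emit.shape

    def score(seq):
        s = sum(emit[t, y] for t, y in enumerate(seq))
        s += sum(trans[a, b] for a, b in zip(seq, seq[1:]))
        return s

    log_z = np.log(sum(np.exp(score(seq))
                       for seq in itertools.product(range(K), repeat=T)))
    return log_z - score(labels)

emit = np.array([[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]])  # toy scores
trans = np.array([[0.3, -0.1], [0.0, 0.4]])
nll = crf_neg_log_likelihood(emit, trans, (0, 1, 1))
print(nll > 0)  # True
```

Since the probabilities of all label sequences sum to one, the negative log-likelihood of any single sequence is strictly positive here.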
The BGRU (bidirectional gated recurrent unit) model is an improvement of the LSTM model. As shown in Fig. 2, r and z denote the reset and update gate mechanisms of the GRU model. Thanks to this gating, the BGRU model has fewer parameters: it simplifies the model while preserving its effectiveness. The present invention extracts features with the BGRU so as to reduce the total number of model parameters as much as possible while still extracting features effectively.
The computation framework of the GRU neuron state is shown in Fig. 3. At time t, the state of the GRU is computed by the following equations:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where z_t and r_t are the activations of the update and reset gates respectively, ⊙ denotes element-wise multiplication, σ denotes the sigmoid function, and W_z, W_r, W are the shared parameters of the GRU model.
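A numerical sketch of one GRU time step following the equations above; the sizes and random weights are arbitrary, and only the gating arithmetic is taken from the formulas:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU time step: the weight matrices act on the
    concatenation [h_{t-1}, x_t], as in the equations above."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                                   # update gate
    r_t = sigmoid(Wr @ hx)                                   # reset gate
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(1)
d_h, d_x = 3, 2  # illustrative sizes
Wz = rng.normal(size=(d_h, d_h + d_x))
Wr = rng.normal(size=(d_h, d_h + d_x))
W = rng.normal(size=(d_h, d_h + d_x))
h = gru_step(rng.normal(size=d_x), np.zeros(d_h), Wz, Wr, W)
print(h.shape)  # (3,)
```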
Suppose a sentence S_i has T words w_1, ..., w_T. Regarding S_i as a sequence whose components are its words, a forward GRU and a backward GRU each produce a representation of the sentence, and combining the forward and backward hidden states yields the semantic representation of S_i.
The part-of-speech tagging method proposed by the invention adds a BGRU model on top of the CNN+BLSTM+CRF model. The algorithm, shown in Fig. 4, comprises the following steps:
Step A-1 (S101): split the text to be tagged into sentences and segment it into words, forming the first input text;
Step A-2 (S102): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3 (S103): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S104): feed V1 into the CNN model; the CNN model outputs the word feature vector V1';
Step A-5 (S105): feed V2 into the BGRU model; the BGRU model outputs the word feature vector V2';
Step A-6 (S106): concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the output of the BLSTM model into the CRF model, and the CRF model outputs the part-of-speech label of every word of the text to be tagged.
Compared with the usual approach of simply scanning the word vectors with a CNN, the part-of-speech tagging method of the invention, built on the CNN+BLSTM+CRF model, adds a BGRU network specifically for the rare-word problem in part-of-speech tagging: the BGRU processes the preprocessed word vectors of normal words and obtains the part-of-speech features of normal words. Fig. 5 shows the neural network of the prior-art CNN+BLSTM+CRF model, and Fig. 6 shows the neural network of the CNN+BGRU+BLSTM+CRF model of the present invention.
The BGRU has a forward GRU and a backward GRU in its hidden layer: the forward GRU captures the text information in reading order while the backward GRU captures it in reverse, so the network captures more feature information than a unidirectional one. At the same time, because the rare words are removed from the BGRU input, extracting features with the BGRU weakens the discontinuity caused by that removal and extracts the part-of-speech features of normal words as fully as possible.
The word features from the CNN and the text word vectors are then concatenated and used as the input of the BLSTM+CRF network. Compared with the CNN-only arrangement mentioned in the background, the added BGRU improves the extraction accuracy of the part-of-speech features of normal words, and the input of BLSTM+CRF contains the outputs of both CNN and BGRU. Because the BGRU output carries the marker feature of rare words (the preset character), BLSTM+CRF can separate rare words from normal words, which further improves the learning and recognition of both rare words and normal words.
Further, in the part-of-speech tagging method of the invention, the preset character in step A-2 can be a null character, 0, NaN, any legal vector character defined by the programming language, or a pre-defined character. The preset character allows BLSTM+CRF to distinguish rare words from normal words.
In step A-3 of the method, a text vectorization tool such as Word2Vec can be used to convert the first input text into the word vectors V1 and the second input text into the word vectors V2. Word2Vec is one of the common word embedding algorithms, and the embeddings it produces effectively express the relations between words.
In the method for the invention, CNN model optimization uses the operation of maximum value pond, and the extraction part of speech of maximum possible is special
Sign.
In method and step A-2 of the invention, the decision condition of rare word are as follows: in reference corpus, frequency of occurrence is lower than
Preset value.Preset value is rule of thumb set, the number within generally 1-6.
It can be PFR corpus with reference to corpus.
PFR corpus is to have carried out word segmentation and part-of-speech tagging to the plain text corpus in People's Daily's first half of the year in 1998
It is made, in strict accordance with the date of People's Daily, version sequence, article sequential organization.Each word in article has word
Property label.Have in current label sets 26 basic word mark (noun n, time word t, place word s, noun of locality f, number m,
Quantifier q, it distinction word b, pronoun r, verb v, adjective a, descriptive word z, adverbial word d, preposition p, conjunction c, auxiliary word u, modal particle y, sighs
Word e, onomatopoeia o, Chinese idiom i, idiom l, abbreviation j, enclitics h, ingredient k, morpheme g, non-morpheme word x, punctuate symbol are followed by
Number w) outside, the angle applied from corpus, increases proper noun (name nr, place name ns, organization names nt, other proprietary names
Word nz);Some labels are also increased from linguistics angle, have used more than 40 label in total.
Optionally, in method and step A-2 of the invention, the decision condition of rare word are as follows: do not appear in normal word dictionary
In word be rare word.Normal word dictionary includes all normal words.
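Both decision conditions can be sketched directly; the threshold value and the example corpus are illustrative only:

```python
from collections import Counter

def is_rare_by_frequency(word, ref_counts, preset_value=3):
    """Frequency condition: rare if the word occurs fewer than
    `preset_value` times in the reference corpus (the patent suggests
    a value within 1-6; 3 here is an arbitrary choice)."""
    return ref_counts[word] < preset_value

def is_rare_by_dictionary(word, normal_dict):
    """Dictionary condition: rare if the word is absent from the
    normal-word dictionary."""
    return word not in normal_dict

ref = Counter("the cat sat on the mat".split())
print(is_rare_by_frequency("the", ref, preset_value=2))  # False
print(is_rare_by_frequency("mat", ref, preset_value=2))  # True
print(is_rare_by_dictionary("mat", {"the", "cat", "sat", "on"}))  # True
```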
The invention also includes a rare-word part-of-speech feature separation method, which, as shown in Fig. 7, comprises the following steps:
Step A-1 (S201): split the text to be separated into sentences and segment it into words, forming the first input text;
Step A-2 (S202): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3 (S203): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S204): feed V1 into the CNN model; the CNN model outputs the word feature vector V1';
Step A-5 (S205): feed V2 into the BGRU model; the BGRU model outputs the word feature vector V2';
Step B (S206): concatenate V1, V1' and V2' to obtain V3; the vector units of V3 that contain the preset character are rare-word feature vector units, and the vector units that do not contain the preset character are normal-word feature vector units.
Preferably, in Fig. 7, the CNN model uses the max pooling operation.
Preferably, in Fig. 7, the decision condition for a rare word is: its number of occurrences in the reference corpus is below a preset value.
The V3, rare-word feature vector units and normal-word feature vector units obtained in step B of Fig. 7 can be used in other neural network models to improve those models' part-of-speech tagging accuracy for normal words and rare words.
The invention also includes a training method for a part-of-speech tagging model, where the tagging model comprises a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model and a conditional random field (CRF) model.
As shown in Fig. 8, the training method comprises:
Step C-1 (S301): convert the sample data of the training corpus into the first input text;
Step A-2 (S302): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with a preset character to form the second input text; if not, let the second input text equal the first input text;
Step A-3 (S303): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S304): feed V1 into the CNN model; the CNN model outputs the word feature vector V1';
Step A-5 (S305): feed V2 into the BGRU model; the BGRU model outputs the word feature vector V2';
Step A-6 (S306): concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the output of the BLSTM model into the CRF model, and the CRF model outputs the part-of-speech label of every word of the training sample;
Step C-2 (S307): compute the error between the part-of-speech labels output by the CRF model and the part-of-speech labels of the training sample, and update the CNN, BGRU, BLSTM and CRF models according to the error.
The training corpus in Fig. 8 is preferably the PFR corpus.
In S307 (step C-2) of Fig. 8, the Adam algorithm can be used to control the update of the CNN, BGRU, BLSTM and CRF models.
Adam stands for "adaptive moment estimation". In probability theory, a "moment" means the following: if a random variable X follows some distribution, the first moment of X is E(X), i.e. the sample mean, and the second moment of X is E(X²), i.e. the mean of the squared samples. Adam dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient of the loss function with respect to that parameter. Adam is based on gradient descent, but the learning step of each iteration stays within a definite range, so a very large gradient does not cause a very large step and the parameter values remain stable. Preferably, the present invention applies an exponential learning-rate decay every 3000 steps with a decay base of 0.1, leaving the remaining parameters at their default settings.
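The Adam update and the every-3000-steps decay described above can be sketched as follows. The beta and epsilon values are Adam's customary defaults rather than values from the patent, and the toy objective p² is purely illustrative:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first- and second-moment
    estimates of the gradient scale the per-parameter step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def decayed_lr(step, base_lr=0.01, decay=0.1, every=3000):
    """Exponential decay every 3000 steps with base 0.1, as described."""
    return base_lr * decay ** (step // every)

# Minimize the toy objective p**2 (gradient 2p) for a few steps
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    p, m, v = adam_step(p, 2.0 * p, m, v, t, decayed_lr(t))
print(p < 1.0)  # True
```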
When training the model, the hyperparameters of the CNN and BLSTM models also need to be set. For the CNN model, the hyperparameters include the filter window size, preferably set to 2*50, and the number of filters, preferably set to 50. For the BLSTM model, the hidden-unit hyperparameters include the number of units and the number of layers; preferably the number of units is set to 100 and the number of layers to 2.
In addition, the present invention uses the dropout technique to reduce overfitting, preferably with a dropout rate of 0.5.
By setting the above parameters, such as the learning rate and the CNN window size, the present invention makes the loss function converge and gives the network optimal performance under this architecture.
For example, the learning parameters of the invention can be set as follows: a batch size of 64 to accelerate convergence, an initial learning rate of 0.01, a decay rate of 0.1, and 20000 training iterations in total, yielding a part-of-speech tagging (CNN+BGRU+BLSTM+CRF) model with good learning performance.
The embodiment of Fig. 9 of the present invention gives a part-of-speech tagging system comprising a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model and a conditional random field (CRF) model. The system further includes:
a text preprocessing module, which splits the text to be tagged into sentences and segments it into words, forming the first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with a preset character to form the second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs the word feature vector V1';
a BGRU model, which takes V2 as input and outputs the word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a BLSTM model, which takes V3 as input and feeds its output into the CRF model;
a CRF model, which outputs the part-of-speech label of every word of the text to be tagged.
The CNN model in Fig. 9 can use the max pooling operation.
In Fig. 9, the decision condition for a rare word can be set as: its number of occurrences in the reference corpus is below a preset value.
The part-of-speech tagging system of the invention adds a BGRU model on top of the CNN+BLSTM+CRF model. Compared with the CNN-only arrangement mentioned in the background, the added BGRU improves the extraction accuracy of the part-of-speech features of normal words, and the input of BLSTM+CRF contains the outputs of both CNN and BGRU. Because the BGRU output carries the marker feature of rare words (the preset character), BLSTM+CRF can separate rare words from normal words, which further improves the learning and recognition of both rare words and normal words.
The embodiment of Figure 10 of the present invention gives a kind of rare word part of speech feature separation system, includes convolutional neural networks
CNN model, bidirectional gate cycling element BGRU model, the system further include:
Text preprocessing module: performs sentence segmentation and word segmentation on the text to be separated to form a first input text;
Rare-word processing module: detects whether the first input text contains a rare word; if so, replaces the rare word of the first input text with a preset character to form a second input text; if not, lets the second input text equal the first input text;
Word vector generation module: converts the first input text into a word vector V1, and converts the second input text into a word vector V2;
CNN model: V1 is input into the CNN model, and the CNN model outputs a word feature vector V1';
BGRU model: V2 is input into the BGRU model, and the BGRU model outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1' and V2' to obtain V3;
Feature separation module: vector units in V3 that contain the preset character are rare-word feature vector units, and vector units in V3 that do not contain the preset character are normal-word feature vector units.
The CNN model in Fig. 10 may use a max-pooling operation.
In Fig. 10, the criterion for a rare word may be configured as: its number of occurrences in a reference corpus is lower than a preset value.
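The feature separation module of Fig. 10 can be sketched by splitting the units of V3 according to which positions of the second input text hold the preset character. This is an illustrative sketch under the assumption that the second input text is available alongside V3; the token `<RARE>` is a hypothetical preset character.

```python
def separate_features(v3_units, second_input, preset="<RARE>"):
    """Split V3 into rare-word and normal-word feature vector units,
    using the preset-character positions of the second input text."""
    rare, normal = [], []
    for unit, word in zip(v3_units, second_input):
        (rare if word == preset else normal).append(unit)
    return rare, normal

v3_units = [[0.1], [0.2], [0.3]]
second_input = ["the", "<RARE>", "tagging"]
rare, normal = separate_features(v3_units, second_input)
print(rare)    # -> [[0.2]]
print(normal)  # -> [[0.1], [0.3]]
```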
The embodiment of Fig. 11 of the present invention provides a training system for a part-of-speech tagging model. The part-of-speech tagging model includes a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model and a conditional random field (CRF) model;
the training system includes:
Text conversion module: converts the sample data of a training corpus into a first input text;
Rare-word processing module: detects whether the first input text contains a rare word; if so, replaces the rare word of the first input text with a preset character to form a second input text; if not, lets the second input text equal the first input text;
Word vector generation module: converts the first input text into a word vector V1, and converts the second input text into a word vector V2;
CNN model: V1 is input into the CNN model, and the CNN model outputs a word feature vector V1';
BGRU model: V2 is input into the BGRU model, and the BGRU model outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1' and V2' to obtain V3;
BLSTM model: V3 is input into the BLSTM model, and the output result of the BLSTM model is input into the CRF model;
CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the training corpus sample data;
Update module: calculates the error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training corpus sample data, and updates the CNN, BGRU, BLSTM and CRF models according to the error.
In Fig. 11, when updating the CNN, BGRU, BLSTM and CRF models, the Adam algorithm may also be used to control the model update process.
The training corpus in Fig. 11 is preferably the PFR corpus.
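The Adam-controlled update process mentioned for Fig. 11 follows the standard Adam rule. The sketch below is a plain numpy version of a single parameter update, not the patent's implementation; the hyperparameters are set to common defaults as an assumption.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a parameter tensor; returns new param, m, v."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
print(w)  # slightly below 1.0 after one step with a positive gradient
```

In the training system, each parameter tensor of the CNN, BGRU, BLSTM and CRF models would carry its own `m` and `v` state across steps.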
It should be noted that the embodiments of the part-of-speech tagging system of the present invention follow the same principles as the embodiments of the part-of-speech tagging method of the present invention, and related parts may refer to each other.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the technical solution of the present invention shall fall within the protection scope of the present invention.
Claims (18)
1. A part-of-speech tagging method, characterized by using a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model and a conditional random field (CRF) model, the method comprising:
Step A-1: performing sentence segmentation and word segmentation on a text to be tagged to form a first input text;
Step A-2: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
Step A-3: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
Step A-4: inputting the V1 into the CNN model, the CNN model outputting a word feature vector V1';
Step A-5: inputting the V2 into the BGRU model, the BGRU model outputting a word feature vector V2';
Step A-6: concatenating the V1, V1' and V2' to obtain V3, inputting the V3 into the BLSTM model, and inputting an output result of the BLSTM model into the CRF model, the CRF model outputting the part-of-speech tags of all segmented words of the text to be tagged.
2. The method according to claim 1, characterized in that the CNN model uses a max-pooling operation.
3. The method according to claim 1, characterized in that the criterion for the rare word is: its number of occurrences in a reference corpus is lower than a preset value.
4. A rare-word part-of-speech feature separation method, characterized in that the method comprises:
Step A-1: performing sentence segmentation and word segmentation on a text to be separated to form a first input text;
Step A-2: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
Step A-3: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
Step A-4: inputting the V1 into a convolutional neural network (CNN) model, the CNN model outputting a word feature vector V1';
Step A-5: inputting the V2 into a bidirectional gated recurrent unit (BGRU) model, the BGRU model outputting a word feature vector V2';
Step B: concatenating the V1, V1' and V2' to obtain V3, wherein vector units in the V3 that contain the preset character are rare-word feature vector units, and vector units in the V3 that do not contain the preset character are normal-word feature vector units.
5. The method according to claim 4, characterized in that the CNN model uses a max-pooling operation.
6. The method according to claim 4, characterized in that the criterion for the rare word is: its number of occurrences in a reference corpus is lower than a preset value.
7. A training method for a part-of-speech tagging model, characterized in that the part-of-speech tagging model includes a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model and a conditional random field (CRF) model;
the method comprising:
Step C-1: converting sample data of a training corpus into a first input text;
Step A-2: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
Step A-3: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
Step A-4: inputting the V1 into the CNN model, the CNN model outputting a word feature vector V1';
Step A-5: inputting the V2 into the BGRU model, the BGRU model outputting a word feature vector V2';
Step A-6: concatenating the V1, V1' and V2' to obtain V3, inputting the V3 into the BLSTM model, and inputting an output result of the BLSTM model into the CRF model, the CRF model outputting the part-of-speech tags of all segmented words of the training corpus sample data;
Step C-2: calculating an error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training corpus sample data, and updating the CNN, BGRU, BLSTM and CRF models according to the error.
8. The method according to claim 7, characterized in that, when updating the CNN, BGRU, BLSTM and CRF models, an Adam algorithm is used to control the update process of the models.
9. The method according to claim 7, characterized in that the training corpus is the PFR corpus.
10. A part-of-speech tagging system, characterized in that the system comprises:
a text preprocessing module: performing sentence segmentation and word segmentation on a text to be tagged to form a first input text;
a rare-word processing module: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
a word vector generation module: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
a CNN model: the V1 is input into the convolutional neural network (CNN) model, and the CNN model outputs a word feature vector V1';
a BGRU model: the V2 is input into the bidirectional gated recurrent unit (BGRU) model, and the BGRU model outputs a word feature vector V2';
a vector concatenation module: concatenating the V1, V1' and V2' to obtain V3;
a BLSTM model: the V3 is input into the bidirectional long short-term memory (BLSTM) network model, and an output result of the BLSTM model is input into a conditional random field (CRF) model; and
a CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the text to be tagged.
11. The system according to claim 10, characterized in that the CNN model uses a max-pooling operation.
12. The system according to claim 10, characterized in that the criterion for the rare word is: its number of occurrences in a reference corpus is lower than a preset value.
13. A rare-word part-of-speech feature separation system, characterized in that the system comprises:
a text preprocessing module: performing sentence segmentation and word segmentation on a text to be separated to form a first input text;
a rare-word processing module: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
a word vector generation module: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
a CNN model: the V1 is input into the convolutional neural network (CNN) model, and the CNN model outputs a word feature vector V1';
a BGRU model: the V2 is input into the bidirectional gated recurrent unit (BGRU) model, and the BGRU model outputs a word feature vector V2';
a vector concatenation module: concatenating the V1, V1' and V2' to obtain V3; and
a feature separation module: vector units in the V3 that contain the preset character are rare-word feature vector units, and vector units in the V3 that do not contain the preset character are normal-word feature vector units.
14. The system according to claim 13, characterized in that the CNN model uses a max-pooling operation.
15. The system according to claim 13, characterized in that the criterion for the rare word is: its number of occurrences in a reference corpus is lower than a preset value.
16. A training system for a part-of-speech tagging model, characterized in that the part-of-speech tagging model includes a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model and a conditional random field (CRF) model;
the system comprising:
a text conversion module: converting sample data of a training corpus into a first input text;
a rare-word processing module: detecting whether the first input text contains a rare word; if so, replacing the rare word of the first input text with a preset character to form a second input text; if not, letting the second input text equal the first input text;
a word vector generation module: converting the first input text into a word vector V1, and converting the second input text into a word vector V2;
a CNN model: the V1 is input into the CNN model, and the CNN model outputs a word feature vector V1';
a BGRU model: the V2 is input into the BGRU model, and the BGRU model outputs a word feature vector V2';
a vector concatenation module: concatenating the V1, V1' and V2' to obtain V3;
a BLSTM model: the V3 is input into the BLSTM model, and an output result of the BLSTM model is input into the CRF model;
a CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the training corpus sample data; and
an update module: calculating an error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training corpus sample data, and updating the CNN, BGRU, BLSTM and CRF models according to the error.
17. The system according to claim 16, characterized in that, when updating the CNN, BGRU, BLSTM and CRF models, an Adam algorithm is used to control the update process of the models.
18. The system according to claim 16, characterized in that the training corpus is the PFR corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711095902.8A CN109766523A (en) | 2017-11-09 | 2017-11-09 | Part-of-speech tagging method and labeling system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109766523A true CN109766523A (en) | 2019-05-17 |
Family
ID=66449760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711095902.8A Pending CN109766523A (en) | 2017-11-09 | 2017-11-09 | Part-of-speech tagging method and labeling system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109766523A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010250814A (en) * | 2009-04-14 | 2010-11-04 | Nec (China) Co Ltd | Part-of-speech tagging system, training device and method of part-of-speech tagging model |
CN106960003A (en) * | 2017-02-15 | 2017-07-18 | Heilongjiang Institute of Technology | Query generation method for machine-learning-based source retrieval in plagiarism detection |
CN107291795A (en) * | 2017-05-03 | 2017-10-24 | South China University of Technology | Text classification method combining dynamic word embedding and part-of-speech tagging |
Non-Patent Citations (1)
Title |
---|
HU, Jie et al.: "Chinese Word Segmentation Model Based on Bidirectional Recurrent Networks", Journal of Chinese Computer Systems (《小型微型计算机系统》) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415683A (en) * | 2019-07-10 | 2019-11-05 | 上海麦图信息科技有限公司 | A kind of air control voice instruction recognition method based on deep learning |
CN110377691A (en) * | 2019-07-23 | 2019-10-25 | 上海应用技术大学 | Method, apparatus, equipment and the storage medium of text classification |
CN111444723A (en) * | 2020-03-06 | 2020-07-24 | 深圳追一科技有限公司 | Information extraction model training method and device, computer equipment and storage medium |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
JP2021149916A (en) * | 2020-03-19 | 2021-09-27 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | Method for establishing label labeling model, device, electronic equipment, program, and readable storage medium |
KR20210118360A (en) * | 2020-03-19 | 2021-09-30 | 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. | Method, apparatus, electronic device, program and readable storage medium for creating a label marking model |
JP7098853B2 (en) | 2020-03-19 | 2022-07-12 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Methods for establishing label labeling models, devices, electronics, programs and readable storage media |
US11531813B2 (en) | 2020-03-19 | 2022-12-20 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, electronic device and readable storage medium for creating a label marking model |
KR102645185B1 (en) * | 2020-03-19 | 2024-03-06 | 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. | Method, apparatus, electronic device, program and readable storage medium for creating a label marking model |
CN112183086A (en) * | 2020-09-23 | 2021-01-05 | 北京先声智能科技有限公司 | English pronunciation continuous reading mark model based on sense group labeling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108628823B (en) | Named entity recognition method combining attention mechanism and multi-task collaborative training | |
CN107992597B (en) | Text structuring method for power grid fault case | |
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN109766523A (en) | Part-of-speech tagging method and labeling system | |
WO2018028077A1 (en) | Deep learning based method and device for chinese semantics analysis | |
CN110083831A (en) | A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF | |
CN110555084B (en) | Remote supervision relation classification method based on PCNN and multi-layer attention | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN107797987B (en) | Bi-LSTM-CNN-based mixed corpus named entity identification method | |
CN112199945A (en) | Text error correction method and device | |
CN112818118B (en) | Reverse translation-based Chinese humor classification model construction method | |
CN113268974B (en) | Method, device and equipment for marking pronunciations of polyphones and storage medium | |
CN110134950B (en) | Automatic text proofreading method combining words | |
CN110826334A (en) | Chinese named entity recognition model based on reinforcement learning and training method thereof | |
CN113220876B (en) | Multi-label classification method and system for English text | |
CN107977353A (en) | A kind of mixing language material name entity recognition method based on LSTM-CNN | |
CN109492223A (en) | A kind of Chinese missing pronoun complementing method based on ANN Reasoning | |
Yang et al. | Recurrent neural network-based language models with variation in net topology, language, and granularity | |
CN107797988A (en) | A kind of mixing language material name entity recognition method based on Bi LSTM | |
CN115600597A (en) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium | |
CN107992468A (en) | A kind of mixing language material name entity recognition method based on LSTM | |
Chan et al. | Applying and optimizing NLP model with CARU | |
CN114357166B (en) | Text classification method based on deep learning | |
CN114386425B (en) | Big data system establishing method for processing natural language text content | |
CN115659981A (en) | Named entity recognition method based on neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190517 ||