CN109766523A - Part-of-speech tagging method and labeling system - Google Patents


Info

Publication number
CN109766523A
CN109766523A (application CN201711095902.8A)
Authority
CN
China
Prior art keywords
model
input text
word
bgru
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711095902.8A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Potevio Information Technology Co Ltd
Putian Information Technology Co Ltd
Original Assignee
Putian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian Information Technology Co Ltd filed Critical Putian Information Technology Co Ltd
Priority to CN201711095902.8A priority Critical patent/CN109766523A/en
Publication of CN109766523A publication Critical patent/CN109766523A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a part-of-speech tagging method and labeling system. The method comprises: step A-1: splitting the text to be tagged into sentences and segmenting it into words to form a first input text; step A-2: detecting whether the first input text contains rare words; if so, replacing the rare words of the first input text with preset characters to form a second input text; if not, letting the second input text equal the first input text; step A-3: converting the first input text into word vectors V1 and the second input text into word vectors V2; step A-4: feeding V1 into a CNN model, which outputs word feature vector V1'; step A-5: feeding V2 into a BGRU model, which outputs word feature vector V2'; step A-6: concatenating V1, V1' and V2' to obtain V3, feeding V3 into a BLSTM model, feeding the BLSTM output into a CRF model, and having the CRF model output the part-of-speech label of every word in the text to be tagged. The part-of-speech tagging method of the present invention improves tagging accuracy for both normal words and rare words.

Description

Part-of-speech tagging method and labeling system
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a part-of-speech tagging method and labeling system.
Background technique
Part-of-speech tagging (POS tagging) determines and marks the part of speech of each word in a given sentence sequence. It is a cornerstone of deeper natural language processing and provides the foundation for higher-level tasks such as machine translation, speech recognition, and information retrieval.
As neural network techniques have developed, new models have been proposed continually, and the introduction of neural networks has further improved the accuracy of part-of-speech tagging. Notably, Yoav Goldberg made research progress on tagging rare words and out-of-vocabulary words with a BLSTM (bidirectional long short-term memory) model. Today, the model most widely used in the part-of-speech tagging field is the CNN (convolutional neural network) + BLSTM + CRF (conditional random field) model.
However, the CNN+BLSTM+CRF model has lower tagging accuracy for rare words and out-of-vocabulary words, where a rare word is a word that appears infrequently in the corpus.
The CNN+BLSTM+CRF model reads the features of normal words and rare words together without distinguishing them, yet the parts of speech of rare words tend to cluster in a few categories such as nouns. This hurts the part-of-speech tagging accuracy of both rare words and normal words.
Summary of the invention
The present invention provides a part-of-speech tagging method and labeling system that improve tagging accuracy for both normal words and rare words.
The part-of-speech tagging method of the present invention uses a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model, and a conditional random field (CRF) model, and comprises the following steps:
Step A-1: split the text to be tagged into sentences and segment it into words, forming a first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the CNN model; the CNN model outputs word feature vector V1';
Step A-5: feed V2 into the BGRU model; the BGRU model outputs word feature vector V2';
Step A-6: concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the BLSTM output into the CRF model, and have the CRF model output the part-of-speech label of every word in the text to be tagged.
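The data flow of steps A-1 through A-6 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `embed`, `cnn`, `bgru`, and `blstm_crf` callables are hypothetical placeholders that a real system would back with trained networks, and `RARE_MARK` is one possible choice of preset character.

```python
RARE_MARK = "<RARE>"  # illustrative "preset character" used to mask rare words

def is_rare(word, vocab_counts, threshold=3):
    # Step A-2 criterion: fewer than `threshold` occurrences in the corpus.
    return vocab_counts.get(word, 0) < threshold

def tag(words, vocab_counts, embed, cnn, bgru, blstm_crf, threshold=3):
    # `words` is the first input text (already split and segmented, step A-1).
    second = [RARE_MARK if is_rare(w, vocab_counts, threshold) else w
              for w in words]                      # step A-2: second input text
    v1 = [embed(w) for w in words]                 # step A-3: word vectors V1
    v2 = [embed(w) for w in second]                # step A-3: word vectors V2
    v1p = cnn(v1)                                  # step A-4: V1' from the CNN
    v2p = bgru(v2)                                 # step A-5: V2' from the BGRU
    v3 = [a + b + c for a, b, c in zip(v1, v1p, v2p)]  # step A-6: concatenation
    return blstm_crf(v3)                           # POS label per word
```

Any stand-ins with the same call shapes can exercise the pipeline end to end.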
The invention also includes a rare-word part-of-speech feature separation method, comprising:
Step A-1: split the text to be separated into sentences and segment it into words, forming a first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the convolutional neural network (CNN) model; the CNN model outputs word feature vector V1';
Step A-5: feed V2 into the bidirectional gated recurrent unit (BGRU) model; the BGRU model outputs word feature vector V2';
Step B: concatenate V1, V1' and V2' to obtain V3; the vector units of V3 that contain preset characters are rare-word feature vector units, and the vector units of V3 that do not contain preset characters are normal-word feature vector units.
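Step B above can be sketched with a small helper. The sketch assumes the preset character survives in the second input text alongside the concatenated units of V3; the marker name is illustrative only.

```python
def separate_features(second_input, v3, preset="<RARE>"):
    # Split the concatenated feature units V3 into rare-word and normal-word
    # groups, using the preset character left in the second input text as the
    # marker (step B of the separation method).
    rare_units, normal_units = [], []
    for word, unit in zip(second_input, v3):
        (rare_units if word == preset else normal_units).append(unit)
    return rare_units, normal_units
```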
The invention also includes a training method for the part-of-speech tagging model, which comprises the convolutional neural network (CNN) model, bidirectional gated recurrent unit (BGRU) model, bidirectional long short-term memory (BLSTM) model, and conditional random field (CRF) model;
The training method includes:
Step C-1: convert sample data of the training corpus into a first input text;
Step A-2: detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3: convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feed V1 into the CNN model; the CNN model outputs word feature vector V1';
Step A-5: feed V2 into the BGRU model; the BGRU model outputs word feature vector V2';
Step A-6: concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the BLSTM output into the CRF model, and have the CRF model output the part-of-speech label of every word in the training corpus sample data;
Step C-2: compute the error between the part-of-speech labels output by the CRF model and the gold part-of-speech labels of the training corpus sample data, and update the CNN, BGRU, BLSTM, and CRF models according to the error.
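The error-and-update loop of step C-2 can be illustrated with a deliberately simplified stand-in: a structured-perceptron-style update over per-(word, tag) scores. The real method backpropagates through the CNN, BGRU, BLSTM, and CRF weights; this toy only shows the "compare prediction to gold label, then adjust" cycle.

```python
from collections import defaultdict

def train_tagger(samples, tags, epochs=5):
    # samples: list of (word list, gold tag list) pairs from the corpus.
    # Toy stand-in for step C-2: nudge per-(word, tag) scores toward the
    # gold labels whenever the current prediction disagrees with them.
    score = defaultdict(float)
    for _ in range(epochs):
        for words, gold in samples:
            pred = [max(tags, key=lambda t: score[(w, t)]) for w in words]
            for w, p, g in zip(words, pred, gold):
                if p != g:                  # error between output and gold label
                    score[(w, g)] += 1.0    # reward the gold tag
                    score[(w, p)] -= 1.0    # penalize the wrong prediction
    return score
```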
The invention also includes a part-of-speech tagging system, comprising:
a text preprocessing module, which splits the text to be tagged into sentences and segments it into words, forming a first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with preset characters to form a second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs word feature vector V1';
a BGRU model, which takes V2 as input and outputs word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a BLSTM model, which takes V3 as input and passes its output to the CRF model;
a CRF model, which outputs the part-of-speech label of every word in the text to be tagged.
The invention also includes a rare-word part-of-speech feature separation system, comprising:
a text preprocessing module, which splits the text to be separated into sentences and segments it into words, forming a first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with preset characters to form a second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs word feature vector V1';
a BGRU model, which takes V2 as input and outputs word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a feature separation module, for which the vector units of V3 that contain preset characters are rare-word feature vector units and the vector units of V3 that do not contain preset characters are normal-word feature vector units.
The invention also includes a training system for the part-of-speech tagging model, comprising:
a text conversion module, which converts sample data of the training corpus into a first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with preset characters to form a second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
a CNN model, which takes V1 as input and outputs word feature vector V1';
a BGRU model, which takes V2 as input and outputs word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
a BLSTM model, which takes V3 as input and passes its output to the CRF model;
a CRF model, which outputs the part-of-speech label of every word in the training corpus sample data;
an update module, which computes the error between the part-of-speech labels output by the CRF model and the gold part-of-speech labels of the training corpus sample data and updates the CNN, BGRU, BLSTM, and CRF models according to the error.
The part-of-speech tagging method of the present invention adds a BGRU model to the CNN+BLSTM+CRF baseline. Compared with the CNN-only front end mentioned in the background, the added BGRU improves the accuracy of extracting the part-of-speech features of normal words, and the input to BLSTM+CRF now contains the outputs of both the CNN and the BGRU. Because the BGRU output carries the rare-word marker (the preset characters), the BLSTM+CRF can separate rare words from normal words, further improving learning and recognition for both rare words and normal words.
Detailed description of the invention
Fig. 1 shows the structure of a single LSTM network;
Fig. 2 shows the structure of a GRU network;
Fig. 3 is a computational framework diagram of the GRU neuron state;
Fig. 4 is a flowchart of the part-of-speech tagging method of the present invention;
Fig. 5 shows the neural network of the CNN+BLSTM+CRF model;
Fig. 6 shows the neural network of the CNN+BGRU+BLSTM+CRF model of the present invention;
Fig. 7 is a flowchart of the rare-word part-of-speech feature separation method of the present invention;
Fig. 8 is a flowchart of the training method for the part-of-speech tagging model of the present invention;
Fig. 9 is a structural diagram of the part-of-speech tagging system of the present invention;
Fig. 10 is a structural diagram of the rare-word part-of-speech feature separation system of the present invention;
Fig. 11 is a structural diagram of the training system for the part-of-speech tagging model of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and do not describe a specific order or precedence. It should be understood that data so labeled are interchangeable where appropriate, so that the embodiments of the present invention described herein can be practiced in sequences other than those illustrated or described herein.
In the part-of-speech tagging field, an artificial neural network produces, for a text input, the corresponding part-of-speech recognition result. The network learns a mapping between input patterns and output patterns and outputs a learning result that represents this mapping. Based on the learning result, the network generates outputs for the inputs it is intended to handle.
The part-of-speech tagging method of the present invention comprises a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model, and a conditional random field (CRF) model. The four models are introduced below.
The convolutional neural network (CNN) model is commonly used for feature extraction. Its conventional parts are an input layer, convolutional layers, pooling layers, and an output layer.
The input layer can hold raw data or feature maps. A convolutional layer contains learnable convolution kernels and an activation function: the input is convolved with the kernels, the convolution results are passed through the activation function, and feature maps are output, so this layer is also the feature extraction layer. A pooling layer divides the input signal into non-overlapping regions and applies a pooling operation to each region; max pooling and mean pooling are the common operations, and pooling helps eliminate offset and distortion in the signal. A CNN typically uses a deep structure of alternating convolutional and pooling layers. The fully connected layer of the CNN combines the groups of features produced by the stacked convolution-pooling operations into one signal and derives a label probability distribution from the input, extracting the internal information of words and phrases and generating a character-based representation of each word.
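The convolution-plus-max-pooling operation described above can be sketched for a single filter in the scalar case. This is an illustrative simplification (one filter, flat dot product, no activation), not the patent's exact layer.

```python
def conv_maxpool(char_vecs, kernel, width=2):
    # char_vecs: per-character embedding vectors; kernel: one learned filter,
    # flattened to length width * dim. Slide the window, take the dot product
    # at each position, then keep the maximum (max pooling), which makes the
    # feature insensitive to where in the word the pattern occurred.
    feats = []
    for i in range(len(char_vecs) - width + 1):
        window = [x for v in char_vecs[i:i + width] for x in v]
        feats.append(sum(w * x for w, x in zip(kernel, window)))
    return max(feats)
```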
The bidirectional long short-term memory (BLSTM) model differs from a plain LSTM network: the BLSTM model has two parallel LSTM layers running in opposite directions, which share the same structure and differ only in the order in which they read the text. The structure of a single LSTM network is shown in Fig. 1.
The memory cell of the BLSTM model contains three kinds of gate units. The sigmoid input gate decides whether an input value may be added to the current state. The state cell has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate.
The update formulas are summarized as:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ denotes the sigmoid activation function, x_t is the input vector at time t, h_t is the hidden state, U_i, U_f, U_c, U_o are the weight matrices applied to x_t for the different gates, W_i, W_f, W_c, W_o are the weight matrices applied to h_{t-1}, b_i, b_f, b_c, b_o are the biases of the gates, and i_t, f_t, c_t, o_t denote the input gate, forget gate, memory cell, and output gate respectively.
The output of the BLSTM is the concatenation of the forward and backward hidden states:
y_t = [hf_t, hb_t]
The final fully connected layer of the BLSTM model is the output layer.
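One LSTM cell update following the per-gate formulas above can be written out directly. This sketch uses scalars for readability (real layers use matrices), and the parameter dictionary keys are illustrative names only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM step: p holds scalar weights W_* (on h_{t-1}), U_* (on x_t),
    # and biases b_* for the input, forget, output gates and the cell update.
    i = sigmoid(p["Wi"] * h_prev + p["Ui"] * x_t + p["bi"])   # input gate
    f = sigmoid(p["Wf"] * h_prev + p["Uf"] * x_t + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] * h_prev + p["Uo"] * x_t + p["bo"])   # output gate
    c_tilde = math.tanh(p["Wc"] * h_prev + p["Uc"] * x_t + p["bc"])
    c = f * c_prev + i * c_tilde                              # memory cell c_t
    h = o * math.tanh(c)                                      # hidden state h_t
    return h, c
```

A BLSTM runs two such recurrences, one over the sequence and one over its reverse, and concatenates their hidden states.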
The CRF serves as the output for part-of-speech labels. Let x = {x_1, ..., x_n} denote the input sequence, where x_i is the vector of the i-th word, and let y = {y_1, ..., y_n} denote the output sequence of part-of-speech labels, with Y(x) the set of label sequences for x. The CRF defines the conditional probability p(y | z; W, b):
p(y | z; W, b) = ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, z) / Σ_{y' ∈ Y(x)} ∏_{i=1}^{n} ψ_i(y'_{i-1}, y'_i, z)
where ψ_i(y', y, z) = exp(W_{y',y}ᵀ z_i + b_{y',y}) is the potential function and W and b are the weight and bias vectors. (The potential function is reconstructed here in the standard linear-chain form; the original formula was garbled.) When the model is trained, the CRF network is optimized by minimizing its negative log-likelihood.
The negative log-likelihood of the CRF serves as the loss function of the model:
L_CRF(W, b) = -Σ_i log p(y⁽ⁱ⁾ | z⁽ⁱ⁾; W, b)
The likelihood function expresses the probability of observing the data under different parameter vectors. Under a Gaussian-noise assumption, minimizing the negative log-likelihood is equivalent to minimizing the sum-of-squares error, i.e., minimizing the difference between the model's predictions and the actual values.
The BGRU (bidirectional gated recurrent unit) model is an improvement on the LSTM model. As shown in Fig. 2, r and z denote the reset and update gate mechanisms of the GRU model. Thanks to this gating optimization, the BGRU model has fewer parameters: it simplifies the model while preserving its effectiveness. The present invention uses the BGRU to extract features, minimizing the total number of model parameters on the premise that features are still extracted effectively.
The computational framework of the GRU neuron state is shown in Fig. 3. At time t, the state of the GRU is computed by the following equations:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where z_t and r_t are the update and reset gate activations respectively, ⊙ denotes element-wise multiplication of matrices, σ denotes the sigmoid function, and W denotes the shared parameters of the GRU model.
Suppose a sentence S_i contains T words, the t-th word being w_it. Treating S_i as a sequence whose components are its words, the forward GRU and the backward GRU yield, for each word, a forward representation h→_it and a backward representation h←_it respectively. Concatenating the two, h_it = [h→_it, h←_it], gives the semantic representation of the sentence S_i.
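The GRU update and the bidirectional pass can be sketched in the scalar case. The parameter names are illustrative; real layers use weight matrices over the concatenated [h_{t-1}, x_t].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    # One GRU update following the formulas above (scalar sketch).
    z = sigmoid(p["Wz_h"] * h_prev + p["Wz_x"] * x_t)   # update gate z_t
    r = sigmoid(p["Wr_h"] * h_prev + p["Wr_x"] * x_t)   # reset gate r_t
    h_tilde = math.tanh(p["W_h"] * (r * h_prev) + p["W_x"] * x_t)
    return (1 - z) * h_prev + z * h_tilde               # h_t

def bgru_encode(xs, p):
    # Forward GRU over xs and backward GRU over reversed(xs); each position's
    # representation is the pair (h_forward, h_backward), as in h_it above.
    fwd, h = [], 0.0
    for x in xs:
        h = gru_step(x, h, p)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = gru_step(x, h, p)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))
```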
The part-of-speech tagging method proposed by the present invention adds a BGRU model to the CNN+BLSTM+CRF baseline. The specific algorithm, shown in Fig. 4, comprises the following steps:
Step A-1 (S101): split the text to be tagged into sentences and segment it into words, forming a first input text;
Step A-2 (S102): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3 (S103): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S104): feed V1 into the CNN model; the CNN model outputs word feature vector V1';
Step A-5 (S105): feed V2 into the BGRU model; the BGRU model outputs word feature vector V2';
Step A-6 (S106): concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the BLSTM output into the CRF model, and have the CRF model output the part-of-speech label of every word in the text to be tagged.
On top of the CNN+BLSTM+CRF baseline, which merely scans the word vectors with a CNN, the part-of-speech tagging method of the present invention adds a BGRU network targeted at the rare-word problem of the part-of-speech tagging field: the BGRU processes the preprocessed normal-word word vectors and obtains the part-of-speech features of normal words. Fig. 5 shows the neural network of the prior-art CNN+BLSTM+CRF model; Fig. 6 shows the neural network of the CNN+BGRU+BLSTM+CRF model of the present invention.
The BGRU has a forward GRU and a backward GRU: in the hidden layer, the forward GRU captures the text's information in reading order while the backward GRU captures it in reverse order, so more feature information is captured than with a one-directional network. Moreover, because the rare-word parts are removed from the BGRU's input, extracting features with the BGRU weakens the discontinuity caused by their removal and extracts the part-of-speech features of normal words as fully as possible.
The word features from the CNN are then concatenated with the text word vectors and fed as input to the BLSTM+CRF network. Compared with the CNN-only front end mentioned in the background, the added BGRU improves the extraction accuracy of normal-word part-of-speech features, and the input to BLSTM+CRF contains the outputs of both CNN and BGRU. Because the BGRU output carries the rare-word marker (the preset characters), the BLSTM+CRF can separate rare words from normal words, further improving learning and recognition for both rare words and normal words.
Further, in the part-of-speech tagging method of the present invention, the preset characters of step A-2 may be a null character, 0, NaN, any legal vector character defined by the programming language, or a pre-defined custom character. The preset characters allow the BLSTM+CRF to distinguish rare words from normal words.
In step A-3 of the method of the present invention, a text vectorization tool such as Word2Vec can convert the first input text into word vectors V1 and the second input text into word vectors V2. The word embeddings produced by Word2Vec effectively express the relations between words; it is one of the common word embedding algorithms.
In the method for the invention, CNN model optimization uses the operation of maximum value pond, and the extraction part of speech of maximum possible is special Sign.
In step A-2 of the method of the present invention, the criterion for a rare word is: its number of occurrences in the reference corpus is below a preset value. The preset value is set empirically, generally a number between 1 and 6.
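The frequency-based rare-word criterion can be implemented with a simple word count over the reference corpus. A minimal sketch (function and threshold names are illustrative):

```python
from collections import Counter

def rare_words(reference_corpus, threshold=3):
    # reference_corpus: list of word-segmented sentences. A word is rare when
    # its occurrence count falls below the preset value (typically 1-6).
    counts = Counter(w for sentence in reference_corpus for w in sentence)
    return {w for w, c in counts.items() if c < threshold}
```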
The reference corpus may be the PFR corpus.
The PFR corpus was produced by performing word segmentation and part-of-speech tagging on the plain-text corpus of the People's Daily for the first half of 1998, organized strictly by the newspaper's date, edition order, and article order. Every word in each article carries a part-of-speech label. The current tag set contains 26 basic tags (noun n, time word t, place word s, locative f, numeral m, classifier q, distinguishing word b, pronoun r, verb v, adjective a, descriptive word z, adverb d, preposition p, conjunction c, particle u, modal particle y, interjection e, onomatopoeia o, idiom i, fixed expression l, abbreviation j, preceding component h, following component k, morpheme g, non-morpheme character x, punctuation w). From the perspective of corpus applications, proper-noun tags were added (person name nr, place name ns, organization name nt, other proper noun nz); further tags were added from a linguistic perspective, for a total of more than 40 tags.
Optionally, in step A-2 of the method of the present invention, the rare-word criterion may instead be: any word that does not appear in a normal-word dictionary is a rare word, where the normal-word dictionary contains all normal words.
The invention also includes a rare-word part-of-speech feature separation method, which, as shown in Fig. 7, comprises the following steps:
Step A-1 (S201): split the text to be separated into sentences and segment it into words, forming a first input text;
Step A-2 (S202): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3 (S203): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S204): feed V1 into the CNN model; the CNN model outputs word feature vector V1';
Step A-5 (S205): feed V2 into the BGRU model; the BGRU model outputs word feature vector V2';
Step B (S206): concatenate V1, V1' and V2' to obtain V3; the vector units of V3 that contain preset characters are rare-word feature vector units, and the vector units of V3 that do not contain preset characters are normal-word feature vector units.
Preferably, in Fig. 7, the CNN model uses the max pooling operation.
Preferably, in Fig. 7, the rare-word criterion is: the number of occurrences in the reference corpus is below a preset value.
The V3, rare-word feature vector units, and normal-word feature vector units obtained in step B of Fig. 7 can be fed to other neural network models, to improve those models' part-of-speech tagging accuracy for normal words and rare words.
The invention also includes a training method for the part-of-speech tagging model, which comprises the convolutional neural network (CNN) model, bidirectional gated recurrent unit (BGRU) model, bidirectional long short-term memory (BLSTM) model, and conditional random field (CRF) model.
As shown in Fig. 8, the training method includes:
Step C-1 (S301): convert sample data of the training corpus into a first input text;
Step A-2 (S302): detect whether the first input text contains rare words; if so, replace the rare words of the first input text with preset characters to form a second input text; if not, let the second input text equal the first input text;
Step A-3 (S303): convert the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4 (S304): feed V1 into the CNN model; the CNN model outputs word feature vector V1';
Step A-5 (S305): feed V2 into the BGRU model; the BGRU model outputs word feature vector V2';
Step A-6 (S306): concatenate V1, V1' and V2' to obtain V3, feed V3 into the BLSTM model, feed the BLSTM output into the CRF model, and have the CRF model output the part-of-speech label of every word in the training corpus sample data;
Step C-2 (S307): compute the error between the part-of-speech labels output by the CRF model and the gold part-of-speech labels of the training corpus sample data, and update the CNN, BGRU, BLSTM, and CRF models according to the error.
The training corpus in Fig. 8 is preferably the PFR corpus.
In S307 (step C-2) of Fig. 8, the Adam algorithm can be used to control the update process when updating the CNN, BGRU, BLSTM, and CRF models.
Adam stands for "adaptive moment estimation". In probability theory, the "moment" means: if a random variable X obeys some distribution, the first moment of X is E(X), i.e. the sample mean, and the second moment of X is E(X²), i.e. the mean of the squared samples. The Adam algorithm dynamically adjusts a per-parameter learning rate from first- and second-moment estimates of each parameter's gradient of the loss function. Adam is based on gradient descent, but the iterative step size of each parameter stays within a determined range per iteration, so a very large gradient does not cause a very large step and the parameter values are more stable. Preferably, the present invention applies an exponential learning-rate decay once every 3000 steps with a decay base of 0.1, leaving the remaining parameters at their default settings.
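The stepwise learning-rate decay and a single Adam update can both be written out compactly. A hedged sketch under the stated settings (initial rate 0.01, decay base 0.1 every 3000 steps); the Adam defaults beta1=0.9, beta2=0.999, eps=1e-8 are the algorithm's usual choices, not values given in the text.

```python
import math

def learning_rate(step, initial=0.01, decay=0.1, every=3000):
    # Exponential decay applied once every `every` steps, base 0.1.
    return initial * (decay ** (step // every))

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a scalar parameter: exponential moving averages of
    # the gradient (first moment) and squared gradient (second moment), with
    # bias correction, then a bounded step on theta.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note how the bias-corrected ratio keeps the very first step close to the learning rate regardless of the gradient's magnitude.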
When training the model, the hyperparameters of the CNN model and the BLSTM model also need to be set. For the CNN model, the hyperparameters include the filter window size, preferably set to 2*50, and the number of filters, preferably set to 50. For the BLSTM model, the hidden-unit hyperparameters include the number of units and the number of layers; preferably, the number of units is set to 100 and the number of layers to 2.
In addition, the present invention uses the dropout technique to reduce overfitting; the dropout rate is preferably 0.5.
By setting the above parameters (learning rate, CNN window size, and so on), the present invention makes the loss function converge while giving the network optimal performance under this architecture.
For example, the learning parameters of the invention may be set as follows: batch size 64 to accelerate convergence, initial learning rate 0.01, decay rate 0.1, and a training total of 20000 iterations, to obtain a well-trained part-of-speech tagging (CNN+BGRU+BLSTM+CRF) model.
The embodiment of Fig. 9 of the present invention provides a part-of-speech tagging system comprising a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) model, and a conditional random field (CRF) model. The system further includes:
a text preprocessing module, which splits the text to be tagged into sentences and segments it into words, forming a first input text;
a rare-word processing module, which detects whether the first input text contains rare words; if so, it replaces the rare words of the first input text with preset characters to form a second input text; if not, it lets the second input text equal the first input text;
a word vector generation module, which converts the first input text into word vectors V1 and the second input text into word vectors V2;
the CNN model, which takes V1 as input and outputs word feature vector V1';
the BGRU model, which takes V2 as input and outputs word feature vector V2';
a vector concatenation module, which concatenates V1, V1' and V2' to obtain V3;
the BLSTM model, which takes V3 as input and passes its output to the CRF model;
the CRF model, which outputs the part-of-speech label of every word in the text to be tagged.
The CNN model in Fig. 9 may use the max pooling operation.
In Fig. 9, the rare-word criterion may be set as: the number of occurrences in the reference corpus is below a preset value.
The part-of-speech tagging system of the present invention adds a BGRU model to the CNN+BLSTM+CRF baseline. Compared with the CNN-only front end mentioned in the background, the added BGRU improves the accuracy of extracting the part-of-speech features of normal words, and the input to BLSTM+CRF contains the outputs of both the CNN and the BGRU. Because the BGRU output carries the rare-word marker (the preset characters), the BLSTM+CRF can separate rare words from normal words, further improving learning and recognition for both rare words and normal words.
The embodiment of Figure 10 of the present invention provides a rare-word part-of-speech feature separation system that includes a convolutional neural network (CNN) model and a bidirectional gated recurrent unit (BGRU) model. The system further includes:
Text preprocessing module: performs sentence splitting and word segmentation on the text to be separated, forming a first input text;
Rare word processing module: detects whether the first input text contains rare words; if so, replaces the rare words of the first input text with a preset character to form a second input text; if not, sets the second input text equal to the first input text;
Word vector generation module: converts the first input text into word vectors V1 and the second input text into word vectors V2;
CNN model: V1 is fed into the CNN model, which outputs a word feature vector V1';
BGRU model: V2 is fed into the BGRU model, which outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1', and V2' to obtain V3;
Feature separation module: vector units in V3 that contain the preset character are rare-word feature vector units; vector units in V3 that do not contain the preset character are normal-word feature vector units.
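A minimal sketch of the separation rule, assuming each vector unit is tracked alongside its token in the second input text (the `<RARE>` placeholder is a hypothetical choice of preset character):

```python
PRESET_CHAR = "<RARE>"  # hypothetical preset character

def separate_features(second_tokens, v3_units):
    """Split per-word vector units of V3 into rare-word units (the
    positions where the second input text carries the preset character)
    and normal-word units."""
    rare, normal = [], []
    for token, unit in zip(second_tokens, v3_units):
        (rare if token == PRESET_CHAR else normal).append(unit)
    return rare, normal

tokens = ["the", "<RARE>", "cat"]
units = [[0.1], [0.2], [0.3]]   # illustrative per-word units of V3
rare_units, normal_units = separate_features(tokens, units)
```

The point of the marker is exactly this separability: downstream components can treat the rare-word units and normal-word units differently without re-examining the raw text.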
The CNN model in Figure 10 may use a max-pooling operation.
In Figure 10, the decision condition for a rare word may be set as: its number of occurrences in a reference corpus is below a preset value.
The embodiment of Figure 11 of the present invention provides a training system for a part-of-speech tagging model, the model including a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model, and a conditional random field (CRF) model;
The training system includes:
Text conversion module: converts sample data of a training corpus into a first input text;
Rare word processing module: detects whether the first input text contains rare words; if so, replaces the rare words of the first input text with a preset character to form a second input text; if not, sets the second input text equal to the first input text;
Word vector generation module: converts the first input text into word vectors V1 and the second input text into word vectors V2;
CNN model: V1 is fed into the CNN model, which outputs a word feature vector V1';
BGRU model: V2 is fed into the BGRU model, which outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1', and V2' to obtain V3;
BLSTM model: V3 is fed into the BLSTM model, and the output of the BLSTM model is fed into the CRF model;
CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the training-corpus sample data.
Update module: computes the error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training-corpus sample data, and updates the CNN, BGRU, BLSTM, and CRF models according to the error.
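The comparison performed by the update module can be illustrated simply as a tag-disagreement rate. (In actual training the error would be a differentiable loss back-propagated through the CRF, BLSTM, BGRU, and CNN; this sketch only illustrates the comparison between predicted and gold tags.)

```python
def tag_error(predicted_tags, gold_tags):
    """Fraction of segmented words whose predicted part-of-speech tag
    differs from the gold tag in the training sample."""
    wrong = sum(p != g for p, g in zip(predicted_tags, gold_tags))
    return wrong / len(gold_tags)

predicted = ["n", "v", "n", "adv"]    # CRF output (illustrative tag set)
gold = ["n", "v", "adj", "adv"]       # training-corpus annotation
err = tag_error(predicted, gold)
```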
When updating the CNN, BGRU, BLSTM, and CRF models in Figure 11, the Adam algorithm may also be used to control the model update process.
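A single Adam update step, following the standard formulation; the hyperparameter values below are the commonly used defaults and are an assumption here, not values given by the patent:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and
    of its square, bias correction, then a scaled parameter step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([2.0]), m, v, t=1)
# on the first step the parameter moves by roughly lr in the
# negative gradient direction
```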
The training corpus in Figure 11 is preferably the PFR corpus.
It should be noted that the embodiments of the part-of-speech tagging system of the invention follow the same principles as the embodiments of the part-of-speech tagging method of the invention; the related parts may be cross-referenced.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A part-of-speech tagging method, characterized in that it employs a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model, and a conditional random field (CRF) model, the method comprising:
Step A-1: performing sentence splitting and word segmentation on text to be tagged to form a first input text;
Step A-2: detecting whether the first input text contains rare words; if so, replacing the rare words of the first input text with a preset character to form a second input text; if not, setting the second input text equal to the first input text;
Step A-3: converting the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feeding V1 into the CNN model, the CNN model outputting a word feature vector V1';
Step A-5: feeding V2 into the BGRU model, the BGRU model outputting a word feature vector V2';
Step A-6: concatenating V1, V1', and V2' to obtain V3, feeding V3 into the BLSTM model, and feeding the output of the BLSTM model into the CRF model, the CRF model outputting the part-of-speech tags of all segmented words of the text to be tagged.
2. The method according to claim 1, characterized in that the CNN model uses a max-pooling operation.
3. The method according to claim 1, characterized in that the decision condition for a rare word is: its number of occurrences in a reference corpus is below a preset value.
4. A rare-word part-of-speech feature separation method, characterized in that the method comprises:
Step A-1: performing sentence splitting and word segmentation on text to be separated to form a first input text;
Step A-2: detecting whether the first input text contains rare words; if so, replacing the rare words of the first input text with a preset character to form a second input text; if not, setting the second input text equal to the first input text;
Step A-3: converting the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feeding V1 into a convolutional neural network (CNN) model, the CNN model outputting a word feature vector V1';
Step A-5: feeding V2 into a bidirectional gated recurrent unit (BGRU) model, the BGRU model outputting a word feature vector V2';
Step B: concatenating V1, V1', and V2' to obtain V3, wherein vector units in V3 that contain the preset character are rare-word feature vector units, and vector units in V3 that do not contain the preset character are normal-word feature vector units.
5. The method according to claim 4, characterized in that the CNN model uses a max-pooling operation.
6. The method according to claim 4, characterized in that the decision condition for a rare word is: its number of occurrences in a reference corpus is below a preset value.
7. A training method for a part-of-speech tagging model, characterized in that the part-of-speech tagging model includes a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model, and a conditional random field (CRF) model;
the method comprising:
Step C-1: converting sample data of a training corpus into a first input text;
Step A-2: detecting whether the first input text contains rare words; if so, replacing the rare words of the first input text with a preset character to form a second input text; if not, setting the second input text equal to the first input text;
Step A-3: converting the first input text into word vectors V1 and the second input text into word vectors V2;
Step A-4: feeding V1 into the CNN model, the CNN model outputting a word feature vector V1';
Step A-5: feeding V2 into the BGRU model, the BGRU model outputting a word feature vector V2';
Step A-6: concatenating V1, V1', and V2' to obtain V3, feeding V3 into the BLSTM model, and feeding the output of the BLSTM model into the CRF model, the CRF model outputting the part-of-speech tags of all segmented words of the training-corpus sample data;
Step C-2: computing the error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training-corpus sample data, and updating the CNN, BGRU, BLSTM, and CRF models according to the error.
8. The method according to claim 7, characterized in that, when updating the CNN, BGRU, BLSTM, and CRF models, an Adam algorithm is used to control the update process of the models.
9. The method according to claim 7, characterized in that the training corpus is the PFR corpus.
10. A part-of-speech tagging system, characterized in that the system comprises:
Text preprocessing module: performs sentence splitting and word segmentation on the text to be tagged, forming a first input text;
Rare word processing module: detects whether the first input text contains rare words; if so, replaces the rare words of the first input text with a preset character to form a second input text; if not, sets the second input text equal to the first input text;
Word vector generation module: converts the first input text into word vectors V1 and the second input text into word vectors V2;
CNN model: V1 is fed into a convolutional neural network (CNN) model, which outputs a word feature vector V1';
BGRU model: V2 is fed into a bidirectional gated recurrent unit (BGRU) model, which outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1', and V2' to obtain V3;
BLSTM model: V3 is fed into a bidirectional long short-term memory (BLSTM) network model, and the output of the BLSTM model is fed into a conditional random field (CRF) model;
CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the text to be tagged.
11. The system according to claim 10, characterized in that the CNN model uses a max-pooling operation.
12. The system according to claim 10, characterized in that the decision condition for a rare word is: its number of occurrences in a reference corpus is below a preset value.
13. A rare-word part-of-speech feature separation system, characterized in that the system comprises:
Text preprocessing module: performs sentence splitting and word segmentation on the text to be separated, forming a first input text;
Rare word processing module: detects whether the first input text contains rare words; if so, replaces the rare words of the first input text with a preset character to form a second input text; if not, sets the second input text equal to the first input text;
Word vector generation module: converts the first input text into word vectors V1 and the second input text into word vectors V2;
CNN model: V1 is fed into a convolutional neural network (CNN) model, which outputs a word feature vector V1';
BGRU model: V2 is fed into a bidirectional gated recurrent unit (BGRU) model, which outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1', and V2' to obtain V3;
Feature separation module: vector units in V3 that contain the preset character are rare-word feature vector units; vector units in V3 that do not contain the preset character are normal-word feature vector units.
14. The system according to claim 13, characterized in that the CNN model uses a max-pooling operation.
15. The system according to claim 13, characterized in that the decision condition for a rare word is: its number of occurrences in a reference corpus is below a preset value.
16. A training system for a part-of-speech tagging model, characterized in that the part-of-speech tagging model includes a convolutional neural network (CNN) model, a bidirectional gated recurrent unit (BGRU) model, a bidirectional long short-term memory (BLSTM) network model, and a conditional random field (CRF) model;
the system comprising:
Text conversion module: converts sample data of a training corpus into a first input text;
Rare word processing module: detects whether the first input text contains rare words; if so, replaces the rare words of the first input text with a preset character to form a second input text; if not, sets the second input text equal to the first input text;
Word vector generation module: converts the first input text into word vectors V1 and the second input text into word vectors V2;
CNN model: V1 is fed into the CNN model, which outputs a word feature vector V1';
BGRU model: V2 is fed into the BGRU model, which outputs a word feature vector V2';
Vector concatenation module: concatenates V1, V1', and V2' to obtain V3;
BLSTM model: V3 is fed into the BLSTM model, and the output of the BLSTM model is fed into the CRF model;
CRF model: the CRF model outputs the part-of-speech tags of all segmented words of the training-corpus sample data.
Update module: computes the error between the part-of-speech tags output by the CRF model and the part-of-speech tags of the training-corpus sample data, and updates the CNN, BGRU, BLSTM, and CRF models according to the error.
17. The system according to claim 16, characterized in that, when updating the CNN, BGRU, BLSTM, and CRF models, an Adam algorithm is used to control the update process of the models.
18. The system according to claim 16, characterized in that the training corpus is the PFR corpus.
CN201711095902.8A 2017-11-09 2017-11-09 Part-of-speech tagging method and labeling system Pending CN109766523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711095902.8A CN109766523A (en) 2017-11-09 2017-11-09 Part-of-speech tagging method and labeling system


Publications (1)

Publication Number Publication Date
CN109766523A true CN109766523A (en) 2019-05-17

Family

ID=66449760


Country Status (1)

Country Link
CN (1) CN109766523A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010250814A (en) * 2009-04-14 2010-11-04 Nec (China) Co Ltd Part-of-speech tagging system, training device and method of part-of-speech tagging model
CN106960003A (en) * 2017-02-15 2017-07-18 黑龙江工程学院 Plagiarize the query generation method of the retrieval of the source based on machine learning in detection
CN107291795A (en) * 2017-05-03 2017-10-24 华南理工大学 A kind of dynamic word insertion of combination and the file classification method of part-of-speech tagging


Non-Patent Citations (1)

Title
HU Jie et al., "A Chinese word segmentation model based on bidirectional recurrent networks", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN110415683A (en) * 2019-07-10 2019-11-05 上海麦图信息科技有限公司 A kind of air control voice instruction recognition method based on deep learning
CN110377691A (en) * 2019-07-23 2019-10-25 上海应用技术大学 Method, apparatus, equipment and the storage medium of text classification
CN111444723A (en) * 2020-03-06 2020-07-24 深圳追一科技有限公司 Information extraction model training method and device, computer equipment and storage medium
CN111507104A (en) * 2020-03-19 2020-08-07 北京百度网讯科技有限公司 Method and device for establishing label labeling model, electronic equipment and readable storage medium
JP2021149916A (en) * 2020-03-19 2021-09-27 ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド Method for establishing label labeling model, device, electronic equipment, program, and readable storage medium
KR20210118360A (en) * 2020-03-19 2021-09-30 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, electronic device, program and readable storage medium for creating a label marking model
JP7098853B2 (en) 2020-03-19 2022-07-12 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Methods for establishing label labeling models, devices, electronics, programs and readable storage media
US11531813B2 (en) 2020-03-19 2022-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, electronic device and readable storage medium for creating a label marking model
KR102645185B1 (en) * 2020-03-19 2024-03-06 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, electronic device, program and readable storage medium for creating a label marking model
CN112183086A (en) * 2020-09-23 2021-01-05 北京先声智能科技有限公司 English pronunciation continuous reading mark model based on sense group labeling

Similar Documents

Publication Publication Date Title
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN107992597B (en) Text structuring method for power grid fault case
CN110222349B (en) Method and computer for deep dynamic context word expression
CN109766523A (en) Part-of-speech tagging method and labeling system
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN110083831A (en) A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110555084B (en) Remote supervision relation classification method based on PCNN and multi-layer attention
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN112199945A (en) Text error correction method and device
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN113268974B (en) Method, device and equipment for marking pronunciations of polyphones and storage medium
CN110134950B (en) Automatic text proofreading method combining words
CN110826334A (en) Chinese named entity recognition model based on reinforcement learning and training method thereof
CN113220876B (en) Multi-label classification method and system for English text
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN109492223A (en) A kind of Chinese missing pronoun complementing method based on ANN Reasoning
Yang et al. Recurrent neural network-based language models with variation in net topology, language, and granularity
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
Chan et al. Applying and optimizing NLP model with CARU
CN114357166B (en) Text classification method based on deep learning
CN114386425B (en) Big data system establishing method for processing natural language text content
CN115659981A (en) Named entity recognition method based on neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20190517)