CN109766553A - A Chinese word segmentation method based on a capsule model combining multiple regularizations - Google Patents

A Chinese word segmentation method based on a capsule model combining multiple regularizations

Info

Publication number
CN109766553A
CN109766553A (application CN201910018546.2A)
Authority
CN
China
Prior art keywords
capsule
corpus
vector
character
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910018546.2A
Other languages
Chinese (zh)
Inventor
李明正
李思
孙忆南
徐雅静
王蓬辉
赵建博
刘伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910018546.2A priority Critical patent/CN109766553A/en
Publication of CN109766553A publication Critical patent/CN109766553A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a Chinese word segmentation method based on a capsule model combining multiple regularizations. By adding a capsule sliding window, the capsule model is transferred to the natural language processing (NLP) sequence labeling task of Chinese word segmentation, solving the technical problem that the capsule model is not suited to sequence labeling tasks. By jointly combining multiple regularization terms, the method achieves simple domain transfer, adapts the capsule model to sequence labeling tasks, completes Chinese word segmentation with higher accuracy, and supports more complex natural language processing tasks. The joint use of multiple regularization terms improves the generalization ability of the model, realizes a degree of domain transfer, and reduces the amount of manual corpus annotation, thereby reducing the labor and time cost of manual annotation in natural language processing research.

Description

A Chinese word segmentation method based on a capsule model combining multiple regularizations
Technical field
The present invention relates to the field of Internet technology, and in particular to a Chinese word segmentation method based on a capsule model combining multiple regularizations.
Background art
With the development of information technology, machine learning and related technologies, automatic information processing has gradually been applied to many scenarios, such as mining user preferences from film reviews and shopping comments, or automatically generating a short summary of an article. All of these require automatic processing of text, and as Chinese users become increasingly active on the Internet, the amount of information they generate keeps growing, making automatic processing of text information ever more necessary. As a result, natural language processing technologies are being applied in every corner of society. Among natural language processing technologies, and for the development of Chinese natural language processing in particular, automatic Chinese word segmentation is one of the most fundamental and most critical.
Chinese word segmentation is the task of dividing a Chinese sentence into words so that a machine can more easily understand Chinese text. Chinese differs from English: English words are separated by spaces, whereas in Chinese, especially modern Chinese, a word is generally formed by two or more Chinese characters joined together and cannot be understood simply by taking single characters as boundaries. When automatically processing Chinese text, a computer therefore needs to segment the text into words first. Many Chinese natural language processing techniques, such as part-of-speech tagging, named entity recognition, text classification, text summarization, event extraction and information retrieval, depend heavily on Chinese word segmentation, and its quality has a positive or negative effect on all of the technologies that rely on it. Chinese word segmentation is thus fundamental, and building a good word segmentation system is important for high-performance automatic processing of text information.
The task of Chinese word segmentation is to let a computer automatically insert boundary markers, such as spaces, between the words of a Chinese text by means of an algorithm. Automatic Chinese word segmentation has been developing for more than twenty years, starting from dictionary matching methods, i.e., the classical forward/backward maximum matching, then segmentation algorithms with probabilistic disambiguation models, and later traditional conditional random fields (CRFs), the structured perceptron, the maximum entropy model (ME) and today's neural network (NN) segmentation models. Segmentation algorithms keep improving, and most sequence labeling tasks currently rely on machine learning: a sequence labeling algorithm assigns a class label to each member of an observation sequence. Chinese word segmentation is generally also treated as a sequence labeling task: each character is labeled with its position within a word, and the word segmentation result is obtained from the resulting label sequence.
For neural network methods, large-scale, carefully annotated data has a decisive influence on the final result of each task; however, such annotation has to be done by people, which incurs great labor and time costs. Domain transfer aims to reduce the annotation required in other domains by transferring knowledge from data that has already been annotated. For Chinese word segmentation, for example, large-scale annotated data is concentrated in the news domain, while annotated text in other domains is comparatively scarce. How to transfer the key information of a large annotated data set to a data set of another domain with only a small amount of annotation, or even none, therefore becomes a difficult problem, and domain transfer techniques become correspondingly more important.
As shown in Figure 1, the article "Dynamic Routing between Capsules", a first prior-art document, proposes the following model for solving the handwritten digit recognition problem:
First, a handwritten digit picture is converted into a 28x28 matrix as input. Second, features are extracted from the input matrix by a convolutional layer with 256 convolution kernels of size 9x9 and stride 1, preferably using the rectified linear unit (ReLU) as activation function, producing 256 feature maps of size 20x20 each. Third, the output of the previous step is passed through a primary capsule layer, yielding 32 feature maps; each map consists of 6x6 capsules, and each capsule is an 8-dimensional vector. Fourth, the output of the previous step is fed into a digit capsule layer whose size is the number of classes; for handwritten digit recognition there are 10 classes, so this layer has 10 capsules, each a 16-dimensional vector. Fifth, the probability of each class is obtained by a non-linear computation, and the class with the highest probability is the predicted class.
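For illustration, the layer sizes quoted above can be checked with a short Python sketch (purely illustrative; the stride of 2 in the primary capsule convolution is taken from the cited paper, it is not stated in this description):

```python
def conv_out(size, kernel, stride=1):
    """Output size of a 'valid' convolution."""
    return (size - kernel) // stride + 1

assert conv_out(28, 9) == 20            # first conv layer: 256 feature maps of 20x20
assert conv_out(20, 9, stride=2) == 6   # primary capsule layer: 32 maps of 6x6 capsules (8-d each)
n_primary = 32 * 6 * 6                  # 1152 primary capsules are routed to
n_digit = 10                            # 10 digit capsules of 16 dimensions each
```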
As shown in Fig. 2, the input and output of a capsule are computed by the dynamic routing algorithm, which operates only between layers that contain capsules.
The length of a capsule's output vector represents the probability that the entity the capsule stands for is present in the current input. A non-linear "squash" function is therefore used to ensure that short vectors are compressed to almost zero length and long vectors to a length just below 1, so that discriminative learning can exploit this non-linearity:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)    (2-1)
wherein v_j is the output of the j-th capsule and s_j is its total input.
For all capsules except those in the first layer, the total input s_j of a capsule is a weighted sum over the prediction vectors û_(j|i) coming from the capsules of the layer below, each prediction vector being the output u_i of a lower-level capsule multiplied by a weight matrix W_ij:
s_j = Σ_i c_ij · û_(j|i),   û_(j|i) = W_ij · u_i    (2-2)
wherein c_ij is a coupling coefficient determined by the iterative routing procedure.
The coupling coefficients between capsule i and all capsules in the layer above sum to 1 and are determined by a routing softmax whose initial logits b_ij are log prior probabilities that capsule i should be coupled to capsule j.
These log priors can be learned discriminatively together with the other weights; they depend on the location and type of the two capsules rather than on the current input image. The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output v_j of each capsule j in the layer above and the prediction û_(j|i) made by capsule i. This agreement is simply the scalar product û_(j|i) · v_j; it is treated as a log-likelihood and is added to the initial logit b_ij before the new values of all coupling coefficients linking capsule i to the higher-level capsules are computed.
In a convolutional capsule layer, each capsule outputs a local grid of vectors to each type of capsule in the layer above, using different transformation matrices for each part of the grid and for each capsule type.
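As a concrete illustration of the squash non-linearity (2-1) and the prediction vectors of (2-2), a minimal numpy sketch (variable names and the 8-/16-dimensional capsule sizes follow the cited paper and are illustrative only):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Non-linear 'squash': short vectors shrink toward zero, long ones toward unit length."""
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# Prediction vector u_hat_{j|i} = W_ij . u_i for one lower-level capsule output u_i
u_i = np.random.randn(8)          # 8-dimensional lower-level capsule output
W_ij = np.random.randn(16, 8)     # transformation matrix toward a 16-dimensional upper capsule
u_hat = W_ij @ u_i                # prediction made by capsule i for capsule j
v_j = squash(u_hat)               # what capsule j would output if u_hat were its total input
```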
As shown in Fig. 3, the article "Deep Learning for Chinese Word Segmentation and POS Tagging", a second prior-art document, proposes a feedforward neural network for solving the Chinese word segmentation problem:
First, each Chinese character is converted into a d-dimensional vector, identical characters being mapped to identical vectors through a dictionary. Second, the vectors of the previous step are converted into hidden-unit representations by a feedforward neural network. Third, the output of the previous step is passed through a sigmoid function as a non-linear mapping. Fourth, the output is fed into a feedforward layer whose output dimension equals the number of classes, which is 4 for Chinese word segmentation. Fifth, a softmax is applied to the four-dimensional output to obtain the probability of each character belonging to each class; these probabilities are decoded with the Viterbi algorithm to compute the optimal sequence and produce the segmentation.
During their research, the inventors found the following problems in the technique of "Dynamic Routing between Capsules":
1. The model is designed only for handwritten digit recognition and, as a whole, is not suited to sequence labeling tasks;
2. The image-reconstruction regularization term is not suited to the Chinese word segmentation task.
Owing to the above technical problems, the prior art has the following disadvantages:
1. Accuracy is mediocre;
2. The capsule model is not suited to sequence labeling tasks;
3. Generalization is mediocre, i.e., the model can only be trained and tested on the same corpus, and its performance drops sharply when it is tested on a corpus from a different domain.
Summary of the invention
To solve the above technical problems, the present invention provides a Chinese word segmentation method based on a capsule model combining multiple regularizations. By applying the capsule model it achieves more accurate Chinese word segmentation, so that training and testing on the same corpus give more accurate results; by modifying and extending the capsule model it makes the model applicable to sequence labeling tasks; and by jointly combining multiple regularization terms it achieves domain transfer for the Chinese word segmentation task on top of the capsule model.
The present invention provides a Chinese word segmentation method based on a capsule model combining multiple regularizations. When training on a corpus of a particular domain, the method comprises:
Step 1: identify the maximum sentence length in the corpus, and pad every shorter sentence in the corpus to the maximum length with a pre-stored character;
Step 2: map the Chinese characters of the sentences in the corpus to vector representations;
Step 3: extract features from the vector representations through a convolutional layer;
Step 4: feed the extracted features into a primary capsule layer; through convolution operations, obtain a scalar representation for each character of each feature map, and connect the obtained scalar representations into a vector that serves as one capsule, thereby obtaining the primary capsules;
Step 5: pass the primary capsules through a capsule sliding window to obtain the feature representation of each character, so as to adapt the model to sequence labeling tasks;
Step 6: use the features obtained after the capsule sliding window as the input of a routing algorithm, and obtain tag capsules through the routing algorithm;
Step 7: take the length (modulus) of each tag capsule output to obtain the label probabilities of each character;
Step 8: compute the cross entropy between the label probabilities and the true label probabilities of each character, and feed the label probabilities into a conditional random field (CRF) to compute the likelihood probability;
Step 9: compute the loss function as a weighted sum of the cross entropy and the log-likelihood probability, with the log-likelihood probability acting as a regularization term of the loss function, and update the weights of each network layer through the back-propagation (BP) algorithm.
Further, when training on corpora from at least two domains, with one of them as the target-domain corpus for which segmentation is ultimately required, step 9 is replaced as follows:
Step 9: the loss function is a weighted sum over the cross entropy of the target-domain corpus, its log-likelihood probability, and the cross entropy of the source-domain corpus from which segmentation knowledge is transferred; the likelihood probability of the target-domain corpus and the cross entropy of the source-domain corpus serve as regularization terms, and the weight of the cross entropy of the source-domain corpus is smaller than that of the non-regularization term.
Further, the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy_target + λ2 · Likelihood_target + λ3 · CrossEntropy_source
wherein the first term is the cross entropy of the target-domain corpus, the second term is the likelihood probability of the target domain, the third term is the cross entropy of the source-domain corpus, and λn are the weights of the respective terms.
Further, outside training, at segmentation (inference) time, steps 8 and 9 are replaced as follows:
Step 8: decode the label probabilities of each character of the test corpus to be segmented with the Viterbi algorithm to obtain the optimal sequence and complete the segmentation.
Further, in step 2, mapping the Chinese characters of the sentences in the corpus to vector representations comprises:
mapping, through a mapping dictionary and using word embedding, the Chinese characters of the sentences in the corpus to dense (non-sparse) vector representations.
Further, mapping the Chinese characters of the sentences in the corpus to dense vector representations through a mapping dictionary using word embedding comprises:
traversing the training corpus, finding all distinct characters and numbering each of them, so that identical characters share the same vector representation and different characters have different vector representations, while one additional vector represents every character that does not appear in the training corpus, i.e., the unknown character;
introducing a dropout mechanism when training the network, which randomly sets a portion of the parameters to zero.
Further, in step 3, extracting features from the vector representations through a convolutional layer comprises:
passing the vector representations through the convolutional layer to obtain a certain number of feature maps, each feature map being a one-dimensional vector whose dimension is the sentence length, and connecting the vectors of the feature maps into a matrix, which is the feature extracted by the convolutional layer.
Further, the update of the log prior probability b_ij in the routing algorithm is expressed as follows:
b_ij ← b_ij + û_(j|i) · v_j
wherein û_(j|i) is the prediction vector and v_j is the output of the j-th capsule.
Further, the cross entropy is expressed by the following formula:
CrossEntropy = -Σ_i p_real(i) · log(p_pred(i))
wherein p_real(i) denotes the true probability of label i for a character and p_pred(i) denotes the predicted probability of that label; the cross entropies of all characters are summed to obtain the cross entropy of a sentence;
The likelihood probability is expressed by the following formula:
Likelihood = p(y_real) = exp(s(y_real)) / Σ_{y'} exp(s(y'))
wherein p(y_real) denotes the probability of the correct sequence, s(·) is the sequence score computed by the conditional random field, and y' in the denominator ranges over all possible sequences.
Further, the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy + λ2 · Likelihood
wherein CrossEntropy denotes the cross entropy, Likelihood denotes the likelihood probability, and λn denotes the weights of the two terms.
The Chinese word segmentation method based on a capsule model combining multiple regularizations provided by the present invention adapts the capsule model to sequence labeling tasks and completes Chinese word segmentation with higher accuracy, which helps more complex natural language processing (NLP) tasks; by jointly combining multiple regularization terms, it improves the generalization ability of the model, realizes a degree of domain transfer, reduces the need for manual corpus annotation, and thus reduces the labor and time cost of manual corpus annotation in NLP research.
Brief description of the drawings
Fig. 1 is a schematic diagram of the capsule model for solving the handwritten digit recognition problem;
Fig. 2 is a schematic diagram of the dynamic routing algorithm;
Fig. 3 is a schematic diagram of the feedforward neural network structure;
Fig. 4 is a flowchart of the Chinese word segmentation method based on a capsule model combining multiple regularizations provided by the present invention;
Fig. 5 is a flowchart of Embodiment One;
Fig. 6 is a schematic diagram of the convolution operation.
Specific embodiment
To enable those skilled in the art to better understand the solution of the present invention, the technical solution in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention. The abbreviations and key terms appearing in the embodiments are defined as follows:
NN: Neural Network;
CNN: Convolutional Neural Network;
LSTM: Long Short-Term Memory neural network;
CTB: Chinese Treebank (Penn Chinese Treebank);
CRFs: Conditional Random Fields;
FNN: Feedforward Neural Network;
ReLU: Rectified Linear Unit, an activation function;
ME: Maximum Entropy model;
NLP: Natural Language Processing.
Embodiment one
Referring to Figs. 4 to 6, Figs. 4 and 5 show the Chinese word segmentation method based on a capsule model combining multiple regularizations provided by the present invention. Specifically, when training on a corpus of a particular domain, the method comprises:
Step 1: identify the maximum sentence length in the corpus, and pad every shorter sentence in the corpus to the maximum length with a pre-stored character.
In this embodiment the maximum sentence length is set to 128 and the corpus is CTB6.0; the purpose of this step is to fix all sentences input to the network model to a uniform length.
Step 2: map the Chinese characters of the sentences in the corpus to vector representations; through a mapping dictionary and using word embedding, the Chinese characters of the sentences in the corpus are mapped to dense (non-sparse) vector representations.
Further, mapping the Chinese characters of the sentences in the corpus to dense vector representations through a mapping dictionary using word embedding comprises:
traversing the training corpus, finding all distinct characters and numbering each of them, so that identical characters share the same vector and different characters have different vectors, while one additional vector represents every character that does not appear in the training corpus, i.e., the unknown character; and introducing a dropout mechanism when training the network, which randomly sets a portion of the parameters to zero.
In this embodiment, the mapping vector dimension for each character is set to 200. This step is realized by a mapping dictionary that maps a character to a dense vector representation: the training corpus is traversed first, all distinct characters are found, and each character is numbered. Assuming there are M characters in total, a matrix with 200 rows (the character vector mapping dimension) and M+1 columns is built; identical characters share the same vector and different characters have different vectors, and in addition to the M characters, one further vector is set up to represent every character that never appears in the training corpus, i.e., the unknown character. In this step, drawing on the idea of the denoising auto-encoder, the invention introduces a dropout mechanism that randomly sets a portion of the parameters to zero during training, so as to avoid over-fitting and to provide an efficient way of approximately combining the exponentially many different neural network structures.
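A minimal sketch of this mapping step (numpy; the vector dimension of 200, the M+1 columns and the unknown-character column follow the paragraph above, while the dropout rate of 0.2 is an illustrative choice, not a value given in the patent):

```python
import numpy as np

def build_embedding(train_sentences, dim=200, seed=0):
    """One dense dim-dimensional column per distinct character; the last column is the unknown character."""
    chars = sorted({ch for sent in train_sentences for ch in sent})
    char2id = {ch: i for i, ch in enumerate(chars)}      # identical characters share one vector
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((dim, len(chars) + 1))   # 200 rows, M+1 columns
    return char2id, table

def embed(sentence, char2id, table, dropout=0.2, train=True):
    """Map a sentence to a (dim, len(sentence)) matrix; dropout zeroes a random part during training."""
    unk = table.shape[1] - 1
    ids = [char2id.get(ch, unk) for ch in sentence]
    vecs = table[:, ids].copy()
    if train:
        vecs *= (np.random.random(vecs.shape) >= dropout)  # randomly set a portion of entries to zero
    return vecs
```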
Step 3: extract features from the vector representations through a convolutional layer; the vector representations are passed through the convolutional layer to obtain a certain number of feature maps, each feature map being a one-dimensional vector whose dimension is the sentence length, and the vectors of the feature maps are connected into a matrix, which is the feature extracted by the convolutional layer.
The number of feature maps for feature extraction is set to 200; each feature map is a one-dimensional vector whose dimension is the sentence length, and the feature maps (vectors) obtained after convolution are connected into a matrix, which is the feature extracted by the convolutional layer.
In this step, a single-layer convolutional neural network is good at extracting local features, and stacking multiple convolutional layers also learns the context well. Fig. 6 shows the convolution operation on text.
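A sketch of the convolutional feature extraction of step 3 (numpy; 200 feature maps of sentence length as stated above; the kernel width of 3, the 'same' padding and the ReLU activation are illustrative assumptions, not values fixed by the patent):

```python
import numpy as np

def conv1d_features(emb, n_filters=200, width=3, seed=0):
    """emb: (dim, L) character vectors -> (n_filters, L) matrix of one-dimensional feature maps."""
    dim, L = emb.shape
    pad = width // 2
    padded = np.pad(emb, ((0, 0), (pad, pad)))             # 'same' padding keeps sentence length
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, dim, width)) * 0.1
    out = np.empty((n_filters, L))
    for t in range(L):
        out[:, t] = np.einsum('fdw,dw->f', W, padded[:, t:t + width])
    return np.maximum(out, 0.0)                             # ReLU (assumed activation)
```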
Step 4: feed the extracted features into a primary capsule layer; through convolution operations, a scalar representation is obtained for each character of each feature map, and the obtained scalar representations are connected into a vector that serves as one capsule, thereby obtaining the primary capsules.
In this embodiment, eight convolution operations are carried out in total; each operation produces one scalar for each character from the feature maps, the eight scalars obtained for a character are concatenated into a vector that forms one capsule, and finally a feature matrix represented by capsules is obtained.
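A sketch of the primary capsule layer of step 4, in which eight convolution operations each contribute one scalar per character and the eight scalars are concatenated into one capsule (numpy; the exact parameterization of the eight convolutions is not specified in the patent, so one kernel per capsule dimension and a kernel width of 3 are assumed here):

```python
import numpy as np

def primary_capsules(features, n_convs=8, width=3, seed=0):
    """features: (n_feat, L) convolutional features -> (L, n_convs) capsule matrix:
    the k-th convolution contributes the k-th entry of each character's capsule."""
    n_feat, L = features.shape
    pad = width // 2
    padded = np.pad(features, ((0, 0), (pad, pad)))
    rng = np.random.default_rng(seed)
    kernels = rng.standard_normal((n_convs, n_feat, width)) * 0.1
    caps = np.empty((L, n_convs))
    for t in range(L):
        caps[t] = np.einsum('kdw,dw->k', kernels, padded[:, t:t + width])
    return caps
```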
Step 5: pass the obtained primary capsules through a capsule sliding window to obtain the feature representation of each character, so as to adapt the model to sequence labeling tasks.
In this embodiment the capsule sliding window is set to 7.
In this step, the capsule sliding window is a sliding window over the capsule feature matrix of the previous step, in which each column represents one character; for each character a window is selected so that the character is influenced by the n surrounding characters in total (including the character itself), and this window serves as the input of the dynamic routing algorithm.
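A sketch of the capsule sliding window of step 5 (numpy; n = 7 as in this embodiment; zero-padding at the sentence ends is an assumption the description does not spell out):

```python
import numpy as np

def capsule_sliding_window(caps, n=7):
    """caps: (L, d) one primary capsule per character.
    Returns (L, n, d): for each character, the capsules of the n characters around it
    (itself included) that are fed to the dynamic routing algorithm."""
    L, d = caps.shape
    half = n // 2
    padded = np.pad(caps, ((half, half), (0, 0)))   # zero capsules beyond the sentence ends
    return np.stack([padded[t:t + n] for t in range(L)])
```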
Step 6: use the features obtained after the capsule sliding window as the input of the routing algorithm, and obtain the tag capsules through the routing algorithm.
In this embodiment, the number of tag capsules equals the number of classes, i.e. 4; the dimension of each tag capsule is 16; and the number of iterations of the dynamic routing algorithm is 3.
The steps of the routing algorithm are enumerated as follows:
First, the capsules of the primary capsule layer, i.e. the vectors, are passed through the squash non-linear mapping of formula (2-1) to obtain the output v_j of the primary capsule layer.
Second, the output of the primary capsule layer is used as the input of the tag capsule layer; the tag capsules are computed by formula (2-2), and these capsules are then passed through the squash non-linearity to obtain the output of the tag capsule layer.
The update of the softmax logit parameter b_ij is expressed by formula (3-1):
b_ij ← b_ij + û_(j|i) · v_j    (3-1)
wherein û_(j|i) is the prediction vector and v_j is the output of the j-th capsule.
The coupling coefficients are computed by formula (3-2):
c_ij = exp(b_ij) / Σ_k exp(b_ik)    (3-2)
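A sketch of the dynamic routing of step 6 for a single character window, following formulas (2-1), (2-2), (3-1) and (3-2) above (numpy; 4 tag capsules of 16 dimensions and 3 routing iterations as stated in this embodiment; the transformation matrices W are passed in as an argument):

```python
import numpy as np

def squash(s, eps=1e-9):
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(window_caps, W, n_iter=3):
    """window_caps: (n, d_in) capsules in one sliding window.
    W: (n, n_tags, d_out, d_in) transformation matrices.
    Returns (n_tags, d_out): the tag capsules for this character."""
    u_hat = np.einsum('ijkl,il->ijk', W, window_caps)          # prediction vectors u_hat_{j|i}
    b = np.zeros(u_hat.shape[:2])                              # routing logits b_ij, initialised to 0
    v = None
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients, formula (3-2)
        s = np.einsum('ij,ijk->jk', c, u_hat)                  # total input s_j, formula (2-2)
        v = squash(s)                                          # tag capsule outputs, formula (2-1)
        b = b + np.einsum('ijk,jk->ij', u_hat, v)              # agreement update, formula (3-1)
    return v

# Example shapes for this embodiment: a window of 7 capsules of 8 dims, 4 tag capsules of 16 dims:
# W = np.random.randn(7, 4, 16, 8) * 0.1
# Step 7: the label probability of each tag is the length of the corresponding tag capsule:
# label_probs = np.linalg.norm(dynamic_routing(window, W), axis=-1)
```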
Step 7: take the length (modulus) of each tag capsule output to obtain the label probabilities of each character.
Step 8: compute the cross entropy between the label probabilities and the true label probabilities of each character, and feed the label probabilities into a conditional random field (CRF) to compute the likelihood probability.
Step 9: compute the loss function as a weighted sum of the cross entropy and the log-likelihood probability, with the log-likelihood probability acting as a regularization term of the loss function, and update the weights of each network layer through the back-propagation (BP) algorithm.
Further, the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy + λ2 · Likelihood
wherein CrossEntropy denotes the cross entropy, Likelihood denotes the likelihood probability, and λn denotes the weights of the two terms.
Further, when training on corpora from at least two domains, with one of them as the target-domain corpus for which segmentation is ultimately required, step 9 is replaced as follows:
Step 9: the loss function is a weighted sum over the cross entropy of the target-domain corpus, its log-likelihood probability, and the cross entropy of the source-domain corpus from which segmentation knowledge is transferred; the likelihood probability of the target-domain corpus and the cross entropy of the source-domain corpus serve as regularization terms, and the weight of the cross entropy of the source-domain corpus is smaller than that of the non-regularization term.
Further, the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy_target + λ2 · Likelihood_target + λ3 · CrossEntropy_source
wherein the first term is the cross entropy of the target-domain corpus, the second term is the likelihood probability of the target domain, the third term is the cross entropy of the source-domain corpus, and λn are the weights of the respective terms.
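A sketch of the weighted loss described above (the single-domain loss of step 9 is recovered by setting λ3 = 0; the λ values are illustrative, and the log-likelihood is entered with a negative sign so that minimizing the loss maximizes the sequence probability, which is one natural reading of the formula but is not spelled out in the patent):

```python
def joint_loss(ce_target, loglik_target, ce_source=0.0,
               lam1=1.0, lam2=0.1, lam3=0.05):
    """Weighted sum of: the target-domain cross entropy, its CRF log-likelihood
    (the regularization term), and, for domain transfer, the source-domain cross
    entropy with a smaller weight than the non-regularization term."""
    return lam1 * ce_target + lam2 * (-loglik_target) + lam3 * ce_source
```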
Outside training, at segmentation (inference) time, steps 8 and 9 are replaced (that is, step 9 is removed) as follows:
Step 8: decode the label probabilities of each character of the test corpus to be segmented with the Viterbi algorithm to obtain the optimal sequence and complete the segmentation.
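A sketch of the Viterbi decoding used at inference time (numpy; the per-character tag scores come from the label probabilities of step 7, and the tag-transition scores come from the CRF layer; both are passed in as arguments here):

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (L, n_tags) per-character tag scores (e.g. log label probabilities).
    transitions: (n_tags, n_tags) score of moving from tag i to tag j.
    Returns the highest-scoring tag sequence (the optimal sequence of step 8)."""
    L, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((L, T), dtype=int)
    for t in range(1, L):
        total = score[:, None] + transitions + emissions[t]   # every previous-tag / current-tag pair
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(L - 1, 0, -1):                             # backtrack the optimal path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```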
Further, the cross entropy is expressed by the following formula:
CrossEntropy = -Σ_i p_real(i) · log(p_pred(i))
wherein p_real(i) denotes the true probability of label i for a character and p_pred(i) denotes the predicted probability of that label; the cross entropies of all characters are summed to obtain the cross entropy of a sentence.
The likelihood probability is expressed by the following formula:
Likelihood = p(y_real) = exp(s(y_real)) / Σ_{y'} exp(s(y'))
wherein p(y_real) denotes the probability of the correct sequence, s(·) is the sequence score computed by the conditional random field, and y' in the denominator ranges over all possible sequences.
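For illustration only, a brute-force computation of this sequence probability (enumerating all tag sequences is feasible only for very short sentences; a real CRF computes the denominator with the forward algorithm):

```python
import numpy as np
from itertools import product

def sequence_likelihood(emissions, transitions, y_real):
    """p(y_real) = exp(score(y_real)) / sum over all possible sequences y' of exp(score(y'))."""
    L, T = emissions.shape

    def score(y):
        s = emissions[0, y[0]]
        for t in range(1, L):
            s += transitions[y[t - 1], y[t]] + emissions[t, y[t]]
        return s

    log_Z = np.log(sum(np.exp(score(y)) for y in product(range(T), repeat=L)))
    return float(np.exp(score(y_real) - log_Z))
```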
In a preferred embodiment, as shown in Fig. 4, taking the sentence "A friend in need is a friend indeed" as an example, each character of the sentence is first mapped to a dense vector of dimension n; convolution then extracts the features of each character in the sentence; the features obtained by convolution are fed into the capsule layers by connecting them to the primary capsule layer; a capsule sliding window and the iterative routing computation yield the features of the tag capsule layer, and the modulus of each vector in the tag capsule layer gives the probability of each label; finally, in the label inference layer, the Viterbi algorithm finds the optimal label sequence of the sentence and produces the segmentation.
By adding a capsule sliding window, Embodiment One of the present invention transfers the capsule model to the NLP sequence labeling task of Chinese word segmentation and completes Chinese word segmentation with higher accuracy, which helps more complex natural language processing (NLP) tasks; by jointly combining multiple regularization terms, it improves the generalization ability of the model, realizes a degree of domain transfer, reduces the need for manual corpus annotation, and thus reduces the labor and time cost of manual corpus annotation in NLP research.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or replacement that can readily be conceived by those familiar with the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A Chinese word segmentation method based on a capsule model combining multiple regularizations, characterized in that, when training on a corpus of a particular domain, the method comprises:
Step 1: identifying the maximum sentence length in the corpus, and padding every shorter sentence in the corpus to the maximum length with a pre-stored character;
Step 2: mapping the Chinese characters of the sentences in the corpus to vector representations;
Step 3: extracting features from the vector representations through a convolutional layer;
Step 4: feeding the extracted features into a primary capsule layer; through convolution operations, obtaining a scalar representation for each character of each feature map, and connecting the obtained scalar representations into a vector that serves as one capsule, thereby obtaining the primary capsules;
Step 5: passing the primary capsules through a capsule sliding window to obtain the feature representation of each character, so as to adapt the model to sequence labeling tasks;
Step 6: using the features obtained after the capsule sliding window as the input of a routing algorithm, and obtaining tag capsules through the routing algorithm;
Step 7: taking the length (modulus) of each tag capsule output to obtain the label probabilities of each character;
Step 8: computing the cross entropy between the label probabilities and the true label probabilities of each character, and feeding the label probabilities into a conditional random field (CRF) to compute the likelihood probability;
Step 9: computing the loss function as a weighted sum of the cross entropy and the log-likelihood probability, with the log-likelihood probability acting as a regularization term of the loss function, and updating the weights of each network layer through the back-propagation (BP) algorithm.
2. The method according to claim 1, characterized in that, when training on corpora from at least two domains, with one of them as the target-domain corpus for which segmentation is ultimately required, step 9 is replaced as follows:
Step 9: the loss function is a weighted sum over the cross entropy of the target-domain corpus, its log-likelihood probability, and the cross entropy of the source-domain corpus from which segmentation knowledge is transferred; the likelihood probability of the target-domain corpus and the cross entropy of the source-domain corpus serve as regularization terms, and the weight of the cross entropy of the source-domain corpus is smaller than that of the non-regularization term.
3. The method according to claim 2, characterized in that the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy_target + λ2 · Likelihood_target + λ3 · CrossEntropy_source
wherein the first term is the cross entropy of the target-domain corpus, the second term is the likelihood probability of the target domain, the third term is the cross entropy of the source-domain corpus, and λn are the weights of the respective terms.
4. The method according to claim 1, characterized in that, outside training, at segmentation (inference) time, steps 8 and 9 are replaced as follows:
Step 8: decoding the label probabilities of each character of the test corpus to be segmented with the Viterbi algorithm to obtain the optimal sequence and complete the segmentation.
5. The method according to claim 1, characterized in that, in step 2, mapping the Chinese characters of the sentences in the corpus to vector representations comprises:
mapping, through a mapping dictionary and using word embedding, the Chinese characters of the sentences in the corpus to dense (non-sparse) vector representations.
6. The method according to claim 5, characterized in that mapping the Chinese characters of the sentences in the corpus to dense vector representations through a mapping dictionary using word embedding comprises:
traversing the training corpus, finding all distinct characters and numbering each of them, so that identical characters share the same vector representation and different characters have different vector representations, while one additional vector represents every character that does not appear in the training corpus, i.e., the unknown character;
introducing a dropout mechanism when training the network, which randomly sets a portion of the parameters to zero.
7. The method according to claim 1, characterized in that, in step 3, extracting features from the vector representations through a convolutional layer comprises:
passing the vector representations through the convolutional layer to obtain a certain number of feature maps, each feature map being a one-dimensional vector whose dimension is the sentence length, and connecting the vectors of the feature maps into a matrix, which is the feature extracted by the convolutional layer.
8. The method according to claim 1, characterized in that the update of the log prior probability b_ij in the routing algorithm is expressed as follows:
b_ij ← b_ij + û_(j|i) · v_j
wherein û_(j|i) is the prediction vector and v_j is the output of the j-th capsule.
9. The method according to claim 1, characterized in that the cross entropy is expressed by the following formula:
CrossEntropy = -Σ_i p_real(i) · log(p_pred(i))
wherein p_real(i) denotes the true probability of label i for a character and p_pred(i) denotes the predicted probability of that label; the cross entropies of all characters are summed to obtain the cross entropy of a sentence.
The likelihood probability is expressed by the following formula:
Likelihood = p(y_real) = exp(s(y_real)) / Σ_{y'} exp(s(y'))
wherein p(y_real) denotes the probability of the correct sequence, s(·) is the sequence score computed by the conditional random field, and y' in the denominator ranges over all possible sequences.
10. The method according to claim 1, characterized in that the loss function is expressed by the following formula:
Loss = λ1 · CrossEntropy + λ2 · Likelihood
wherein CrossEntropy denotes the cross entropy, Likelihood denotes the likelihood probability, and λn denotes the weights of the two terms.
CN201910018546.2A 2019-01-09 2019-01-09 A Chinese word segmentation method based on a capsule model combining multiple regularizations Pending CN109766553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910018546.2A CN109766553A (en) 2019-01-09 2019-01-09 A Chinese word segmentation method based on a capsule model combining multiple regularizations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910018546.2A CN109766553A (en) 2019-01-09 2019-01-09 A Chinese word segmentation method based on a capsule model combining multiple regularizations

Publications (1)

Publication Number Publication Date
CN109766553A true CN109766553A (en) 2019-05-17

Family

ID=66453491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910018546.2A Pending CN109766553A (en) 2019-01-09 2019-01-09 A Chinese word segmentation method based on a capsule model combining multiple regularizations

Country Status (1)

Country Link
CN (1) CN109766553A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN108920467A (en) * 2018-08-01 2018-11-30 北京三快在线科技有限公司 Polysemant lexical study method and device, search result display methods

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SI LI et al.: "Capsules Based Chinese Word Segmentation for Ancient Chinese Medical Books", Special Section on AI-Driven Big Data Processing: Theory, Methodology, and Applications *
李裕礞: "Next-basket recommendation based on users' implicit feedback behavior", Journal of Chinese Information Processing *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263855A (en) * 2019-06-20 2019-09-20 深圳大学 A method of it is projected using cobasis capsule and carries out image classification
CN110263855B (en) * 2019-06-20 2021-12-14 深圳大学 Method for classifying images by utilizing common-basis capsule projection
CN112579746A (en) * 2019-09-29 2021-03-30 京东数字科技控股有限公司 Method and device for acquiring behavior information corresponding to text
CN110825849A (en) * 2019-11-05 2020-02-21 泰康保险集团股份有限公司 Text information emotion analysis method, device, medium and electronic equipment
CN111460818A (en) * 2020-03-31 2020-07-28 中国测绘科学研究院 Web page text classification method based on enhanced capsule network and storage medium
CN112270285A (en) * 2020-11-09 2021-01-26 天津工业大学 SAR image change detection method based on sparse representation and capsule network
CN112270285B (en) * 2020-11-09 2022-07-08 天津工业大学 SAR image change detection method based on sparse representation and capsule network
CN116757534A (en) * 2023-06-15 2023-09-15 中国标准化研究院 Intelligent refrigerator reliability analysis method based on neural training network
CN116757534B (en) * 2023-06-15 2024-03-15 中国标准化研究院 Intelligent refrigerator reliability analysis method based on neural training network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190517