CN110196980A - Domain transfer based on a convolutional network in the Chinese word segmentation task - Google Patents
Domain transfer based on a convolutional network in the Chinese word segmentation task
- Publication number
- CN110196980A CN110196980A CN201910487638.5A CN201910487638A CN110196980A CN 110196980 A CN110196980 A CN 110196980A CN 201910487638 A CN201910487638 A CN 201910487638A CN 110196980 A CN110196980 A CN 110196980A
- Authority
- CN
- China
- Prior art keywords
- character
- input
- vector
- source domain
- convolutional layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a domain transfer method based on a convolutional network in the Chinese word segmentation task. Building on a convolutional neural network for Chinese word segmentation, it introduces an attention mechanism into Maximum Mean Discrepancy (MMD), a method traditionally used to measure the distributional difference between domains. During training of the neural network, the attention mechanism learns which sentence information is most helpful for the domain transfer task, so that the MMD method can be applied effectively to sequence labeling tasks. At the same time, during the MMD computation, sentences capable of positive transfer receive larger weights, while unhelpful or negatively transferring sentences receive very small weights. This realizes more efficient domain transfer, reduces the need for manually annotated corpora, and relieves the labor and time pressure that corpus annotation imposes on natural language processing (NLP) research.
Description
Technical field
The present invention relates to the field of Internet technology, and more particularly to domain transfer based on a convolutional network in the Chinese word segmentation task.
Background technique
With the development of computer technology, computing power has gradually increased, and machine learning and deep learning have advanced further. Natural language processing is gradually being applied to many scenarios: text classification is used to mine user preferences from film reviews and shopping comments, summarization is used to condense articles such as news, and machine translation enables simultaneous interpretation, among others. Many application scenarios need these techniques, and as the number of domestic Internet users grows, the information they generate keeps increasing. For such massive data, automatic text processing becomes ever more significant. Because natural language processing is irreplaceable and extremely efficient for text processing, it has attracted broad social attention. Domestically, Chinese text processing is closely bound up with daily life; Chinese word segmentation, as a foundational task of natural language processing, is crucial to the development of other natural language processing tasks.
The Chinese word segmentation task splits a Chinese sentence or paragraph into words, so that higher-level natural language processing tasks on Chinese can exploit the additional information carried by words to improve performance. Segmentation is necessary because, in modern Chinese, a word carrying a specific meaning is usually composed of two or more characters and cannot be understood from individual characters alone; the same character often has different meanings in different words. Therefore, before performing other natural language processing tasks on Chinese text, segmentation must be carried out first. Relatively low-level tasks such as part-of-speech tagging and named entity recognition depend on segmentation especially heavily, and the accuracy of Chinese word segmentation directly affects the performance of these tasks.
The Chinese word segmentation task uses an algorithm to let a computer automatically process Chinese text and split it into words. Traditional methods for Chinese word segmentation include forward maximum matching, backward maximum matching, segmentation algorithms with probabilistic disambiguation, conditional random fields, structured perceptrons, and maximum entropy models. Among deep learning methods developed in recent years, feed-forward networks, long short-term memory networks, and convolutional neural networks have all been applied to the Chinese word segmentation task and achieve high accuracy on several large corpora.
Neural network methods require large amounts of annotated data. However, existing large-scale corpora cover only the news domain; there are almost no large corpora for the patent, literature, or medical domains, so existing neural network techniques struggle to reach high accuracy in those domains. In recent years, domain transfer methods have therefore been applied to the Chinese word segmentation task, aiming to use existing large-scale annotated corpora to improve segmentation accuracy in domains with no or only a small amount of annotated data. In domain transfer, the corpus with large-scale annotation is called the source-domain data, and the corpus with no or only a small amount of annotation is called the target-domain data. Domain transfer using unannotated target-domain data is called unsupervised domain transfer; domain transfer using a small amount of annotated target-domain data is called semi-supervised domain transfer.
Among existing domain transfer techniques for Chinese word segmentation, one group of methods is dictionary-based, using trained word and character vectors to realize domain transfer; another group modifies the model to represent transferable information directly, realizing domain transfer by extracting transferable feature information from the large-scale annotated corpus.
As shown in Figure 1, the prior-art article "Learning Transferable Features with Deep Adaptation Networks" proposes solving the domain transfer problem in image classification with a Deep Adaptation Network:
First, the network is pre-trained on an image dataset containing a large amount of image data; second, the network is fine-tuned using source-domain labeled data, or source-domain labeled data together with a small amount of target-domain labeled data. During fine-tuning, the first three convolutional layers, which extract general features, have their parameters frozen; the following two convolutional layers, which extract domain-specific features, are fine-tuned; and the last three fully connected layers are adapted via MK-MMD (Multi-kernel Maximum Mean Discrepancy).
As shown in Fig. 2, the prior-art article "Neural Aggregation Network for Video Face Recognition" proposes solving face recognition in video with an attention mechanism:
First, each frame of the video is passed through a convolutional neural network, GoogLeNet, which generates a 128-dimensional face feature for each frame. Second, the features are fed into the first attention module, which contains a learnable parameter q and outputs aggregated features as follows:
e_k = q^T f_k    Formula (2-1)
where f_k is the CNN-extracted feature of each frame, q is the learnable kernel, and e_k is the unnormalized weight of each frame;
a_k = exp(e_k) / Σ_j exp(e_j)    Formula (2-2)
where a_k is the normalized weight of each frame;
r = Σ_k a_k f_k    Formula (2-3)
where r, the aggregated feature weighted by the attention kernel, is independent of the order of the frames.
Third proceeds immediately to second attention module by the aggregation features of an attention module, into
The further characteristic aggregation of row, the learning to assess of second attention module are calculated as follows:
q1=tanh (Wr0+ b) formula (2-4)
Wherein, W is weight matrix, and b is bias term, and the two is all the parameter that can learn, and tanh is that tanh is non-linear
Function, r0Indicate the output of first attention module, q1The core of second attention module, aggregation features r1Meter
The same formula of calculation process (2-1), formula (2-2) and formula (2-3).
Finally, realizing identification mission by mean comparisons' loss function training network.
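The aggregation in Formulas (2-1) to (2-3) can be sketched in plain Python as follows; this is an illustrative simplification with made-up toy features, not the paper's implementation (which uses learned 128-dimensional GoogLeNet features):

```python
import math

def attention_aggregate(frames, q):
    """Aggregate per-frame features into one vector via attention kernel q.
    Implements Formulas (2-1)-(2-3): e_k = q . f_k, a = softmax(e), r = sum_k a_k f_k."""
    e = [sum(qi * fi for qi, fi in zip(q, f)) for f in frames]   # (2-1)
    m = max(e)
    exp_e = [math.exp(v - m) for v in e]                          # numerically stable softmax
    s = sum(exp_e)
    a = [v / s for v in exp_e]                                    # (2-2)
    dim = len(frames[0])
    r = [sum(a[k] * frames[k][d] for k in range(len(frames)))     # (2-3)
         for d in range(dim)]
    return r, a

# toy example: three 2-D "frame features", kernel favoring the first axis
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
r, a = attention_aggregate(frames, q=[1.0, 0.0])
```

The second module would then compute q_1 = tanh(W r + b) and call the same aggregation again with q_1. Note that r is invariant to the order of the frames, matching the remark after Formula (2-3).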
The inventor found in the course of research that the prior art of "Learning Transferable Features with Deep Adaptation Networks" and "Neural Aggregation Network for Video Face Recognition" has the following problems:
1. Both target image recognition tasks; they are not suited to sequence labeling in natural language processing, and thus not to the Chinese word segmentation task.
2. Only the traditional MMD method is considered; the possibility that, during the MMD computation, some source-domain samples are irrelevant or even counterproductive is not taken into account.
These technical problems give the prior art the following shortcomings:
1. Directly applying the traditional MMD method to the Chinese word segmentation task performs poorly.
2. Because source-domain samples differ, it is difficult, during the MMD computation, to draw only on the samples beneficial to the target domain.
Summary of the invention
To solve the above technical problems, the present invention provides a domain transfer method based on a convolutional network in the Chinese word segmentation task. Building on improvements to the traditional MMD method, it can be applied to sequence labeling tasks in natural language processing; at the same time, by adding an attention module, the model can adaptively select, in the MMD computation, the source-domain samples beneficial for transfer to the target data and suppress noise, improving the effect of domain transfer on the Chinese word segmentation task.
The present invention provides a domain transfer method based on a convolutional network in the Chinese word segmentation task. When training without annotated target-domain corpus data, the method comprises:
Step 1: divide the corpus into source-domain data and target-domain data, and pad the domain containing fewer sentences, by cycling through it, until it matches the sentence count of the other domain;
Step 2: map the Chinese characters of the source domain and the target domain to vector representations using the same dictionary, so that the input text to be segmented is numericalized into a matrix formed by concatenating the character vectors;
Step 3: feed the numerical matrix into the feature convolutional layers to extract the source-domain and target-domain feature representations;
Step 4: feed the extracted source-domain feature representation into the attention module to compute a weight vector; the dot product of the weight vector with the extracted source-domain feature representation gives the weighted source-domain feature representation;
Step 5: use the weighted source-domain feature representation and the extracted target-domain feature representation as the two inputs of the MMD computation to obtain the MMD result;
Step 6: feed the extracted source-domain feature representation into the classification convolutional layer to obtain the predicted label probability of each character;
Step 7: feed the label probabilities of each character, together with the true label probabilities, into a conditional random field (Conditional Random Field, CRF) to compute the likelihood;
Step 8: take as the loss function the weighted sum of the negative log-likelihood and the MMD result, with the MMD result acting as a regularization term, and update each layer's weights of the network via the back-propagation (Back Propagation, BP) algorithm.
Further, when the training data contain a small amount of annotated target-domain corpus data, Step 6 is replaced as follows:
Step 6: feed the extracted source-domain feature representation, together with the small amount of target-domain feature representations carrying true labels, into the classification convolutional layer to obtain the predicted label probability of each character.
Further, outside training, Chinese word segmentation replaces Steps 1 to 8 as follows:
Step 1: take the target-domain data to be segmented as the input of the neural network;
Step 2: map the Chinese characters of the target-domain data to be segmented to vector representations, using the same dictionary as in training;
Step 3: feed the vector representations into the feature convolutional layers to extract the feature representations;
Step 4: feed the feature representations into the classification convolutional layer to obtain the predicted label probability of each character;
Step 5: decode the predicted label probabilities of each character with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation.
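Step 5's Viterbi decoding can be sketched as follows; a minimal illustration in which emission and transition scores are supplied directly (the function name and score layout are invented for this sketch, and the CRF transition matrix plays the role of `trans`):

```python
def viterbi(emit, trans):
    """Find the best-scoring label sequence.
    emit: list of dicts, emit[t][label] = score of label at position t.
    trans: dict, trans[(prev, cur)] = transition score."""
    labels = list(emit[0])
    # best running score per label, plus backpointers for each later position
    score = [{l: emit[0][l] for l in labels}]
    back = []
    for t in range(1, len(emit)):
        cur, ptr = {}, {}
        for l in labels:
            prev = max(labels, key=lambda p: score[-1][p] + trans[(p, l)])
            cur[l] = score[-1][prev] + trans[(prev, l)] + emit[t][l]
            ptr[l] = prev
        score.append(cur)
        back.append(ptr)
    # backtrack from the best final label
    best = max(labels, key=lambda l: score[-1][l])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With the {B, M, E, S} tagset, the returned path is the optimal label sequence from which the segmentation is read off.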
Further, in Step 2, mapping the Chinese characters of the source domain and the target domain to vector representations using the same dictionary comprises:
With a randomly initialized mapping dictionary, the word embedding method randomly initializes one identical dense vector for each distinct character, and each Chinese character of the corpus data is then mapped to its dense vector representation through the mapping dictionary;
With a pre-trained mapping dictionary, the bag-of-words models Skip-Gram or Continuous Bag-of-Words (CBOW) are trained to obtain vector representations containing some lexical information, and each Chinese character of the corpus data is mapped to its dense vector representation through the mapping dictionary.
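The randomly initialized mapping dictionary described above can be sketched as follows; a simplified illustration (the helper names, the toy dimension, and the use of Python's `random` module are assumptions of this sketch, not the patent's implementation):

```python
import random

def build_char_dict(corpus, dim=8, seed=0):
    """Number each distinct character and give it one dense vector;
    index 0 is reserved for characters unseen in the training corpus."""
    rng = random.Random(seed)
    char2id = {"<UNK>": 0}
    for sent in corpus:
        for ch in sent:
            if ch not in char2id:
                char2id[ch] = len(char2id)
    # one dense vector per entry; identical characters share one vector by construction
    vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in char2id]
    return char2id, vectors

def lookup(ch, char2id, vectors):
    """Map a character to its dense vector; unknown characters use the UNK slot."""
    return vectors[char2id.get(ch, 0)]
```

A sentence is then numericalized by concatenating `lookup(ch, …)` for each character, giving the input matrix of Step 2.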
Further, in Step 3, the numerical matrix is fed into the feature convolutional layers to extract the feature representations, computed as:
y = f(m ⊛ x + b)
where m ∈ R^(d×w) is a convolution kernel of window size w, with d equal to the number of rows of the input matrix x; ⊛ denotes the convolution operation; x is the numerical matrix or the output of the previous feature convolutional layer; b is a bias term; f is the rectified linear unit (Rectified Linear Unit, ReLU); and y is a vector of dimension n, the feature extracted by the feature convolutional layer;
where the ReLU function is as follows:
f(x) = max(0, x)
where, when the input is a vector or matrix, x ranges over the elements of the vector or matrix.
Further, in Step 4, the extracted source-domain feature representation is fed into the attention module to compute a weight vector, and the dot product of the weight vector with the extracted source-domain feature representation gives the weighted source-domain feature representation, computed as:
ŷ = g(k ⊙ y) ⊙ y
where k ∈ R^(i×l×d) is the weight matrix; i is the number of sentences input to the neural network; l is the fixed length of the input sentences; d is the dimension of the feature representation; ⊙ denotes the dot (element-wise) product; y is the concatenation of the i feature representations extracted by the feature convolutional layers; g first averages the second and third dimensions of the dot-product result and then applies a softmax; and ŷ is the weighted source-domain feature representation;
where the softmax is computed as follows:
softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
where x is a vector and x_i is its i-th element.
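One plausible reading of this sentence-level weighting, assuming g averages each sentence's dot-product result over the length and feature dimensions before the softmax, can be sketched as follows (toy shapes and values; not the patent's implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def weight_sentences(y, k):
    """y, k: i x l x d nested lists (sentences x length x feature dim).
    Returns per-sentence weights a and the weighted features y_hat."""
    scores = []
    for s in range(len(y)):
        # element-wise product, then average over length and feature dims (the role of g)
        vals = [y[s][t][f] * k[s][t][f]
                for t in range(len(y[s])) for f in range(len(y[s][t]))]
        scores.append(sum(vals) / len(vals))
    a = softmax(scores)                     # one weight per source-domain sentence
    y_hat = [[[a[s] * v for v in row] for row in y[s]] for s in range(len(y))]
    return a, y_hat
```

Sentences whose features align with the learned weight matrix k receive larger weights, which is how the module can emphasize source sentences helpful for transfer.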
Further, in Step 5, the MMD is computed as:
MMD²(p, q) = (1/n_s²) Σ_{i,j} k(x_i^s, x_j^s) − (2/n_s²) Σ_{i,j} k(x_i^s, x_j^t) + (1/n_s²) Σ_{i,j} k(x_i^t, x_j^t)
where MMD²(p, q) denotes the MMD result; p and q are the distributions of the two domains' data; n_s is the number of source-domain data inputs; k(·, ·) denotes the Gaussian kernel function; and x^s and x^t denote the weighted source-domain features ŷ and the target-domain features y, respectively;
where the Gaussian kernel is computed as follows:
k(x, z) = exp(−‖x − z‖² / (2σ²))
where x and z are the two inputs of the Gaussian kernel and σ is the Gaussian kernel bandwidth.
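The Gaussian-kernel MMD above can be sketched with a simple empirical estimator over two equal-sized feature sets; an illustrative simplification (single fixed bandwidth, plain Python lists) rather than the patent's implementation:

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Squared MMD estimate between equal-sized sample sets xs (source) and xt (target)."""
    n = len(xs)
    k_ss = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / n ** 2
    k_tt = sum(gaussian_kernel(a, b, sigma) for a in xt for b in xt) / n ** 2
    k_st = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xt) / n ** 2
    return k_ss - 2 * k_st + k_tt
```

When the two sample sets are identical the estimate is zero, and it grows as the two feature distributions move apart, which is what makes it usable as a regularization term.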
Further, in Step 6, the computation that produces the predicted label probabilities of each character in the classification convolutional layer is the same as in the feature convolutional layers, except that the ReLU nonlinearity f used by the feature convolutional layers is replaced by a softmax computation.
Further, in Step 7, the likelihood is computed as:
p(y* | S) = exp(score(S, y*)) / Σ_{y′} exp(score(S, y′))
where S and y* denote the input sentence and the true label sequence of the sentence; score maps the predicted label probabilities of each character, through a transition matrix, to the predicted probability of the whole sentence; and y′ ranges over all possible label sequences of the sentence;
where the score function is computed as follows:
score(S, y) = Σ_{i=1}^{n} (A_{y_{i−1}, y_i} + s(y_i | S))
where A_{ij} is the transition matrix, s(·) is the label probability of a single character predicted by the neural network, and n is the sequence length.
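The CRF score and likelihood can be sketched as follows; an illustrative brute-force version that enumerates all label sequences for the partition function (a real CRF uses the forward algorithm, and this sketch's handling of the first position, with no incoming transition, is an assumption):

```python
import math
from itertools import product

def sentence_score(emissions, A, labels):
    """score(S, y) = sum_i A[y_{i-1}][y_i] + s(y_i | S).
    emissions[t][l] is the per-character label score; A is the transition matrix."""
    total = emissions[0][labels[0]]
    for i in range(1, len(labels)):
        total += A[labels[i - 1]][labels[i]] + emissions[i][labels[i]]
    return total

def log_likelihood(emissions, A, gold):
    """log p(y* | S) = score(S, y*) - log sum_{y'} exp(score(S, y'))."""
    n_labels = len(A)
    n = len(emissions)
    log_z = math.log(sum(
        math.exp(sentence_score(emissions, A, list(y)))
        for y in product(range(n_labels), repeat=n)))
    return sentence_score(emissions, A, gold) - log_z
```

The training objective of Step 8 then takes the negative of this log-likelihood and adds the weighted MMD term.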
Further, in Step 8, the loss function is computed as follows:
L = −log p(y* | S) + λ · MMD²(p, q)
where λ denotes the weight of the MMD regularization term.
The domain transfer method based on a convolutional network in the Chinese word segmentation task provided by the present invention applies the traditional MMD method to a sequence labeling task, realizing feature-level domain transfer in Chinese word segmentation and helping broaden the application of domain transfer methods to sequence labeling tasks in natural language processing. By adding an attention module, the model can adaptively select the source-domain samples used in the MMD computation and suppress noise, realizing more efficient feature-level domain transfer. By exploiting existing large-scale annotated data, the method improves Chinese word segmentation accuracy on small corpora and relieves the pressure of manual corpus annotation.
Detailed description of the drawings
Fig. 1 is a schematic diagram of the Deep Adaptation Network used to solve image recognition tasks;
Fig. 2 is a schematic diagram of the Neural Aggregation Network;
Fig. 3 is the flow chart of Embodiment One;
Fig. 4 is the flow chart of the domain transfer method based on a convolutional network in the Chinese word segmentation task provided by the present invention.
Specific embodiment
To enable those skilled in the art to better understand the solution of the present invention, the technical scheme in the embodiments of the invention is described below clearly and completely in conjunction with the accompanying drawings. Evidently, the described embodiments are only a part of the embodiments of the invention, not all of them; all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present invention. The abbreviations and key terms appearing in this embodiment are defined as follows:
BP: Back Propagation;
CBOW: Continuous Bag-of-Words;
CNN: Convolutional Neural Network;
CRF: Conditional Random Field;
CTB: Chinese Treebank (Penn Chinese Treebank);
LSTM: Long Short-Term Memory neural network;
ME: Maximum Entropy model;
MMD: Maximum Mean Discrepancy;
MK-MMD: Multi-Kernel Maximum Mean Discrepancy;
NLP: Natural Language Processing;
NN: Neural Network;
PKU: Peking University open corpus;
ReLU: Rectified Linear Unit, an activation function.
Embodiment one
Referring to Figs. 3 and 4, which show the domain transfer method based on a convolutional network in the Chinese word segmentation task provided by the present invention. Specifically, when training without annotated target-domain corpus data, the method comprises:
Step 1: divide the corpus into source-domain data and target-domain data, and pad the domain containing fewer sentences, by cycling through it, until it matches the sentence count of the other domain;
In this embodiment, the maximum sentence length is set to 128; the large-scale annotated source-domain corpora are PKU, CTB5, and CTB7; the target-domain corpora are a patent corpus, a corpus of the novel Zhuxian, and a medical-forum corpus. Meanwhile, in each training batch, half of the sentences are source-domain data and the other half are target-domain data.
Step 2: map the Chinese characters of the source domain and the target domain to vector representations using the same dictionary, so that the input text to be segmented is numericalized into a matrix formed by concatenating the character vectors;
Further, the pre-trained mapping dictionary uses the bag-of-words models Skip-Gram or Continuous Bag-of-Words (CBOW) to obtain vector representations containing some lexical information, with each Chinese character of the corpus data mapped to its dense vector representation through the mapping dictionary, comprising:
pre-training character vectors on the large body of Wikipedia text; building the mapping dictionary by finding all distinct characters and numbering each, so that identical characters share the same vector representation, different characters have different vector representations, and one additional vector represents all characters that never occur in the training corpus, for unknown characters; and introducing a dropout mechanism when training the network, which randomly zeroes a part of the parameters.
In this embodiment, character vectors are pre-trained with Skip-Gram, and the mapped vector dimension of each character is set to 200. This step maps characters to dense (non-sparse) vector representations through a mapping dictionary: the training corpus is first traversed to find all distinct characters, and each is numbered; assuming there are M characters, a matrix of 200 rows (the mapping dimension) and M+1 columns is built, in which identical characters share the same vector, different characters have different vectors, and, beyond the M characters, one additional vector represents all characters not occurring in the training corpus, for unknown characters. In this step, the invention introduces a dropout mechanism that randomly zeroes a part of the parameters when training the network; this avoids overfitting and effectively provides a method of combining many different neuronal structures.
Step 3: feed the numerical matrix into the feature convolutional layers to extract the source-domain and target-domain feature representations;
Further, in Step 3, the numerical matrix is fed into the feature convolutional layers to extract the feature representations, computed as:
y = f(m ⊛ x + b)
where m ∈ R^(d×w) is a convolution kernel of window size w, with d equal to the number of rows of the input matrix x; ⊛ denotes the convolution operation; x is the numerical matrix or the output of the previous feature convolutional layer; b is a bias term; f is the rectified linear unit (Rectified Linear Unit, ReLU); and y is a vector of dimension n, the feature extracted by the feature convolutional layer;
In this embodiment, the dimension of each character feature extracted by the feature convolutional layers is 200, the convolution kernel size is set to 3, and the number of feature convolutional layers is set to 4; meanwhile, the source-domain and target-domain feature representations are computed by shared convolutional layers.
Step 4: feed the extracted source-domain feature representation into the attention module to compute a weight vector; the dot product of the weight vector with the extracted source-domain feature representation gives the weighted source-domain feature representation;
Further, in Step 4, this is computed as:
ŷ = g(k ⊙ y) ⊙ y
where k ∈ R^(i×l×d) is the weight matrix; i is the number of sentences input to the neural network; l is the fixed length of the input sentences; d is the dimension of the feature representation; ⊙ denotes the dot (element-wise) product; y is the concatenation of the i feature representations extracted by the feature convolutional layers; g first averages the second and third dimensions of the dot-product result and then applies a softmax; and ŷ is the weighted source-domain feature representation;
In this embodiment, the number of sentences input to the neural network is set to 16.
Step 5: use the weighted source-domain feature representation and the extracted target-domain feature representation as the two inputs of the MMD computation to obtain the MMD result;
Further, in Step 5, the MMD is computed as:
MMD²(p, q) = (1/n_s²) Σ_{i,j} k(x_i^s, x_j^s) − (2/n_s²) Σ_{i,j} k(x_i^s, x_j^t) + (1/n_s²) Σ_{i,j} k(x_i^t, x_j^t)
where MMD²(p, q) denotes the MMD result; p and q are the distributions of the two domains' data; n_s is the number of source-domain data inputs; k(·, ·) denotes the Gaussian kernel function; and x^s and x^t denote the weighted source-domain features ŷ and the target-domain features y, respectively.
Step 6: feed the extracted source-domain feature representation into the classification convolutional layer to obtain the predicted label probability of each character;
Further, in Step 6, the computation that produces the predicted label probabilities of each character in the classification convolutional layer is the same as in the feature convolutional layers, except that the ReLU nonlinearity f is replaced by a softmax computation;
In this embodiment, each character has four labels, {B, M, E, S}, where B denotes the beginning character of a word, M a middle character of a word, E the ending character of a word, and S a single-character word; the classification convolutional layer therefore outputs a feature of dimension 4 for each character.
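Once a {B, M, E, S} label sequence has been decoded, recovering the words is mechanical; a small illustrative helper (the function name is invented for this sketch):

```python
def labels_to_words(chars, labels):
    """Group characters into words according to BMES tags:
    B = word begin, M = word middle, E = word end, S = single-character word."""
    words, current = [], ""
    for ch, tag in zip(chars, labels):
        if tag == "S":
            if current:              # tolerate a malformed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        else:                        # "E": close the current word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words
```

For example, characters tagged B E S B E yield three words: a two-character word, a single-character word, and another two-character word.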
Step 7: feed the label probabilities of each character, together with the true label probabilities, into a conditional random field (Conditional Random Field, CRF) to compute the likelihood;
Further, in Step 7, the likelihood is computed as:
p(y* | S) = exp(score(S, y*)) / Σ_{y′} exp(score(S, y′))
where S and y* denote the input sentence and the true label sequence of the sentence; score maps the predicted label probabilities of each character, through a transition matrix, to the predicted probability of the whole sentence; and y′ ranges over all possible label sequences of the sentence.
Step 8: take as the loss function the weighted sum of the negative log-likelihood and the MMD result, with the MMD result acting as a regularization term, and update each layer's weights of the network via the back-propagation (Back Propagation, BP) algorithm;
Further, in Step 8, the loss function is computed as follows:
L = −log p(y* | S) + λ · MMD²(p, q)
where λ denotes the weight of the MMD regularization term.
Further, when the training data contain a small amount of annotated target-domain corpus data, Step 6 is replaced as follows:
Step 6: feed the extracted source-domain feature representation, together with the small amount of target-domain feature representations carrying true labels, into the classification convolutional layer to obtain the predicted label probability of each character.
Further, outside training, Chinese word segmentation replaces Steps 1 to 8 as follows:
Step 1: take the target-domain data to be segmented as the input of the neural network;
Step 2: map the Chinese characters of the target-domain data to be segmented to vector representations, using the same dictionary as in training;
Step 3: feed the vector representations into the feature convolutional layers to extract the feature representations;
Step 4: feed the feature representations into the classification convolutional layer to obtain the predicted label probability of each character;
Step 5: decode the predicted label probabilities of each character with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation.
In a preferred embodiment, as shown in Fig. 3, each character of a sentence is first mapped to a dense vector of dimension n, and convolution extracts the feature of each character in the sentence. During training, the convolutional features are divided into source domain and target domain; the attention module computes a weight for each source-domain sentence; the source-domain features multiplied by these weights form the first input of the MMD computation, and the target-domain features form the second input; the MMD module computes the MMD value, which, multiplied by the regularization weight, serves as the regularization term of the loss function. The source-domain features pass through the classification convolutional layer to obtain the predicted probability of each label for each character, and the CRF computes the log-likelihood; the negative log-likelihood added to the MMD regularization term gives the final loss function, and the model updates its parameters via the BP algorithm with the objective of minimizing the loss. Outside training, the predicted label probabilities of each character obtained from the classification convolutional layer are passed directly to the Viterbi algorithm, which computes the final label sequence and completes the segmentation.
By building on the classical MMD method, Embodiment 1 of the present invention realizes feature-level domain migration in the Chinese word segmentation task and extends the domain-migration methods available for Chinese word segmentation. Introducing the attention module into the MMD computation lets the model autonomously select the source-domain samples that benefit the target-domain data, suppresses noise during domain migration, achieves more efficient migration, relieves the pressure of large-scale corpus annotation, and improves the accuracy of domain migration on small labeled data sets.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The above is only a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto; any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A convolutional-network-based domain migration method for the Chinese word segmentation task, characterized in that, when training without labeled target-domain corpus data, the method comprises:
Step 1: divide the corpus into source-domain data and target-domain data, and pad the domain containing fewer sentences, by cycling through it, until it matches the sentence count of the other domain;
Step 2: map the Chinese characters of the source domain and the target domain to vector representations using the same dictionary; the input text to be segmented is thereby numericalized into the numerical matrix formed by concatenating the character vectors;
Step 3: input the numerical matrix into the feature convolutional layer to extract the source-domain feature representation and the target-domain feature representation;
Step 4: input the extracted source-domain feature representation into the attention module to compute a weight vector, and take the element-wise product of the weight vector and the extracted source-domain feature representation to obtain the weighted source-domain feature representation;
Step 5: take the weighted source-domain feature representation and the extracted target-domain feature representation as the two inputs of the MMD computation to obtain the MMD result;
Step 6: input the extracted source-domain feature representation into the classification convolutional layer to obtain the prediction label probability of each character;
Step 7: input the label probability of each character together with the true label into the conditional random field (Conditional Random Field, CRF) to calculate the likelihood probability;
Step 8: the loss function is the weighted sum of the negative log-likelihood and the MMD result, with the MMD result serving as a regularization term; compute and update each layer's weights of the network by the back-propagation algorithm (Back Propagation, BP).
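Step 1 of the claim — padding the smaller corpus by cycling through it until both domains have equal sentence counts — can be sketched as follows; the function name and toy corpora are illustrative:

```python
from itertools import cycle, islice

def equalize_by_cycling(source, target):
    """Repeat the smaller corpus cyclically until both have the same length."""
    n = max(len(source), len(target))
    return (list(islice(cycle(source), n)),
            list(islice(cycle(target), n)))

src = ["s1", "s2", "s3", "s4", "s5"]   # 5 source-domain sentences
tgt = ["t1", "t2"]                     # only 2 target-domain sentences
src_eq, tgt_eq = equalize_by_cycling(src, tgt)
print(tgt_eq)  # → ['t1', 't2', 't1', 't2', 't1']
```

Equal sample counts simplify the MMD estimate of claim 7, since both inputs then contain the same number of feature vectors.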
2. The method of claim 1, characterized in that, when the training data contains a small amount of labeled target-domain corpus, step 6 is replaced as follows:
Step 6: input both the extracted source-domain feature representations and the small amount of target-domain feature representations carrying true labels into the classification convolutional layer to obtain the prediction label probability of each character.
3. The method of claim 1, characterized in that, in the non-training case, when performing Chinese word segmentation, steps 1 to 8 are replaced as follows:
Step 1: take the target-domain data to be segmented as the input of the neural network;
Step 2: map the Chinese characters of the data to be segmented to vector representations using the same dictionary as in training;
Step 3: input the vector representations into the feature convolutional layer to extract the feature representations;
Step 4: input the feature representations into the classification convolutional layer to obtain the prediction label probability of each character;
Step 5: decode the prediction label probabilities of the characters with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation.
4. The method of claim 1, characterized in that, in step 2, mapping the Chinese characters of the source domain and the target domain to vector representations using the same dictionary comprises:
for a randomly initialized mapping dictionary, using the word-embedding method to randomly initialize an identical dense vector representation for each identical character, and then mapping each Chinese character of the corpus data to its dense vector representation through the mapping dictionary;
for a pre-trained mapping dictionary, using the bag-of-words models Skip-Gram or Continuous Bag-of-Words (CBOW) to train vector representations that carry word information, and then mapping each Chinese character of the corpus data to its dense vector representation through the mapping dictionary.
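The randomly initialized variant of claim 4 can be sketched as below; the embedding dimension and toy corpus are illustrative assumptions, and the pre-trained variant would simply fill the same dictionary with Skip-Gram or CBOW vectors instead of random ones:

```python
import numpy as np

def build_embedding(chars, dim=8, seed=0):
    """Map each distinct character to one randomly initialized dense vector."""
    rng = np.random.default_rng(seed)
    return {c: rng.standard_normal(dim) for c in sorted(set(chars))}

def embed(sentence, table):
    """Numericalize a sentence: stack character vectors into a matrix."""
    return np.stack([table[c] for c in sentence])

corpus = "今天天气好"                 # 5 characters
table = build_embedding(corpus, dim=8)
mat = embed(corpus, table)
print(mat.shape)                     # one row per character: (5, 8)
# identical characters (the two 天) share one vector, as the claim requires
```

Both domains look characters up in this one table, which is what lets the downstream convolutional layers compare source- and target-domain features in a shared space.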
5. The method of claim 1, characterized in that, in step 3, the numerical matrix is input into the feature convolutional layer and the feature representation is extracted by the calculation y = f(m ⊛ x + b), wherein m ∈ R^(d×w) is a convolution kernel of window size w, with d equal to the number of rows of the input matrix x, ⊛ denotes the convolution operation, x is the numerical matrix or the output of the previous feature convolutional layer, b is a bias term, f is the rectified linear unit (Rectified Linear Unit, ReLU), and y is a vector of dimension n, i.e., the feature extracted by the feature convolutional layer.
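Claim 5's per-window computation y = f(m ⊛ x + b) can be sketched as a width-w 1-D convolution over the character-vector matrix; the shapes and values below are illustrative, not the patent's:

```python
import numpy as np

def feature_conv(x, m, b):
    """1-D convolution over character positions followed by ReLU.

    x: (d, n) input matrix with d rows and n character positions
    m: (d, w) convolution kernel of window size w
    b: scalar bias
    returns y: vector of length n - w + 1
    """
    d, n = x.shape
    _, w = m.shape
    y = np.array([np.sum(m * x[:, t:t + w]) + b   # m "convolved" with window t
                  for t in range(n - w + 1)])
    return np.maximum(y, 0.0)                     # f = ReLU

x = np.arange(12, dtype=float).reshape(3, 4)      # d=3 rows, n=4 characters
m = np.ones((3, 2))                               # window size w=2
y = feature_conv(x, m, b=-20.0)
print(y)  # → [ 7. 13. 19.]
```

In practice many such kernels run in parallel, and stacking their outputs yields the per-character feature representation used by the later steps.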
6. The method of claim 1, characterized in that, in step 4, the extracted source-domain feature representation is input into the attention module, a weight vector is computed, and the element-wise product of the weight vector and the extracted source-domain feature representation yields the weighted source-domain feature representation, calculated as ŷ = g(k ⊙ y) ⊙ y, wherein k ∈ R^(i×l×d) is a weight matrix, i denotes the number of sentences input to the neural network, l denotes the fixed length of an input sentence, d is the dimension of the feature representation, ⊙ denotes the element-wise product, y is the concatenation of the i feature representations extracted by the feature convolutional layer, g denotes first averaging the second and third dimensions of the element-wise product and then applying a softmax, and ŷ is the weighted source-domain feature representation.
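Claim 6's attention weighting can be sketched as follows. The reading that averaging the second and third dimensions produces one scalar score per sentence, with softmax taken across the i sentences, is an interpretation of the claim, not a verbatim implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weight(y, k):
    """Weighted features ŷ = g(k ⊙ y) ⊙ y.

    y, k: (i, l, d) feature representations and weight matrix; g averages
    the length (l) and feature (d) dimensions, then applies softmax over
    the i sentences, so each sentence gets one attention weight.
    """
    scores = (k * y).mean(axis=(1, 2))   # one scalar per sentence
    w = softmax(scores)                  # attention weights, sum to 1
    return w[:, None, None] * y          # rescale each sentence's features

rng = np.random.default_rng(0)
y = rng.standard_normal((4, 5, 3))       # i=4 sentences, l=5, d=3
k = rng.standard_normal((4, 5, 3))
y_hat = attention_weight(y, k)
print(y_hat.shape)                       # same shape as y: (4, 5, 3)
```

Down-weighting unhelpful source sentences before the MMD computation is what lets the model "autonomously select" source samples, as the description puts it.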
7. The method of claim 1, characterized in that, in step 5, since step 1 pads both domains to the same number n_s of inputs, the MMD is calculated as MMD(p, q) = (1/n_s²) Σ_{i,j} [ k(x_i^s, x_j^s) − 2 k(x_i^s, x_j^t) + k(x_i^t, x_j^t) ], wherein MMD(p, q) denotes the MMD result, p and q are respectively the distributions of the two domains' data, n_s denotes the total number of source-domain data inputs, k(·, ·) denotes the Gaussian kernel function, and x^s and x^t respectively denote the weighted source-domain features ŷ and the target-domain features y.
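Claim 7's MMD with a Gaussian kernel can be sketched as below, using the biased estimator over equally sized samples as produced by step 1; the kernel bandwidth is an illustrative choice, since the patent does not fix one:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Squared MMD between source features xs and target features xt."""
    kss = gaussian_kernel(xs, xs, sigma).mean()   # source-source term
    ktt = gaussian_kernel(xt, xt, sigma).mean()   # target-target term
    kst = gaussian_kernel(xs, xt, sigma).mean()   # cross term
    return kss + ktt - 2 * kst

rng = np.random.default_rng(1)
same = mmd2(rng.standard_normal((50, 4)), rng.standard_normal((50, 4)))
far  = mmd2(rng.standard_normal((50, 4)), rng.standard_normal((50, 4)) + 3.0)
print(same < far)  # shifted distributions give a larger MMD
```

Minimizing this quantity pulls the (attention-weighted) source features toward the target features, which is the feature-level domain migration the invention claims.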
8. The method of claim 1, characterized in that, in step 6, the process by which the classification convolutional layer extracts the prediction label probability of each character is identical to that of the feature convolutional layer, except that the ReLU nonlinearity f used by the feature convolutional layer is replaced with a softmax computation.
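Claim 8's only change relative to the feature convolutional layer is swapping ReLU for softmax, so each character receives a probability distribution over labels. A sketch, modelling the layer as a window-size-1 convolution (i.e., a per-character linear map) with an illustrative 4-label BMES output:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classify(features, weights, bias):
    """Per-character label probabilities: softmax(features @ weights + bias).

    features: (T, d) one feature vector per character
    weights:  (d, L) one column per label (e.g. B, M, E, S)
    bias:     (L,)
    """
    return softmax(features @ weights + bias)

rng = np.random.default_rng(2)
feats = rng.standard_normal((6, 8))        # 6 characters, d=8 features
probs = classify(feats, rng.standard_normal((8, 4)), np.zeros(4))
print(probs.shape)                         # (6, 4)
print(np.allclose(probs.sum(axis=1), 1))   # each row is a distribution
```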
9. The method of claim 1, characterized in that, in step 7, the likelihood probability is calculated as P(y* | S) = exp(score(S, y*)) / Σ_{y′} exp(score(S, y′)), wherein S and y* respectively denote the input sentence and its true label sequence, score denotes combining the prediction label probabilities of the characters through a transition matrix into the prediction score of the sentence, and y′ ranges over all possible label sequences of the sentence.
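Claim 9's sequence likelihood can be sketched with brute-force enumeration of all label sequences; this is feasible only for tiny examples (a real CRF computes the partition sum with the forward algorithm), and the scores below are illustrative:

```python
import itertools
import numpy as np

def seq_score(emissions, transitions, labels):
    """score(S, y): per-character scores plus transition scores."""
    s = sum(emissions[t, y] for t, y in enumerate(labels))
    s += sum(transitions[a, b] for a, b in zip(labels, labels[1:]))
    return s

def crf_log_likelihood(emissions, transitions, gold):
    """log P(y*|S) with the partition sum enumerated exhaustively."""
    T, L = emissions.shape
    scores = [seq_score(emissions, transitions, y)
              for y in itertools.product(range(L), repeat=T)]
    log_z = np.log(np.sum(np.exp(scores)))   # log partition function
    return seq_score(emissions, transitions, gold) - log_z

rng = np.random.default_rng(3)
emis = rng.standard_normal((3, 4))    # 3 characters, 4 labels
trans = rng.standard_normal((4, 4))
ll = crf_log_likelihood(emis, trans, gold=(0, 2, 3))
print(ll < 0)  # a log-probability over many competing sequences
```

Training maximizes this log-likelihood, i.e., minimizes its negative, which is the first term of the loss in claim 10.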
10. The method of claim 1, characterized in that, in step 8, the loss function is calculated as L = −log P(y*|S) + λ · MMD, wherein λ denotes the weight of the MMD regularization term.
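Claim 10's objective simply combines the two quantities computed above; a one-line sketch, where λ is a hyperparameter the patent leaves unspecified:

```python
def total_loss(log_likelihood, mmd_value, lam=0.1):
    """L = -log P(y*|S) + λ · MMD, the training objective of claim 10."""
    return -log_likelihood + lam * mmd_value

print(total_loss(-2.0, 0.5, lam=0.1))  # → 2.05
```

Back-propagating this scalar through the CRF, classification layer, attention module, and feature convolutional layers updates all network weights at once, as step 8 requires.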
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910487638.5A CN110196980B (en) | 2019-06-05 | 2019-06-05 | Domain migration on Chinese word segmentation task based on convolutional network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110196980A true CN110196980A (en) | 2019-09-03 |
CN110196980B CN110196980B (en) | 2020-08-04 |
Family
ID=67754062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910487638.5A Active CN110196980B (en) | 2019-06-05 | 2019-06-05 | Domain migration on Chinese word segmentation task based on convolutional network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110196980B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107967253A (en) * | 2017-10-27 | 2018-04-27 | 北京大学 | A kind of low-resource field segmenter training method and segmenting method based on transfer learning |
CN108197117A (en) * | 2018-01-31 | 2018-06-22 | 厦门大学 | A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN109753566A (en) * | 2019-01-09 | 2019-05-14 | 大连民族大学 | The model training method of cross-cutting sentiment analysis based on convolutional neural networks |
Non-Patent Citations (5)
Title |
---|
ZUYI BAO et al.: "Neural Domain Adaptation with Contextualized Character Embedding for Chinese Word Segmentation", Springer International *
ZUYI BAO et al.: "Neural Regularized Domain Adaptation for Chinese Word Segmentation", Proceedings of the 9th SIGHAN Workshop on Chinese Language Processing *
LIU Yude: "Research on Chinese Word Segmentation Methods Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *
SONG Peng et al.: "Cross-corpus speech emotion recognition based on feature transfer learning", Journal of Tsinghua University *
GAO Jun et al.: "A domain adaptation learning framework based on locally weighted means", Acta Automatica Sinica *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750974A (en) * | 2019-09-20 | 2020-02-04 | 成都星云律例科技有限责任公司 | Structured processing method and system for referee document |
CN110750974B (en) * | 2019-09-20 | 2023-04-25 | 成都星云律例科技有限责任公司 | Method and system for structured processing of referee document |
CN110765775A (en) * | 2019-11-01 | 2020-02-07 | 北京邮电大学 | Self-adaptive method for named entity recognition field fusing semantics and label differences |
CN110765775B (en) * | 2019-11-01 | 2020-08-04 | 北京邮电大学 | Self-adaptive method for named entity recognition field fusing semantics and label differences |
CN111127336B (en) * | 2019-11-18 | 2023-05-02 | 复旦大学 | Image signal processing method based on self-adaptive selection module |
CN111127336A (en) * | 2019-11-18 | 2020-05-08 | 复旦大学 | Image signal processing method based on self-adaptive selection module |
CN111008271B (en) * | 2019-11-20 | 2022-06-24 | 佰聆数据股份有限公司 | Neural network-based key information extraction method and system |
CN111008271A (en) * | 2019-11-20 | 2020-04-14 | 佰聆数据股份有限公司 | Neural network-based key information extraction method and system |
CN111178149A (en) * | 2019-12-09 | 2020-05-19 | 中国资源卫星应用中心 | Automatic remote sensing image water body extraction method based on residual pyramid network |
CN111178149B (en) * | 2019-12-09 | 2023-09-29 | 中国四维测绘技术有限公司 | Remote sensing image water body automatic extraction method based on residual pyramid network |
CN111091004B (en) * | 2019-12-18 | 2023-08-25 | 上海风秩科技有限公司 | Training method and training device for sentence entity annotation model and electronic equipment |
CN111091004A (en) * | 2019-12-18 | 2020-05-01 | 上海风秩科技有限公司 | Training method and training device for sentence entity labeling model and electronic equipment |
CN111767718A (en) * | 2020-07-03 | 2020-10-13 | 北京邮电大学 | Chinese grammar error correction method based on weakened grammar error feature representation |
CN111767718B (en) * | 2020-07-03 | 2021-12-07 | 北京邮电大学 | Chinese grammar error correction method based on weakened grammar error feature representation |
CN111984791A (en) * | 2020-09-02 | 2020-11-24 | 南京信息工程大学 | Long text classification method based on attention mechanism |
CN111984791B (en) * | 2020-09-02 | 2023-04-25 | 南京信息工程大学 | Attention mechanism-based long text classification method |
CN112580343A (en) * | 2020-11-03 | 2021-03-30 | 北京字节跳动网络技术有限公司 | Model generation method, question and answer quality judgment method, device, equipment and medium |
CN112415408A (en) * | 2020-11-10 | 2021-02-26 | 南昌济铃新能源科技有限责任公司 | Power battery SOC estimation method |
CN113076750B (en) * | 2021-04-26 | 2022-12-16 | 华南理工大学 | Cross-domain Chinese word segmentation system and method based on new word discovery |
CN113076750A (en) * | 2021-04-26 | 2021-07-06 | 华南理工大学 | Cross-domain Chinese word segmentation system and method based on new word discovery |
CN114429129A (en) * | 2021-12-22 | 2022-05-03 | 南京信息工程大学 | Literature mining and material property prediction method |
CN114580412A (en) * | 2021-12-29 | 2022-06-03 | 西安工程大学 | Clothing entity identification method based on field adaptation |
CN114580412B (en) * | 2021-12-29 | 2024-06-04 | 西安工程大学 | Clothing entity identification method based on field adaptation |
Also Published As
Publication number | Publication date |
---|---|
CN110196980B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110196980A (en) | A kind of field migration based on convolutional network in Chinese word segmentation task | |
Abid et al. | Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter | |
CN108763326B (en) | Emotion analysis model construction method of convolutional neural network based on feature diversification | |
CN107145483B (en) | A kind of adaptive Chinese word cutting method based on embedded expression | |
Zhang et al. | Neural networks incorporating dictionaries for Chinese word segmentation | |
Dong et al. | Character-based LSTM-CRF with radical-level features for Chinese named entity recognition | |
Manoharan | Capsule network algorithm for performance optimization of text classification | |
Prusa et al. | Improving deep neural network design with new text data representations | |
CN109766524B (en) | Method and system for extracting combined purchasing recombination type notice information | |
CN110008338B (en) | E-commerce evaluation emotion analysis method integrating GAN and transfer learning | |
CN110765775B (en) | Self-adaptive method for named entity recognition field fusing semantics and label differences | |
US20240177047A1 | Knowledge graph pre-training method based on structural context information |
CN108628823A (en) | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training | |
CN107729309A (en) | A kind of method and device of the Chinese semantic analysis based on deep learning | |
Zhuang et al. | Natural language processing service based on stroke-level convolutional networks for Chinese text classification | |
CN110263325A (en) | Chinese automatic word-cut | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
Boudad et al. | Exploring the use of word embedding and deep learning in arabic sentiment analysis | |
Naqvi et al. | Roman Urdu news headline classification empowered with machine learning | |
Su et al. | Low‐Rank Deep Convolutional Neural Network for Multitask Learning | |
Liu et al. | Research on advertising content recognition based on convolutional neural network and recurrent neural network | |
Huang et al. | Multi-view opinion mining with deep learning | |
Wang et al. | Joint Character‐Level Convolutional and Generative Adversarial Networks for Text Classification | |
Hu et al. | Scalable frame resolution for efficient continuous sign language recognition | |
Wu et al. | Conditional consistency regularization for semi-supervised multi-label image classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||