CN110196980A - Domain transfer in Chinese word segmentation based on convolutional networks - Google Patents

Domain transfer in Chinese word segmentation based on convolutional networks

Info

Publication number
CN110196980A
CN110196980A (application CN201910487638.5A)
Authority
CN
China
Prior art keywords
character
input
vector
source domain
convolutional layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910487638.5A
Other languages
Chinese (zh)
Other versions
CN110196980B (en)
Inventor
李思
李明正
孙忆南
徐雅静
陈光
王蓬辉
周欣雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910487638.5A
Publication of CN110196980A
Application granted
Publication of CN110196980B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a domain transfer method for the Chinese word segmentation task based on convolutional networks. Building on a convolutional neural network for Chinese word segmentation, it introduces an attention mechanism into Maximum Mean Discrepancy (MMD), a method traditionally used to measure the distributional difference between domains. During training, the attention mechanism picks out the sentence information most helpful to the domain transfer task, so that the maximum mean discrepancy method can be carried over effectively into a sequence labeling task. At the same time, while computing the maximum mean discrepancy, sentences capable of positive transfer are given larger weights, while sentences that are unhelpful or even harmful to transfer are given very small weights. This achieves more efficient domain transfer, reduces manual corpus annotation, and relieves the labor and time pressure that annotation imposes on natural language processing (NLP) research.

Description

Domain transfer in Chinese word segmentation based on convolutional networks
Technical field
The present invention relates to the field of Internet technology, and in particular to domain transfer in the Chinese word segmentation task based on convolutional networks.
Background technique
With the development of computer technology, computing power has steadily increased and machine learning and deep learning have advanced accordingly. Natural language processing is gradually being applied to many scenarios: text classification is used to mine user preferences from movie reviews and shopping reviews, summarization is used to condense articles such as news, and machine translation enables simultaneous interpretation. These application scenarios demand mature techniques, and as the number of Internet users in China grows, the amount of text produced keeps increasing. For such massive data, automatic text processing becomes ever more significant. Because natural language processing offers irreplaceable, highly efficient text processing, it has attracted broad attention from society. Within China, Chinese-language processing is closely bound up with everyday life, and Chinese word segmentation, as a foundational task of natural language processing, is crucial to the development of the other Chinese language processing tasks built on top of it.
The Chinese word segmentation task splits a Chinese sentence or paragraph into words, so that higher-level natural language processing tasks on Chinese can exploit the extra information carried by words to improve performance. Segmentation is necessary because in modern Chinese a word with a specific meaning is usually written with two or more characters and cannot be understood from individual characters alone; moreover, the same character often takes different meanings in different words. Therefore, before other natural language processing tasks can be performed on Chinese text, segmentation is required. Relatively low-level tasks such as part-of-speech tagging and named entity recognition depend on segmentation especially heavily, and the accuracy of Chinese word segmentation directly affects the performance of these tasks.
Chinese word segmentation applies an algorithm that lets a computer automatically process Chinese text and separate it into words. Traditional methods include forward maximum matching, reverse maximum matching, segmentation algorithms with probabilistic disambiguation, conditional random fields, structured perceptrons, and maximum entropy models. Among the deep learning methods developed in recent years, feed-forward networks, long short-term memory networks, and convolutional neural networks have all been applied to Chinese word segmentation and achieve high accuracy on several large corpora.
Neural network methods require large amounts of labeled data. However, the existing large-scale corpora cover only the news domain; there is almost no large labeled corpus for the patent, literature, or medical domains, which makes it hard for existing neural techniques to reach high accuracy in those domains. In recent years, domain transfer methods have therefore been applied to Chinese word segmentation, aiming to use existing large labeled corpora to raise segmentation accuracy in domains with no labeled corpus or only a small one. In domain transfer, the large labeled corpus is called the source domain data, and the unlabeled or scarcely labeled corpus is called the target domain data. Transfer that uses unlabeled target domain data is called unsupervised domain transfer; transfer that uses a small amount of labeled target domain data is called semi-supervised domain transfer.
Among the domain transfer techniques currently used for Chinese word segmentation, one family of methods is dictionary-based, using trained word and character vectors to achieve transfer; another family models transferable information directly by modifying the model, extracting transferable feature information from the large labeled corpus to realize domain transfer.
As shown in Figure 1, the first piece of prior art, the article "Learning Transferable Features with Deep Adaptation Networks", proposes solving the domain transfer problem in image classification with Deep Adaptation Networks:
First, the network is pre-trained on a large image dataset. Second, the network is fine-tuned with source domain labeled data, or with source domain labeled data plus a small amount of target domain labeled data. During fine-tuning, the first three convolutional layers are frozen as extractors of general features, the last two convolutional layers, which extract domain-specific features, are fine-tuned, and the last three fully connected layers are adapted through MK-MMD (Multi-Kernel Maximum Mean Discrepancy).
As shown in Fig. 2, the second piece of prior art, the article "Neural Aggregation Network for Video Face Recognition", proposes a scheme that uses an attention mechanism to solve face recognition in video:
First, each frame of the video is passed through a convolutional neural network (GoogLeNet) to obtain a 128-dimensional face feature per frame. Second, these features are fed into the first attention module, which contains a learnable parameter q; the aggregated feature it outputs is computed as follows:
e_k = q^T f_k    formula (2-1)
where f_k is the CNN feature extracted from frame k, q is the learnable kernel, and e_k is the unnormalized weight of frame k. The weights are then normalized:
a_k = exp(e_k) / Σ_j exp(e_j)    formula (2-2)
where a_k is the normalized weight of frame k.
r = Σ_k a_k f_k    formula (2-3)
where r, which is independent of the frame order, is the aggregated feature weighted by the attention kernel.
Third, the aggregated feature of the first attention module is passed straight into a second attention module for further aggregation; the learnable kernel of the second module is computed as follows:
q^1 = tanh(W r^0 + b)    formula (2-4)
where W is a weight matrix and b a bias term, both learnable, tanh is the hyperbolic tangent nonlinearity, r^0 is the output of the first attention module, and q^1 is the kernel of the second attention module; the aggregated feature r^1 is then computed exactly as in formulas (2-1), (2-2) and (2-3).
Finally, the network is trained with an average contrastive loss function to carry out the recognition task.
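For concreteness, the following is a minimal PyTorch sketch of the two-module attention aggregation of formulas (2-1) to (2-4); the frame count, feature dimension, and random initialization are illustrative assumptions, not the cited paper's exact configuration.

import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Aggregates per-frame features with the kernels of formulas (2-1)-(2-4)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.q = nn.Parameter(torch.randn(feat_dim))   # learnable kernel q
        self.W = nn.Linear(feat_dim, feat_dim)         # W and b of formula (2-4)

    def aggregate(self, f, q):
        e = f @ q                       # e_k = q^T f_k, formula (2-1)
        a = torch.softmax(e, dim=0)     # a_k, formula (2-2)
        return (a.unsqueeze(1) * f).sum(dim=0)   # r = sum_k a_k f_k, formula (2-3)

    def forward(self, f):               # f: (num_frames, feat_dim)
        r0 = self.aggregate(f, self.q)             # first attention module
        q1 = torch.tanh(self.W(r0))                # q1 = tanh(W r0 + b), formula (2-4)
        return self.aggregate(f, q1)               # second module reuses (2-1)-(2-3)

frames = torch.randn(32, 128)           # 32 frames of 128-d per-frame features
print(AttentionAggregation()(frames).shape)   # torch.Size([128])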
During research, the inventors found the following about the prior art "Learning Transferable Features with Deep Adaptation Networks" and "Neural Aggregation Network for Video Face Recognition":
1. Both target image recognition tasks; they do not fit the sequence labeling tasks of natural language processing, and in particular do not fit the Chinese word segmentation task;
2. Both consider only the traditional MMD method and ignore that, while computing MMD, the source domain may contain samples that are irrelevant or even produce a negative effect;
Because of the above technical problems, the prior art suffers the following drawbacks:
1. The traditional MMD method performs poorly when applied directly to the Chinese word segmentation task;
2. Since source domain samples differ in relevance, it is difficult, while computing MMD, to draw only on the samples that benefit the target domain.
Summary of the invention
To solve the above technical problems, the present invention provides a domain transfer method for the Chinese word segmentation task based on convolutional networks. By improving the traditional MMD method, it can be applied to sequence labeling tasks in natural language processing. At the same time, by adding an attention module, the model can adaptively select, during the MMD computation, the source domain samples that benefit transfer to the target data and suppress noise, improving the effect of domain transfer on the Chinese word segmentation task.
The present invention provides a domain transfer method for the Chinese word segmentation task based on convolutional networks. When training with unlabeled target domain corpus data, the method comprises:
Step 1: divide the corpus into source domain data and target domain data, and pad the domain containing fewer sentences by cycling through it until its sentence count equals that of the other domain;
Step 2: map the Chinese characters of the source domain and the target domain to vector representations using the same dictionary; the text to be segmented is thereby numericalized into a matrix formed by concatenating the character vectors;
Step 3: feed the numerical matrix into the feature convolutional layers to extract the source domain feature representation and the target domain feature representation;
Step 4: feed the extracted source domain feature representation into the attention module to compute a weight vector; the dot product of the weight vector with the extracted source domain features yields the weighted source domain feature representation;
Step 5: use the weighted source domain features and the extracted target domain features as the two inputs of the MMD computation to obtain the MMD result;
Step 6: feed the extracted source domain features into the classification convolutional layer to obtain the predicted label probability of each character;
Step 7: feed the label probabilities of each character and the true labels into a conditional random field (CRF) to compute the likelihood;
Step 8: the loss function is the weighted sum of the negative log-likelihood and the MMD result, with the MMD result acting as a regularization term; compute and update the weights of every network layer through the back-propagation (BP) algorithm.
Further, when the training data contain a small amount of labeled target domain corpus data, step 6 is replaced as follows:
Step 6: feed the extracted source domain features, together with the small amount of target domain features that carry true labels, into the classification convolutional layer to obtain the predicted label probability of each character.
Further, outside training, when segmenting Chinese text, steps 1 to 8 are replaced as follows:
Step 1: take the target domain data to be segmented as the input of the neural network;
Step 2: map the Chinese characters of the target domain data to vector representations using the same dictionary as in training;
Step 3: feed the vector representations into the feature convolutional layers to extract feature representations;
Step 4: feed the feature representations into the classification convolutional layer to obtain the predicted label probability of each character;
Step 5: decode the predicted label probabilities of each character with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation.
Further, in step 2, mapping the Chinese characters of the source and target domains to vector representations using the same dictionary comprises:
a randomly initialized mapping dictionary, which uses word embedding to randomly initialize identical characters to the same dense vector and then maps each Chinese character of the corpus data to its dense vector representation through the dictionary; or
a pretrained mapping dictionary, which uses the Skip-Gram or Continuous Bag-of-Words (CBOW) model to train vectors that carry lexical information, and then maps each Chinese character of the corpus data to its dense vector representation through the dictionary.
Further, in step 3, the numerical matrix is fed into a feature convolutional layer to extract feature representations, computed as follows:
y = f(m ⊛ x + b)
where m ∈ R^(d×w) is a convolution kernel of window size w, d equals the number of rows of the input matrix x, ⊛ denotes the convolution operation, x is the numerical matrix or the output of the previous feature convolutional layer, b is a bias term, f is the rectified linear unit (ReLU), and y is a vector of dimension n, the feature extracted by the feature convolutional layer;
where the ReLU function is:
f(x) = max(0, x)
applied element-wise when the input is a vector or a matrix, x being an element of the vector or matrix.
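As an illustration, here is a minimal PyTorch sketch of one feature convolutional layer; the sizes (d = 200, w = 3, sentence length 128, batch 16) follow the embodiment below, and the padding choice that preserves the sentence length is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, w, length = 200, 3, 128                   # feature dimension, window size, fixed sentence length
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=w, padding=w // 2)

x = torch.randn(16, d, length)               # numerical matrices: one character vector per column
y = F.relu(conv(x))                          # y = f(m ⊛ x + b) with f = ReLU
print(y.shape)                               # torch.Size([16, 200, 128])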
Further, in step 4, the extracted source domain feature representation is fed into the attention module to compute a weight vector, and the dot product of the weight vector with the extracted source domain features yields the weighted source domain feature representation, computed as follows:
ŷ = g(k ⊙ y) ⊙ y
where k ∈ R^(i×l×d) is a weight matrix, i is the number of sentences fed into the neural network, l is the fixed length of the input sentences, d is the dimension of the feature representation, ⊙ denotes the dot product, y is the concatenation of the i feature representations extracted by the feature convolutional layers, g first averages the second and third dimensions of the dot-product result and then applies a softmax, and ŷ is the weighted source domain feature representation;
where the softmax is computed as follows:
softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
where x is a vector and x_i is its i-th element.
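A minimal sketch of this sentence-level attention weighting, assuming i = 16 sentences, l = 128 characters, and d = 200 feature dimensions as in the embodiment; broadcasting the per-sentence weights over l and d is an implementation assumption.

import torch
import torch.nn as nn

i, l, d = 16, 128, 200                       # sentences per batch, sentence length, feature dimension
k = nn.Parameter(torch.randn(i, l, d))       # learnable weight matrix k

y = torch.randn(i, l, d)                     # source domain features from the feature conv layers
scores = (k * y).mean(dim=(1, 2))            # g: average the 2nd and 3rd dimensions of the dot product
a = torch.softmax(scores, dim=0)             # per-sentence weight vector
y_weighted = a.view(i, 1, 1) * y             # weighted source domain feature representation ŷ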
Further, in step 5, the MMD is computed as follows:
MMD²(p, q) = (1/n_s²) Σ_{i,j} k(x_i^s, x_j^s) + (1/n_s²) Σ_{i,j} k(x_i^t, x_j^t) − (2/n_s²) Σ_{i,j} k(x_i^s, x_j^t)
where MMD² denotes the MMD result, p and q are the distributions of the two domains' data, n_s denotes the total number of source domain inputs (each batch contains equally many target domain inputs), k(·, ·) denotes the Gaussian kernel function, and x^s and x^t denote the weighted source domain features ŷ and the target domain features y respectively;
where the Gaussian kernel is computed as follows:
k(x, z) = exp(−‖x − z‖² / (2σ²))
where x and z are the two inputs of the kernel and σ is the kernel bandwidth.
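A minimal sketch of the squared-MMD estimate with a Gaussian kernel; flattening each sentence to a single feature vector before the kernel, and the bandwidth value, are illustrative assumptions.

import torch

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for every pair of rows
    return torch.exp(-torch.cdist(x, z) ** 2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    # biased estimate: mean k(s, s) + mean k(t, t) - 2 mean k(s, t)
    return (gaussian_kernel(xs, xs, sigma).mean()
            + gaussian_kernel(xt, xt, sigma).mean()
            - 2 * gaussian_kernel(xs, xt, sigma).mean())

xs = torch.randn(16, 200)                    # weighted source features, one vector per sentence
xt = torch.randn(16, 200)                    # target domain features
print(mmd2(xs, xt))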
Further, in step 6, the classification convolutional layer computes the predicted label probability of each character in the same way as the feature convolutional layers, except that the ReLU nonlinearity f used by the feature convolutional layers is replaced with a softmax.
Further, in step 7, the likelihood is computed as follows:
p(y* | S) = exp(score(S, y*)) / Σ_{y'} exp(score(S, y'))
where S and y* denote the input sentence and its true label sequence respectively, score combines the predicted label probabilities of the individual characters through a transition matrix into the predicted probability of a whole sentence labeling, and y' ranges over all possible sentence labelings;
where the score function is computed as follows:
score(S, y) = Σ_{i=1..n} (A_{y_{i−1}, y_i} + s(y_i))
where A_{ij} is the transition matrix, s(·) is the label probability the neural network predicts for a single character, and n is the sequence length.
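A minimal sketch of the sentence score and the CRF negative log-likelihood; treating the per-character outputs as emission scores and computing the partition function with the standard forward recursion are implementation assumptions.

import torch

def score(emissions, tags, transitions):
    # score(S, y) = sum_i A[y_{i-1}, y_i] + s(y_i)
    s = emissions[torch.arange(len(tags)), tags].sum()
    return s + transitions[tags[:-1], tags[1:]].sum()

def neg_log_likelihood(emissions, tags, transitions):
    # -log p(y* | S); the partition sums over all label sequences via the forward algorithm
    alpha = emissions[0]
    for emit in emissions[1:]:
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emit
    log_z = torch.logsumexp(alpha, dim=0)
    return log_z - score(emissions, tags, transitions)

emissions = torch.randn(10, 4)               # per-character scores for the 4 labels {B, M, E, S}
tags = torch.randint(0, 4, (10,))            # a true label sequence
transitions = torch.randn(4, 4)              # transition matrix A
print(neg_log_likelihood(emissions, tags, transitions))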
Further, in step 8, the loss function is computed as follows:
L = −log p(y* | S) + λ · MMD²
where λ denotes the weight of the MMD regularization term.
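The combination itself is a one-liner; the sketch below only shows how the two terms join and back-propagate, with stand-in values and an assumed λ = 0.1.

import torch

nll = torch.tensor(2.3, requires_grad=True)  # stand-in negative log-likelihood from the CRF
mmd_result = torch.tensor(0.4)               # stand-in MMD regularization term
lam = 0.1                                    # assumed regularization weight λ
loss = nll + lam * mmd_result                # L = -log p(y*|S) + λ · MMD²
loss.backward()                              # BP then updates every layer's weights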
The domain transfer method for the Chinese word segmentation task based on convolutional networks provided by the present invention applies the traditional MMD method to a sequence labeling task, realizing feature-level domain transfer within Chinese word segmentation and broadening the use of domain transfer methods in the sequence labeling tasks of natural language processing. By adding an attention module, the model can adaptively select the source domain samples used in the MMD computation and suppress noise, achieving more efficient feature-level domain transfer. By exploiting existing large-scale labeled data, it raises Chinese word segmentation accuracy on small corpora and relieves the pressure of manual corpus annotation.
Detailed description of the invention
Fig. 1 is a schematic diagram of the Deep Adaptation Network that addresses image recognition tasks;
Fig. 2 is a schematic diagram of the Neural Aggregation Network;
Fig. 3 is the flowchart of embodiment one;
Fig. 4 is the flowchart of the domain transfer method for the Chinese word segmentation task based on convolutional networks provided by the present invention.
Specific embodiment
To help those skilled in the art better understand the present solution, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Plainly, the described embodiments are only a part of the embodiments of the invention, not all of them; every other embodiment obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention. The abbreviations and key terms appearing in this embodiment are defined as follows:
BP: Back Propagation;
CBOW: Continuous Bag-of-Words;
CNN: Convolutional Neural Network;
CRF: Conditional Random Field;
CTB: Chinese Treebank (the Penn Chinese Treebank);
LSTM: Long Short-Term Memory neural network;
ME: Maximum Entropy model;
MMD: Maximum Mean Discrepancy;
MK-MMD: Multi-Kernel Maximum Mean Discrepancy;
NLP: Natural Language Processing;
NN: Neural Network;
PKU: the Peking University corpus;
ReLU: Rectified Linear Unit, an activation function.
Embodiment one
Referring to Figs. 3 and 4, which show the domain transfer method for the Chinese word segmentation task based on convolutional networks provided by the present invention. Specifically, when training with unlabeled target domain corpus data, the method comprises:
Step 1: divide the corpus into source domain data and target domain data, and pad the domain containing fewer sentences by cycling through it until its sentence count equals that of the other domain;
In this embodiment the maximum sentence length is set to 128; the large-scale labeled source domain corpora are PKU, CTB5 and CTB7; the target domain corpora are the patent, Zhu Xian (a fantasy novel), and medical forum corpora. In each training batch, half of the sentences are source domain data and the other half are target domain data.
Step 2: map the Chinese characters of the source domain and the target domain to vector representations using the same dictionary; the text to be segmented is thereby numericalized into a matrix formed by concatenating the character vectors;
Further, the pretrained mapping dictionary uses the Skip-Gram or Continuous Bag-of-Words (CBOW) model to train vectors that carry lexical information and maps each Chinese character of the corpus data to a dense vector representation, comprising:
pre-training the character vectors on the large body of Wikipedia text; building the mapping dictionary by finding every unique character and numbering it, so that identical characters share one vector while different characters have different vectors, with one extra vector representing every character that never appears in the training corpus, i.e. unknown characters; and introducing a dropout mechanism during network training that randomly zeroes a portion of the parameters.
In this embodiment, Skip-Gram is used to pre-train the character vectors, and the mapping dimension of each character vector is set to 200. This step maps each character to a dense (non-sparse) vector: the training corpus is first traversed to find all unique characters and number each one; assuming M characters in total, a matrix of 200 rows (the embedding dimension) and M+1 columns is built, in which identical characters share the same vector and different characters have different vectors; beyond the M characters, one additional vector stands for all characters not seen in the training corpus, i.e. unknown characters. In this step the invention introduces the dropout mechanism, randomly zeroing a portion of the parameters during training, which avoids overfitting and in effect provides an inexpensive way of combining many different neuronal structures.
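A minimal sketch of the dictionary construction, embedding lookup, and dropout described above; the toy corpus, the dropout rate, and applying dropout to the embedded activations are illustrative assumptions.

import torch
import torch.nn as nn

corpus = ["北京大学生前来应聘", "研究生命科学"]       # toy training corpus
chars = sorted(set("".join(corpus)))                  # every unique character
char2id = {c: i for i, c in enumerate(chars)}         # number each character
UNK = len(chars)                                      # extra index for unseen characters

embed = nn.Embedding(len(chars) + 1, 200)             # M+1 dense vectors of dimension 200
drop = nn.Dropout(p=0.2)                              # randomly zeroes a portion of activations

ids = torch.tensor([char2id.get(c, UNK) for c in "大学生物"])
vectors = drop(embed(ids))                            # one dense 200-d vector per character
print(vectors.shape)                                  # torch.Size([4, 200])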
Step 3: feed the numerical matrix into the feature convolutional layers to extract the source domain feature representation and the target domain feature representation;
Further, in step 3 the features are extracted as y = f(m ⊛ x + b), where m ∈ R^(d×w) is a convolution kernel of window size w, d equals the number of rows of the input matrix x, ⊛ denotes the convolution operation, x is the numerical matrix or the output of the previous feature convolutional layer, b is a bias term, f is the rectified linear unit (ReLU), and y is a vector of dimension n, the feature extracted by the feature convolutional layer;
In this embodiment, the dimension of each character feature extracted by the feature convolutional layers is 200, the convolution kernel size is set to 3, and the number of feature convolutional layers is set to 4; the source domain and target domain feature representations are computed by shared convolutional layers.
Step 4: feed the extracted source domain feature representation into the attention module to compute a weight vector; the dot product of the weight vector with the extracted source domain features yields the weighted source domain feature representation;
Further, in step 4 the weighting is computed as ŷ = g(k ⊙ y) ⊙ y, where k ∈ R^(i×l×d) is a weight matrix, i is the number of sentences fed into the neural network, l is the fixed length of the input sentences, d is the dimension of the feature representation, ⊙ denotes the dot product, y is the concatenation of the i feature representations extracted by the feature convolutional layers, g first averages the second and third dimensions of the dot-product result and then applies a softmax, and ŷ is the weighted source domain feature representation;
In this embodiment, the number of sentences fed into the neural network is set to 16.
Step 5: use the weighted source domain features and the extracted target domain features as the two inputs of the MMD computation to obtain the MMD result;
Further, in step 5 the MMD is computed as in the formula given above, where MMD² denotes the MMD result, p and q are the distributions of the two domains' data, n_s denotes the total number of source domain inputs, k(·, ·) denotes the Gaussian kernel function, and x^s and x^t denote the weighted source domain features ŷ and the target domain features y respectively.
Step 6: feed the extracted source domain features into the classification convolutional layer to obtain the predicted label probability of each character;
Further, in step 6 the classification convolutional layer computes the predicted label probability of each character in the same way as the feature convolutional layers, except that the ReLU nonlinearity f is replaced with a softmax;
In this embodiment each character takes one of four labels {B, M, E, S}, where B marks the first character of a word, M a middle character, E the last character, and S a single-character word; the classification convolutional layer therefore outputs a 4-dimensional feature per character.
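To make the labeling scheme concrete, here is a small sketch that derives the {B, M, E, S} character labels from a segmented sentence; the helper name and example sentence are illustrative.

def bmes_tags(words):
    # derive per-character labels from a word-segmented sentence
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                          # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

print(bmes_tags(["北京", "邮电", "大学"]))            # ['B', 'E', 'B', 'E', 'B', 'E']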
Step 7: feed the label probabilities of each character and the true labels into a conditional random field (CRF) to compute the likelihood;
Further, in step 7 the likelihood is computed as p(y* | S) = exp(score(S, y*)) / Σ_{y'} exp(score(S, y')), where S and y* denote the input sentence and its true label sequence respectively, score combines the predicted label probabilities of the individual characters through a transition matrix into the predicted probability of a whole sentence labeling, and y' ranges over all possible sentence labelings.
Step 8: the loss function is the weighted sum of the negative log-likelihood and the MMD result, with the MMD result acting as a regularization term; compute and update the weights of every network layer through the back-propagation (BP) algorithm;
Further, in step 8 the loss function is computed as L = −log p(y* | S) + λ · MMD², where λ denotes the weight of the MMD regularization term.
Further, when the training data contain a small amount of labeled target domain corpus data, step 6 is replaced as follows:
Step 6: feed the extracted source domain features, together with the small amount of target domain features that carry true labels, into the classification convolutional layer to obtain the predicted label probability of each character.
Further, outside training, when segmenting Chinese text, steps 1 to 8 are replaced as follows:
Step 1: take the target domain data to be segmented as the input of the neural network;
Step 2: map the Chinese characters of the target domain data to vector representations using the same dictionary as in training;
Step 3: feed the vector representations into the feature convolutional layers to extract feature representations;
Step 4: feed the feature representations into the classification convolutional layer to obtain the predicted label probability of each character;
Step 5: decode the predicted label probabilities of each character with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation, as sketched below.
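A minimal sketch of Viterbi decoding over the per-character label scores; treating the classifier outputs as emission scores and the 4-tag setting are assumptions carried over from the embodiment.

import torch

def viterbi(emissions, transitions):
    # returns the highest-scoring label sequence for one sentence
    score = emissions[0]                              # (num_tags,)
    backpointers = []
    for emit in emissions[1:]:
        total = score.unsqueeze(1) + transitions      # previous tag -> next tag
        best, idx = total.max(dim=0)
        backpointers.append(idx)
        score = best + emit
    path = [int(score.argmax())]
    for idx in reversed(backpointers):                # walk the back-pointers
        path.append(int(idx[path[-1]]))
    return list(reversed(path))

emissions = torch.randn(6, 4)                         # predicted scores for {B, M, E, S}
transitions = torch.randn(4, 4)                       # transition matrix A
print(viterbi(emissions, transitions))                # e.g. [0, 2, 3, 0, 1, 2]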
In a preferred embodiment, as shown in Fig. 3, each character of a sentence is first mapped to a dense vector of dimension n, and convolution extracts the feature of each character in the sentence. During training, the features extracted by convolution are split into a source domain part and a target domain part; the source domain features pass through the attention module, which computes a weight for each sentence; the source domain features multiplied by these weights form the first input of the MMD computation, and the target domain features form the second input; the MMD module computes the MMD value, which, multiplied by the regularization weight, serves as the regularization term of the loss function. The source domain features also pass through the classification convolutional layer to obtain the prediction probability of every label for every character, from which the CRF computes the log-likelihood; the negative log-likelihood plus the MMD regularization term gives the final loss function. The model updates its parameters through the BP algorithm with the goal of minimizing this loss. Outside training, the per-character label probabilities produced by the classification convolutional layer go directly through the Viterbi algorithm, which computes the final label sequence and completes the segmentation.
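Tying the pieces together, the following condensed, self-contained training step mirrors the flow of Fig. 3 under assumed toy sizes; to keep the sketch short, the CRF is replaced here by a plain per-character cross-entropy, which is a simplification, not the patented method.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, l, i, tags, lam = 32, 20, 8, 4, 0.1               # toy sizes and an assumed λ

embed = nn.Embedding(1000, d)
feature_conv = nn.Conv1d(d, d, 3, padding=1)         # shared feature convolutional layer
class_conv = nn.Conv1d(d, tags, 3, padding=1)        # classification convolutional layer
k = nn.Parameter(torch.randn(i, l, d))               # attention weight matrix
params = list(embed.parameters()) + list(feature_conv.parameters()) \
         + list(class_conv.parameters()) + [k]
opt = torch.optim.SGD(params, lr=0.1)

src = torch.randint(0, 1000, (i, l))                 # source domain character ids
tgt = torch.randint(0, 1000, (i, l))                 # target domain character ids
gold = torch.randint(0, tags, (i, l))                # source domain gold labels

def features(batch):                                 # embedding + feature convolution
    return F.relu(feature_conv(embed(batch).transpose(1, 2))).transpose(1, 2)

def mmd2(x, z):                                      # Gaussian-kernel squared MMD
    kf = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / 2).mean()
    return kf(x, x) + kf(z, z) - 2 * kf(x, z)

ys, yt = features(src), features(tgt)
a = torch.softmax((k * ys).mean(dim=(1, 2)), dim=0)  # per-sentence attention weights
ys_w = a.view(i, 1, 1) * ys                          # weighted source features

logits = class_conv(ys.transpose(1, 2)).transpose(1, 2)
nll = F.cross_entropy(logits.reshape(-1, tags), gold.reshape(-1))
loss = nll + lam * mmd2(ys_w.mean(dim=1), yt.mean(dim=1))
loss.backward(); opt.step(); opt.zero_grad()         # BP updates every layer's weights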
By using the traditional MMD method, embodiment one of the present invention realizes feature-level domain transfer within the Chinese word segmentation task and extends the domain transfer methods available for Chinese word segmentation. Introducing the attention module into the MMD computation lets the model autonomously select the source domain samples that benefit the target domain data and suppress the noise in the transfer process, achieving more efficient domain transfer, relieving the pressure of large-scale corpus annotation, and improving the accuracy of domain transfer on small labeled datasets.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it; any change or replacement that a person familiar with the art can easily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A domain transfer method for the Chinese word segmentation task based on convolutional networks, characterized in that, when training with unlabeled target domain corpus data, the method comprises:
Step 1: dividing the corpus into source domain data and target domain data, and padding the domain containing fewer sentences by cycling through it until its sentence count equals that of the other domain;
Step 2: mapping the Chinese characters of the source domain and the target domain to vector representations using the same dictionary, the text to be segmented thereby being numericalized into a matrix formed by concatenating the character vectors;
Step 3: feeding the numerical matrix into the feature convolutional layers to extract the source domain feature representation and the target domain feature representation;
Step 4: feeding the extracted source domain feature representation into the attention module to compute a weight vector, the dot product of the weight vector with the extracted source domain features yielding the weighted source domain feature representation;
Step 5: using the weighted source domain features and the extracted target domain features as the two inputs of the MMD computation to obtain the MMD result;
Step 6: feeding the extracted source domain features into the classification convolutional layer to obtain the predicted label probability of each character;
Step 7: feeding the label probabilities of each character and the true labels into a conditional random field (CRF) to compute the likelihood;
Step 8: taking as the loss function the weighted sum of the negative log-likelihood and the MMD result, the MMD result acting as a regularization term, and computing and updating the weights of every network layer through the back-propagation (BP) algorithm.
2. The method of claim 1, characterized in that, when the training data contain a small amount of labeled target domain corpus data, step 6 is replaced as follows:
Step 6: feeding the extracted source domain features, together with the small amount of target domain features carrying true labels, into the classification convolutional layer to obtain the predicted label probability of each character.
3. The method of claim 1, characterized in that, outside training, when segmenting Chinese text, steps 1 to 8 are replaced as follows:
Step 1: taking the target domain data to be segmented as the input of the neural network;
Step 2: mapping the Chinese characters of the target domain data to vector representations using the same dictionary as in training;
Step 3: feeding the vector representations into the feature convolutional layers to extract feature representations;
Step 4: feeding the feature representations into the classification convolutional layer to obtain the predicted label probability of each character;
Step 5: decoding the predicted label probabilities of each character with the Viterbi algorithm to obtain the optimal label sequence and complete the segmentation.
4. The method of claim 1, characterized in that, in step 2, mapping the Chinese characters of the source and target domains to vector representations using the same dictionary comprises:
a randomly initialized mapping dictionary, which uses word embedding to randomly initialize identical characters to the same dense vector and then maps each Chinese character of the corpus data to its dense vector representation through the dictionary; or
a pretrained mapping dictionary, which uses the Skip-Gram or Continuous Bag-of-Words (CBOW) model to train vectors carrying lexical information and then maps each Chinese character of the corpus data to its dense vector representation through the dictionary.
5. The method of claim 1, characterized in that, in step 3, the numerical matrix is fed into a feature convolutional layer to extract feature representations, computed as follows:
y = f(m ⊛ x + b)
where m ∈ R^(d×w) is a convolution kernel of window size w, d equals the number of rows of the input matrix x, ⊛ denotes the convolution operation, x is the numerical matrix or the output of the previous feature convolutional layer, b is a bias term, f is the rectified linear unit (ReLU), and y is a vector of dimension n, the feature extracted by the feature convolutional layer.
6. The method of claim 1, characterized in that, in step 4, the extracted source domain feature representation is fed into the attention module to compute a weight vector, and the dot product of the weight vector with the extracted source domain features yields the weighted source domain feature representation, computed as follows:
ŷ = g(k ⊙ y) ⊙ y
where k ∈ R^(i×l×d) is a weight matrix, i is the number of sentences fed into the neural network, l is the fixed length of the input sentences, d is the dimension of the feature representation, ⊙ denotes the dot product, y is the concatenation of the i feature representations extracted by the feature convolutional layers, g first averages the second and third dimensions of the dot-product result and then applies a softmax, and ŷ is the weighted source domain feature representation.
7. The method of claim 1, characterized in that, in step 5, the MMD is computed as follows:
MMD²(p, q) = (1/n_s²) Σ_{i,j} k(x_i^s, x_j^s) + (1/n_s²) Σ_{i,j} k(x_i^t, x_j^t) − (2/n_s²) Σ_{i,j} k(x_i^s, x_j^t)
where MMD² denotes the MMD result, p and q are the distributions of the two domains' data, n_s denotes the total number of source domain inputs, k(·, ·) denotes the Gaussian kernel function, and x^s and x^t denote the weighted source domain features ŷ and the target domain features y respectively.
8. The method of claim 1, characterized in that, in step 6, the classification convolutional layer computes the predicted label probability of each character in the same way as the feature convolutional layers, except that the ReLU nonlinearity f used by the feature convolutional layers is replaced with a softmax.
9. The method of claim 1, characterized in that, in step 7, the likelihood is computed as follows:
p(y* | S) = exp(score(S, y*)) / Σ_{y'} exp(score(S, y'))
where S and y* denote the input sentence and its true label sequence respectively, score combines the predicted label probabilities of the individual characters through a transition matrix into the predicted probability of a whole sentence labeling, and y' ranges over all possible sentence labelings.
10. The method of claim 1, characterized in that, in step 8, the loss function is computed as follows:
L = −log p(y* | S) + λ · MMD²
where λ denotes the weight of the MMD regularization term.
CN201910487638.5A 2019-06-05 2019-06-05 Domain transfer in Chinese word segmentation based on convolutional networks Active CN110196980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910487638.5A CN110196980B (en) 2019-06-05 2019-06-05 Domain transfer in Chinese word segmentation based on convolutional networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910487638.5A CN110196980B (en) 2019-06-05 2019-06-05 Domain transfer in Chinese word segmentation based on convolutional networks

Publications (2)

Publication Number Publication Date
CN110196980A (en) 2019-09-03
CN110196980B CN110196980B (en) 2020-08-04

Family

ID=67754062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910487638.5A Active CN110196980B (en) 2019-06-05 2019-06-05 Domain transfer in Chinese word segmentation based on convolutional networks

Country Status (1)

Country Link
CN (1) CN110196980B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967253A (en) * 2017-10-27 2018-04-27 北京大学 A kind of low-resource field segmenter training method and segmenting method based on transfer learning
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 A kind of text emotion analysis method of BiLSTM combination Attention mechanism
CN109753566A (en) * 2019-01-09 2019-05-14 大连民族大学 The model training method of cross-cutting sentiment analysis based on convolutional neural networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ZUYI BAO et al.: "Neural Domain Adaptation with Contextualized Character Embedding for Chinese Word Segmentation", Springer International Publishing *
ZUYI BAO et al.: "Neural Regularized Domain Adaptation for Chinese Word Segmentation", Proceedings of the 9th SIGHAN Workshop on Chinese Language Processing *
LIU Yude: "Research on Chinese word segmentation methods based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology *
SONG Peng et al.: "Cross-corpus speech emotion recognition based on feature transfer learning", Journal of Tsinghua University *
GAO Jun et al.: "A domain adaptation learning framework based on locally weighted mean", Acta Automatica Sinica *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750974A (en) * 2019-09-20 2020-02-04 成都星云律例科技有限责任公司 Structured processing method and system for referee document
CN110750974B (en) * 2019-09-20 2023-04-25 成都星云律例科技有限责任公司 Method and system for structured processing of referee document
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN110765775B (en) * 2019-11-01 2020-08-04 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111127336B (en) * 2019-11-18 2023-05-02 复旦大学 Image signal processing method based on self-adaptive selection module
CN111127336A (en) * 2019-11-18 2020-05-08 复旦大学 Image signal processing method based on self-adaptive selection module
CN111008271B (en) * 2019-11-20 2022-06-24 佰聆数据股份有限公司 Neural network-based key information extraction method and system
CN111008271A (en) * 2019-11-20 2020-04-14 佰聆数据股份有限公司 Neural network-based key information extraction method and system
CN111178149A (en) * 2019-12-09 2020-05-19 中国资源卫星应用中心 Automatic remote sensing image water body extraction method based on residual pyramid network
CN111178149B (en) * 2019-12-09 2023-09-29 中国四维测绘技术有限公司 Remote sensing image water body automatic extraction method based on residual pyramid network
CN111091004B (en) * 2019-12-18 2023-08-25 上海风秩科技有限公司 Training method and training device for sentence entity annotation model and electronic equipment
CN111091004A (en) * 2019-12-18 2020-05-01 上海风秩科技有限公司 Training method and training device for sentence entity labeling model and electronic equipment
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111984791A (en) * 2020-09-02 2020-11-24 南京信息工程大学 Long text classification method based on attention mechanism
CN111984791B (en) * 2020-09-02 2023-04-25 南京信息工程大学 Attention mechanism-based long text classification method
CN112580343A (en) * 2020-11-03 2021-03-30 北京字节跳动网络技术有限公司 Model generation method, question and answer quality judgment method, device, equipment and medium
CN112415408A (en) * 2020-11-10 2021-02-26 南昌济铃新能源科技有限责任公司 Power battery SOC estimation method
CN113076750B (en) * 2021-04-26 2022-12-16 华南理工大学 Cross-domain Chinese word segmentation system and method based on new word discovery
CN113076750A (en) * 2021-04-26 2021-07-06 华南理工大学 Cross-domain Chinese word segmentation system and method based on new word discovery
CN114429129A (en) * 2021-12-22 2022-05-03 南京信息工程大学 Literature mining and material property prediction method
CN114580412A (en) * 2021-12-29 2022-06-03 西安工程大学 Clothing entity identification method based on field adaptation
CN114580412B (en) * 2021-12-29 2024-06-04 西安工程大学 Clothing entity identification method based on field adaptation

Also Published As

Publication number Publication date
CN110196980B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN110196980A (en) A kind of field migration based on convolutional network in Chinese word segmentation task
Abid et al. Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter
CN108763326B (en) Emotion analysis model construction method of convolutional neural network based on feature diversification
CN107145483B (en) A kind of adaptive Chinese word cutting method based on embedded expression
Zhang et al. Neural networks incorporating dictionaries for Chinese word segmentation
Dong et al. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition
Manoharan Capsule network algorithm for performance optimization of text classification
Prusa et al. Improving deep neural network design with new text data representations
CN109766524B (en) Method and system for extracting combined purchasing recombination type notice information
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN110765775B (en) Self-adaptive method for named entity recognition field fusing semantics and label differences
US20240177047A1 Knowledge graph pre-training method based on structural context information
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
Zhuang et al. Natural language processing service based on stroke-level convolutional networks for Chinese text classification
CN110263325A (en) Chinese automatic word-cut
CN112699685A (en) Named entity recognition method based on label-guided word fusion
Boudad et al. Exploring the use of word embedding and deep learning in arabic sentiment analysis
Naqvi et al. Roman Urdu news headline classification empowered with machine learning
Su et al. Low‐Rank Deep Convolutional Neural Network for Multitask Learning
Liu et al. Research on advertising content recognition based on convolutional neural network and recurrent neural network
Huang et al. Multi-view opinion mining with deep learning
Wang et al. Joint Character‐Level Convolutional and Generative Adversarial Networks for Text Classification
Hu et al. Scalable frame resolution for efficient continuous sign language recognition
Wu et al. Conditional consistency regularization for semi-supervised multi-label image classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant