CN104035996B

CN104035996B - Domain concept extraction method based on deep learning

Info

Publication number: CN104035996B
Application number: CN201410259300.1A
Authority: CN
Inventors: 吕钊; 张青
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2014-06-11
Filing date: 2014-06-11
Publication date: 2017-06-16
Anticipated expiration: 2034-06-11
Also published as: CN104035996A

Abstract

The invention discloses a domain concept extraction method based on deep learning. First, samples are extracted from a training corpus, and term frequency, document frequency, inverse document frequency, word length, term frequency variance and domain consistency are chosen as the feature vector. Second, a deep network model is trained using deep learning; the model can effectively represent the complex mapping between the multidimensional feature vectors of single-word domain concepts and their class labels. Finally, in the test phase, the deep network model built with deep learning is compared with an improved BP-NN model and with the mainstream KNN and SVM models. Experiments show that the deep network model trained with deep learning achieves the best experimental results.

Description

Domain concept extraction method based on deep learning

Technical field

The present invention relates to the technical fields of domain concepts, automatic domain concept extraction, artificial neural networks, deep learning and deep belief networks. Specifically, it is a deep-learning-based extraction method that proposes features suited to single-word domain concepts.

Background technology

A domain concept is a form in which domain knowledge is expressed; people use domain concepts to describe objects in a domain and to communicate domain information. For example, "short message" and "CRBT" are concepts of the mobile communications domain, while "data structure" and "computer network" are concepts of the computer domain. In a sense, a domain concept is the human abstraction of things in the course of cognition. It is a form in which domain knowledge appears in text, and it reflects, to some extent, how the domain develops and changes. Domain concepts are generally used frequently within their own domain and much less in other domains.

According to whether they consist of two or more words, domain concepts fall into two classes: single-word and compound. Most existing research targets compound domain concepts, and few studies address single-word domain concepts specifically. Existing single-word extraction methods commonly suffer from low precision and narrow feature selection: researchers often screen domain concepts from non-domain concepts in a single pass using only one or two features, so the ability to discriminate noise is weak. In addition, feature weights and thresholds are not set in a principled way. Suitable values usually have to be chosen from the results of repeated trials, which involves considerable manual intervention, and when the corpus size changes the weights and thresholds must be modified accordingly, so portability is poor. The extraction of single-word domain concepts therefore urgently needs improvement.

Neural networks are a mature class of machine learning methods. They provide a practical and effective way to learn real-valued or vector-valued functions from input data, and they are robust to noise in the data. Neural networks are therefore well suited to learning the mapping between the multidimensional feature vectors of single-word domain concepts and their classes. A neural network with multiple hidden layers has greater expressive power, and deep learning is chiefly concerned with the problem of training neural networks with many hidden layers.

Summary of the invention

The purpose of the present invention is to provide a domain concept extraction method based on deep learning that addresses the weak learning ability of traditional unsupervised methods and the poor results of domain concept extraction. The method converts domain concept extraction into a binary classification problem, uses a richer set of statistical features, and combines deep learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a conventional neural network model is then used for supervised fine-tuning. Compared with KNN and SVM models, the resulting deep network model achieves the highest F value on the test data set.

The specific technical solution that achieves the purpose of the invention is as follows:

A domain concept extraction method based on deep learning, comprising the following specific steps:

a) Training stage

First, the positive and negative samples in the training corpus are extracted and labeled. Then, using the training corpus together with a background corpus, features are extracted from the positive and negative samples to construct the feature vector set. Finally, the feature vector set and the corresponding labels are used to train the deep network (DN) model in the MATLAB deep learning toolbox environment;

b) Test phase

The goal is to check how well the deep network (DN) model obtained in the training stage classifies the test corpus. First, candidate extraction and feature extraction are performed on the test corpus in turn to construct the feature vector set. The feature vector set is then fed to the DN model, which automatically judges and identifies the feature vectors, thereby classifying the candidates in the test corpus. Finally, the correct domain concept set is obtained from the classification results and manual review.

The feature vector set is constructed from the following features:

1) Term frequency (TF);

2) Document frequency (DF);

3) Inverse document frequency (IDF);

4) Word length (LEN);

5) Term frequency variance (TV);

6) Domain consistency (DC).

The training in step a) that yields the deep network (DN) model specifically comprises:

i) Unsupervised learning is carried out using only the feature vectors of the training data to construct a deep belief network (Deep Belief Nets, DBN);

The feature vectors are fed to the input layer one at a time to train the first-layer restricted Boltzmann machine (Restricted Boltzmann Machine, RBM). The parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM to train the second layer. Likewise, the parameters of the first two RBM layers are fixed, and the output of the second-layer RBM is used to train the third-layer RBM. Once all feature vectors have been learned, the training of the whole DBN is complete;

ii) The deep network DN is initialized with the parameters of the deep belief network DBN. The back-propagation algorithm is then used to fine-tune the DN parameters in a supervised manner according to the class labels of the training samples. The parameter adjustment of this second part ends after a set number of iterations or once the error has fallen into the range 0.001 to 0.005. At this point, the training stage of the DN model is complete.

In step b), the candidates in the test corpus are classified by treating domain concept extraction as binary classification, i.e. "domain concept" versus "non-domain concept". From the output value of the DN model, the co-occurrence probability p(x, y) of candidate feature x and class y is obtained and used to measure the confidence that a candidate concept belongs to class y given features x. Here x denotes the feature vector of the candidate concept, and y denotes one of the two classes "domain concept" and "non-domain concept". The classifier obtained from the training corpus is used to determine the class of each candidate concept in the test data set automatically.
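The decision rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how the DN output values are turned into probabilities, so the softmax normalization over two output units, the label order and the function names are all assumptions.

```python
import math

# Assumed label order for the two output units (hypothetical).
LABELS = ("domain concept", "non-domain concept")

def class_confidences(outputs):
    """Normalize raw output-unit values into confidences that sum to 1.

    Softmax is an assumption; the source only says that a confidence is
    derived from the DN output value.
    """
    largest = max(outputs)
    exps = [math.exp(o - largest) for o in outputs]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(outputs):
    """Return the most confident label and its confidence."""
    conf = class_confidences(outputs)
    best = max(range(len(conf)), key=conf.__getitem__)
    return LABELS[best], conf[best]

label, confidence = classify([2.0, 0.0])
print(label, round(confidence, 4))
```

A candidate is assigned to whichever class has the larger confidence; a threshold on the confidence could be used instead if fewer false positives are wanted before manual review.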

The invention provides a domain concept extraction method based on deep learning, comprising the classification formulation of domain concept extraction and a deep learning domain concept extraction algorithm. For single-word domain concepts, the method recognizes domain terms better than a conventional neural network model, the classical KNN model and the SVM model on the same experimental data set.

The invention combines deep learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a conventional neural network model is then used for supervised fine-tuning. The resulting deep network model achieves higher precision on the test data set while maintaining a reasonable recall, giving the best overall recognition performance.

With the present invention, single-word domain concepts can be extracted effectively using deep learning, which benefits research in information retrieval, machine translation, ontology learning and related areas.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is the training flow chart of the invention;

Fig. 3 is the test flow chart of the invention;

Fig. 4 is the deep network architecture diagram of the invention;

Fig. 5 is a comparison chart of the experimental metrics of the different classification models.

Specific embodiment

The present invention is a domain concept extraction method based on deep learning, comprising the classification formulation of domain concept extraction and the deep learning domain concept extraction procedure, wherein: in the classification formulation, domain concept extraction is treated as binary classification into the two classes "domain concept" and "non-domain concept". Following the machine learning approach, features are collected from training samples, a classifier is constructed, and the classifier automatically determines the class of each candidate concept in the test data set. Specifically, classification estimates the co-occurrence probability p(x, y) of candidate-concept feature x and class y, and uses it to measure the confidence that a candidate concept belongs to class y given features x. Here x denotes the feature vector of the candidate concept, and y denotes one of the two classes "domain concept" and "non-domain concept".

Deep-learning-based domain concept extraction (Deep Learning based Domain Concept Extraction Algorithm, DLDoC) is divided into two stages, training and testing. As shown in Fig. 1, the training module first learns a deep network (Deep Nets, DN) model from the training data; the test module then uses the DN model trained in the previous step to classify the test data automatically. The classification results are manually reviewed to obtain the final, correct domain concept set. The specific steps are as follows:

i) Training stage: the training stage builds the deep network model. As shown in Fig. 2, the positive and negative samples in the training corpus are first extracted and labeled. Then, using the training corpus together with the background corpus, features are extracted from the resulting positive and negative samples to construct the feature vector set. Finally, the model is trained on the feature vector set and the corresponding label data. The whole training process can be understood as a mapping from the training corpus to the model that passes through the sample space and then the feature space.

ii) Test phase: the test phase checks how well the DN model obtained in the preceding training process recognizes the test data set. As shown in Fig. 3, and similarly to the training process, candidate extraction and feature extraction are first performed on the test corpus in turn to construct the feature vector set. The feature vector set is then fed to the DN model, which automatically judges and identifies the feature vectors and thereby classifies the candidates. Finally, the classification results are compared with the manual annotations to compute the overall recognition performance.
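The comparison of classification results with manual annotations can be sketched as below. The document reports precision, recall and F value but gives no formulas, so the standard definitions and the balanced F measure (F1) are assumptions here, as is the function name.

```python
def precision_recall_f(predicted, gold):
    """Compare predicted labels with manual annotations.

    predicted, gold: equal-length lists of booleans, True = domain concept.
    Returns (precision, recall, F), where F is the balanced F1 measure
    (an assumption; the source only speaks of an "F value").
    """
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # correctly accepted
    fp = sum(1 for p, g in pairs if p and not g)    # wrongly accepted
    fn = sum(1 for p, g in pairs if not p and g)    # wrongly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy data, invented for illustration only.
predicted = [True, True, False, True, False]
gold = [True, False, False, True, True]
print(precision_recall_f(predicted, gold))
```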

Construction of the feature vector set:

Starting from the TF-IDF approach used by most researchers, the present invention selects the following features:

1) Term frequency (TF);

2) Document frequency (DF);

3) Inverse document frequency (IDF);

4) Word length (LEN);

5) Term frequency variance (TV);

6) Domain consistency (DC).
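The six features can be illustrated with the sketch below. The document names the features but does not define them, so every formula here is an assumption: IDF is taken as ln(N / DF) over the domain corpus, TV as the population variance of the per-document counts, and DC as the entropy of the term's distribution over documents. The background corpus mentioned in the text is not used, because the source does not say which feature draws on it. The function name and toy corpus are hypothetical.

```python
import math

def extract_features(term, documents):
    """Return [TF, DF, IDF, LEN, TV, DC] for one candidate term.

    documents: non-empty list of token lists from the domain corpus.
    All formulas are illustrative assumptions, not the patent's definitions.
    """
    counts = [doc.count(term) for doc in documents]
    n_docs = len(documents)
    tf = sum(counts)                                   # total occurrences
    df = sum(1 for c in counts if c > 0)               # documents containing the term
    idf = math.log(n_docs / df) if df else 0.0         # assumed ln(N / DF)
    length = len(term)                                 # characters in the term
    mean = tf / n_docs
    tv = sum((c - mean) ** 2 for c in counts) / n_docs # variance of per-document counts
    # Assumed DC: entropy of the term's distribution across documents.
    dc = -sum((c / tf) * math.log(c / tf) for c in counts if c > 0) if tf else 0.0
    return [tf, df, idf, length, tv, dc]

# Toy corpus, invented for illustration only.
docs = [
    ["commander", "issues", "order"],
    ["commander", "briefs", "commander", "staff"],
    ["weather", "clear"],
]
print(extract_features("commander", docs))
```

For Chinese text, `len(term)` would count characters, which matches a word length of 2 for a two-character word.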

The structure of the deep network (DN) model is shown in Fig. 4:

i) Unsupervised learning is carried out using only the feature vectors of the training data to construct a deep belief network (Deep Belief Nets, DBN). The feature vectors are fed to the input layer one at a time to train the first-layer RBM. The parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM to train the second layer. Likewise, the parameters of the first two RBM layers are fixed, and the output of the second-layer RBM is used to train the third-layer RBM. Once all feature vectors have been learned, the training of the whole DBN is complete.
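The greedy layer-by-layer procedure of step i) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MATLAB toolbox code the document refers to: the text does not name the RBM training rule, so one-step contrastive divergence (CD-1) is assumed, inputs are assumed to be scaled to [0, 1], and the layer sizes, learning rate, epoch count and all names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def train(self, data, epochs=20, lr=0.1):
        """Assumed CD-1 update, one feature vector at a time."""
        for _ in range(epochs):
            for v0 in data:
                h0 = self.hidden_probs(v0)
                # Sample binary hidden states to drive the reconstruction.
                h_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
                v1 = self.visible_probs(h_sample)
                h1 = self.hidden_probs(v1)
                # Positive statistics minus reconstruction statistics.
                for i in range(len(v0)):
                    for j in range(len(h0)):
                        self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                for i in range(len(v0)):
                    self.b_vis[i] += lr * (v0[i] - v1[i])
                for j in range(len(h0)):
                    self.b_hid[j] += lr * (h0[j] - h1[j])

def pretrain_dbn(data, layer_sizes, seed=0):
    """Train one RBM per layer; each trained layer is frozen and feeds the next."""
    rng = random.Random(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        rbm.train(layer_input)
        rbms.append(rbm)  # not updated again after this point
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return rbms

# Two invented six-dimensional feature vectors, already scaled to [0, 1].
features = [[0.9, 0.2, 0.7, 0.1, 0.8, 0.6], [0.1, 0.8, 0.2, 0.9, 0.1, 0.3]]
dbn = pretrain_dbn(features, layer_sizes=[8, 6, 4])
print(len(dbn))
```

Three layer sizes are passed because the text describes three stacked RBMs; no class labels are used anywhere in this stage.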

ii) The DN is initialized with the parameters of the DBN. The back-propagation algorithm is then used to fine-tune the network in a supervised manner according to the class labels of the training samples. The parameter adjustment of this second part ends after a set number of iterations or once the error has fallen into the range 0.001 to 0.005. At this point, the training of the DN model is complete, and the model can be used to predict the class of unknown samples.
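The supervised fine-tuning and its stopping rule can be sketched as below. The sketch makes several assumptions: the error is taken to be the mean of half the squared output error per sample (the text does not define it), 0.005 is used as the threshold because it is the upper end of the stated range, and random weights stand in for the DBN parameters that would initialize the network in the described method. All names and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    """Return the activations of every layer, the input included."""
    activations = [x]
    for w, b in layers:
        prev = activations[-1]
        activations.append([sigmoid(b[j] + sum(prev[i] * w[i][j] for i in range(len(prev))))
                            for j in range(len(b))])
    return activations

def fine_tune(layers, samples, labels, lr=0.5, max_epochs=5000, target_error=0.005):
    """Back-propagation; stops after max_epochs or when the mean error <= target_error."""
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        error = 0.0
        for x, y in zip(samples, labels):
            acts = forward(layers, x)
            out = acts[-1]
            error += sum((o - t) ** 2 for o, t in zip(out, y)) / 2
            # Error signal at the sigmoid output units.
            delta = [(o - t) * o * (1 - o) for o, t in zip(out, y)]
            for k in range(len(layers) - 1, -1, -1):
                w, b = layers[k]
                prev = acts[k]
                # Propagate the signal downward before this layer's weights change.
                prev_delta = [prev[i] * (1 - prev[i]) *
                              sum(w[i][j] * delta[j] for j in range(len(delta)))
                              for i in range(len(prev))]
                for i in range(len(prev)):
                    for j in range(len(delta)):
                        w[i][j] -= lr * prev[i] * delta[j]
                for j in range(len(delta)):
                    b[j] -= lr * delta[j]
                delta = prev_delta
        error /= len(samples)
        if error <= target_error:
            break
    return epoch, error

rng = random.Random(0)

def random_layer(n_in, n_out):
    # Stand-in for DBN-derived parameters (assumption for this sketch).
    return ([[rng.gauss(0, 0.5) for _ in range(n_out)] for _ in range(n_in)], [0.0] * n_out)

layers = [random_layer(2, 3), random_layer(3, 1)]
samples = [[0.9, 0.8], [0.1, 0.2]]   # invented toy inputs
labels = [[1.0], [0.0]]              # 1 = domain concept, 0 = non-domain concept
epochs_run, final_error = fine_tune(layers, samples, labels)
print(epochs_run, final_error)
```

Either stopping condition ends the loop, which mirrors the "set number of iterations or error threshold" rule in the text.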

Embodiment

The invention is further described below with reference to the drawings, taking military-domain material as an example.

Referring to Fig. 1, samples are first extracted from the training corpus, features are extracted from the samples, and feature vectors are selected to train the DN model. The resulting DN model then classifies the test data automatically. The classification results are manually reviewed to obtain the final, correct domain concept set.

In this embodiment, as shown in Fig. 2, the training corpus is transformed into the sample space, and feature vectors are constructed from the features selected above. Table 1 lists the feature values of some of the training samples that the invention extracted from the military-domain material.

Table 1: Features of some military-domain training samples

Model training uses the positive and negative sample sets and the corresponding feature vector sets obtained in the first two steps to learn the relationship between the feature vectors and the sample labels and to train the deep network (DN) model. For each sample, the model completes the mapping from feature vector to label, which yields the parameters of the DN model.

In this embodiment, as shown in Fig. 3, the test sample "commandant" has the feature vector 29 6 4.8078 2 208.9667 1.4144. After testing, this feature vector is identified as a positive example, which shows that the DN model recognizes the sample set well.
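For readability, the example vector can be paired with the feature names. The ordering TF, DF, IDF, LEN, TV, DC is an assumption taken from the order in which the features are listed; the document does not state the component order explicitly.

```python
# Assumed component order, following the feature list in the description.
FEATURE_NAMES = ("TF", "DF", "IDF", "LEN", "TV", "DC")

# Feature vector quoted in the embodiment for the test sample "commandant".
vector = [29, 6, 4.8078, 2, 208.9667, 1.4144]

named = dict(zip(FEATURE_NAMES, vector))
for name, value in named.items():
    print(f"{name:>3}: {value}")
```

Under this reading the term occurs 29 times across 6 documents and is two characters long in the original language.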

The invention also compares the DN model, built by combining the DBN with a neural network, against the conventional KNN and SVM models. As shown in Fig. 5, the DBN+NN model pre-trained with the DBN obtains good and stable precision, exceeding the KNN and SVM models by 13.05 and 23.09 percentage points respectively. On the F value, which reflects overall performance, the DBN+NN model built by the invention obtains the highest value, 2.53 percentage points above the SVM model; the F value of the basic NN2 model differs little from that of the KNN model.

Claims

1. A domain concept extraction method based on deep learning, characterized in that the method comprises the following specific steps:

a) Training stage

First, the positive and negative samples in the training corpus are extracted and labeled; then, using the training corpus together with a background corpus, features are extracted from the positive and negative samples to construct the feature vector set; finally, the feature vector set and the corresponding labels are used to train the deep network DN model in the MATLAB deep learning toolbox environment, wherein the deep network DN model is trained according to the following steps:

i) unsupervised learning is carried out using only the feature vectors of the training data to construct a deep belief network DBN;

the feature vectors are fed to the input layer one at a time to train the first-layer restricted Boltzmann machine RBM; the parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM to train the second-layer RBM; likewise, the parameters of the first two RBM layers are fixed, and the output of the second-layer RBM is used to train the third-layer RBM; once all feature vectors have been learned, the training of the whole deep belief network DBN is complete;

ii) the deep network DN is initialized with the parameters of the deep belief network DBN, and the back-propagation algorithm is then used to fine-tune the DN parameters in a supervised manner according to the class labels of the training samples; the parameter adjustment of this second part ends after a set number of iterations or once the error has fallen into the range 0.001 to 0.005, completing the training stage of the deep network DN model;

b) Test phase

First, candidate extraction and feature extraction are performed on the test corpus in turn to construct the feature vector set; the feature vector set is then fed to the deep network DN model, which automatically judges and identifies the feature vectors, thereby classifying the candidates in the test corpus; finally, the correct domain concept set is obtained from the classification results and manual review.

2. The method according to claim 1, characterized in that the feature vector set is constructed from the following features:

Term frequency (TF);

Document frequency (DF);

Inverse document frequency (IDF);

Word length (LEN);

Term frequency variance (TV);

Domain consistency (DC).

3. The method according to claim 1, characterized in that, in step b), the candidates in the test corpus are classified by treating domain concept extraction as binary classification, dividing candidate concepts into the two classes domain concept and non-domain concept; from the output value of the DN model, the co-occurrence probability p(x, y) of candidate feature x and class y is obtained, and p(x, y) is used to measure the confidence that a candidate concept belongs to class y given features x; x denotes the feature vector of the candidate concept, and y denotes one of the two classes domain concept and non-domain concept; the classifier obtained from the training corpus is used to determine the class of each candidate concept in the test data set automatically.