CN104035996A

CN104035996A - Domain concept extraction method based on Deep Learning

Info

Publication number: CN104035996A
Application number: CN201410259300.1A
Authority: CN
Inventors: 吕钊; 张青
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2014-06-11
Filing date: 2014-06-11
Publication date: 2014-09-10
Anticipated expiration: 2034-06-11
Also published as: CN104035996B

Abstract

The invention discloses a domain concept extraction method based on Deep Learning. The method extracts samples from a training corpus and uses word frequency, document frequency, inverse document frequency, word length, word frequency variance and domain consensus as the feature vector. On the basis of Deep Learning technology, it trains a deep network model capable of representing the complex mapping between the multi-dimensional feature vectors of word-type domain concepts and their class labels. In the testing step, this deep network model is compared with an optimized BP neural network model and with the mainstream KNN and SVM models. According to the tests, the deep network model built on Deep Learning technology achieves the best test results.

Description

Domain concept extraction method based on Deep Learning

Technical field

The present invention relates to the technical fields of domain concepts, automatic domain concept extraction, artificial neural networks, Deep Learning and deep belief networks, and specifically proposes a Deep Learning based feature extraction method suited to the features of word-type domain concepts.

Background technology

A domain concept is a form of expression of domain knowledge; people use domain concepts to describe objects in a domain and to exchange domain information. For example, "short message" and "CRBT" are concepts of the mobile communication domain, while "data structure" and "computer network" are concepts of the computer domain. In a sense, a domain concept is a human abstraction of things made in the course of cognition; it is a form in which domain knowledge is expressed in text, and to some extent it reflects the development and change of the domain. A domain concept is usually used frequently within its own domain and rarely in other domains.

Depending on whether they consist of two or more words, domain concepts fall into two classes: word-type and compound. Most existing research targets compound domain concepts, and little research addresses word-type domain concepts separately. Moreover, existing word-type domain concept extraction methods generally suffer from low accuracy and narrow feature selection: researchers often rely on only one or two features to separate domain concepts from non-domain concepts, which gives weak discrimination against noise. In addition, feature weights and thresholds are not set in a principled way. Suitable values generally have to be chosen from the results of repeated experiments, which requires considerable manual intervention, and when the corpus scale changes the weights and thresholds must be revised accordingly, so portability is poor. The extraction of word-type domain concepts is therefore in urgent need of improvement.

Neural networks are a mature class of machine learning methods. They provide a practical and effective way to learn real-valued or vector-valued functions from input data, and they are robust to noise in the data. Neural networks are therefore well suited to learning the mapping between the multi-dimensional feature vectors of word-type domain concepts and their corresponding classes. A neural network with multiple hidden layers has stronger expressive power, and Deep Learning is mainly concerned with solving the learning problem of neural networks with many hidden layers.

Summary of the invention

The object of the present invention is to provide a domain concept extraction method based on Deep Learning that addresses the weak learning ability of traditional unsupervised methods and the poor results of domain concept extraction. The method converts domain concept extraction into a binary classification problem, adopts richer statistical features, and combines Deep Learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a traditional neural network model is then used for supervised fine-tuning. Compared with the KNN and SVM models, the deep network model finally trained obtains the highest F value on the test data set.

The specific technical solution that achieves the object of the invention is:

A domain concept extraction method based on Deep Learning, comprising the following steps:

A) training stage

First, positive and negative samples are extracted from the training corpus and labeled. Then, combining the training corpus with a background corpus, features are extracted from the positive and negative samples to construct the feature vector set. Finally, the feature vector set and the corresponding labels are used to train the deep network (DN) model in the environment of the MATLAB Deep Learning toolbox;

B) test phase

The goal is to use the DN model obtained in the training stage to check the classification effect on the test corpus. First, candidate extraction and feature extraction are performed in turn on the test corpus to construct the feature vector set. Then the feature vector set is input to the DN model, which automatically judges and identifies the feature vectors, thereby classifying the candidates of the test corpus. Finally, the correct set of domain concepts is obtained from the classification results and manual review.

The feature vector set is constructed from the following features:

1) word frequency (TF);

2) document frequency (DF);

3) inverse document frequency (IDF);

4) word length (LEN);

5) word frequency variance (TV);

6) domain consensus (DC).

Training to obtain the deep network (DN) model in step A) specifically comprises:

I) using only the feature vectors of the training data, a deep belief network (Deep Belief Nets, DBN) is constructed by unsupervised learning;

A feature vector is fed to the input layer and the first-layer restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) is trained; the parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM, which is trained next; similarly, with the parameters of the first two RBM layers fixed, the output of the second-layer RBM is used to train the third-layer RBM; once all feature vectors have been learned, the training of the whole DBN is complete;

II) the parameters of the DBN are used to initialize the deep network DN; the back-propagation algorithm is then applied to fine-tune the DN parameters in a supervised manner according to the class labels of the training samples; when a set number of iterations has been reached or the error has fallen into the range 0.001 to 0.005, the parameter adjustment of this second part ends; at this point the training stage of the DN model is complete.

The classification of the candidates of the test corpus in step B) treats domain concept extraction as binary classification into "domain concept" and "non-domain concept". From the output value of the DN model, the joint (co-occurrence) probability p(x, y) of the candidate feature x and the class y is obtained and used to measure the confidence that a candidate concept with feature x belongs to class y. Here x denotes the feature vector of the candidate concept, and y denotes one of the two classes "domain concept" and "non-domain concept". The classifier obtained from the training corpus is then used to determine automatically the class of each candidate concept in the test data set.

The invention provides a domain concept extraction method based on Deep Learning, comprising the treatment of domain concept extraction as a classification problem and the proposed Deep Learning domain concept extraction algorithm. For the extraction of word-type domain concepts, the method recognizes domain words better than a traditional neural network model, the classical KNN model and the SVM model on the same experimental data set.

The present invention combines Deep Learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a traditional neural network model is then used for supervised fine-tuning. The deep network model finally trained obtains higher precision on the test data set while maintaining a reasonable recall, and its overall recognition performance is the best.

With the present invention, extraction results for word-type domain concepts can be obtained effectively on the basis of Deep Learning technology, which is of positive significance for research in information retrieval, machine translation, ontology learning and related areas.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a flow chart of the training stage of the present invention;

Fig. 3 is a flow chart of the test stage of the present invention;

Fig. 4 is a structural diagram of the deep network model of the present invention;

Fig. 5 is a comparison chart of the experimental metrics of the different classification models.

Detailed description of the embodiments

The present invention is a domain concept extraction method based on Deep Learning, comprising the classification formulation of domain concept extraction and the Deep Learning based domain concept extraction algorithm. In the classification formulation, domain concept extraction is treated as binary classification into the two classes "domain concept" and "non-domain concept". Following the machine learning approach, features are collected from training samples and a classifier is constructed, which then automatically determines the class of each candidate concept in the test data set. Specifically, classification estimates the joint (co-occurrence) probability p(x, y) of the candidate concept feature x and the class y, and uses it to measure the confidence that a candidate concept with feature x belongs to class y. Here x denotes the feature vector of the candidate concept, and y denotes one of the two classes "domain concept" and "non-domain concept".
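
One way to read this confidence measure, assuming the two classes are exhaustive, is as a decision rule on the joint probability. The symbols \hat{y}, y_dc ("domain concept") and y_non ("non-domain concept") are introduced here for illustration only and do not appear in the original text:

```latex
p(y \mid x) = \frac{p(x, y)}{p(x, y_{\mathrm{dc}}) + p(x, y_{\mathrm{non}})},
\qquad
\hat{y} = \arg\max_{y \in \{y_{\mathrm{dc}},\, y_{\mathrm{non}}\}} p(x, y)
```

Because the denominator does not depend on y, choosing the class with the larger joint probability is the same as choosing the class with the larger posterior p(y | x).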

The Deep Learning based domain concept extraction algorithm (Deep Learning based Domain Concept Extraction Algorithm, DLDoC) is divided overall into two stages, training and testing. As shown in Fig. 1, the training module first uses the training data to learn a deep network (Deep Nets, DN) model, and the test module then uses the DN model trained in the previous step to classify the test data automatically. The correct set of domain concepts is finally obtained from the classification results by manual review. The concrete steps are as follows:

I) Training stage: the training stage builds the deep network model. As shown in Fig. 2, positive and negative samples are first extracted from the training corpus and labeled. Then, combining the training corpus with a background corpus, features are extracted from the positive and negative samples to construct the feature vector set. Finally, the model is trained with the feature vector set and the corresponding label data. The whole training process can be understood as a mapping from the training corpus to the model that passes in turn through the sample space and the feature space.

II) Test stage: the test stage uses the DN model obtained by the preceding training process to check the recognition effect on the test data set. As shown in Fig. 3, and similarly to the training process, candidate extraction and feature extraction are first performed in turn on the test corpus to construct the feature vector set. The feature vector set is then input to the DN model, which automatically judges and identifies the feature vectors and thereby classifies the candidates. Finally, the classification results are compared with the manual annotation to calculate the overall recognition effect.

Construction of the feature vector set:

With respect to the TF-IDF method adopted by most researchers, the present invention selects the following features:

1) word frequency (TF);

2) document frequency (DF);

3) inverse document frequency (IDF);

4) word length (LEN);

5) word frequency variance (TV);

6) domain consensus (DC).
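
The six features above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the source gives no formulas, so the function name `extract_features`, the use of the background corpus for IDF with add-one smoothing, the population variance for TV, and the entropy form of DC (domain consensus) are all assumptions.

```python
import math
from collections import Counter

def extract_features(term, domain_docs, background_docs):
    """Return [TF, DF, IDF, LEN, TV, DC] for one candidate term.

    domain_docs and background_docs are lists of tokenized documents
    (each document is a list of words). All formulas are assumptions.
    """
    # Frequency of the term in each domain document.
    counts = [Counter(doc)[term] for doc in domain_docs]
    tf = sum(counts)                          # total word frequency
    df = sum(1 for c in counts if c > 0)      # documents containing the term
    # Assumption: IDF is measured on the background corpus, add-one smoothed.
    df_bg = sum(1 for doc in background_docs if term in doc)
    idf = math.log((len(background_docs) + 1) / (df_bg + 1))
    length = len(term)                        # word length in characters
    # Assumption: TV is the population variance of per-document frequency.
    mean = tf / len(domain_docs)
    tv = sum((c - mean) ** 2 for c in counts) / len(domain_docs)
    # Assumption: DC is the entropy of the term's distribution over documents.
    dc = -sum((c / tf) * math.log(c / tf) for c in counts if c > 0) if tf else 0.0
    return [tf, df, idf, length, tv, dc]

domain = [["tank", "tank", "army"], ["tank", "navy"], ["army"]]
background = [["tank", "car"], ["car"], ["road"], ["sky"]]
features = extract_features("tank", domain, background)
```

A term that is frequent and evenly spread over the domain documents gets a high DC, while a term concentrated in a single document gets a DC of zero.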

The construction of the deep network DN model, as shown in Fig. 4:

I) Using only the feature vectors of the training data, a deep belief network (Deep Belief Nets, DBN) is constructed by unsupervised learning. A feature vector is fed to the input layer and the first-layer RBM is trained. The parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM, which is trained next. Similarly, with the parameters of the first two RBM layers fixed, the output of the second-layer RBM is used to train the third-layer RBM. Once all feature vectors have been learned, the training of the whole DBN is complete.
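
The greedy layer-wise procedure in step I) can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the MATLAB toolbox code the patent refers to: it uses Bernoulli RBMs trained by one-step contrastive divergence (CD-1), assumes the features have been scaled to [0, 1], and the names `train_rbm` and `pretrain_dbn`, the layer sizes and the learning rate are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train one Bernoulli RBM with CD-1; return (W, b_visible, b_hidden)."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        v_recon = sigmoid(h_sample @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_v, b_h

def pretrain_dbn(features, layer_sizes, **rbm_kwargs):
    """Train RBMs one at a time; each sees the output of the frozen layers below."""
    layers, layer_input = [], features
    for n_hidden in layer_sizes:
        W, _, b_h = train_rbm(layer_input, n_hidden, **rbm_kwargs)
        layers.append((W, b_h))
        # The trained layer is fixed; its output becomes the next RBM's input.
        layer_input = sigmoid(layer_input @ W + b_h)
    return layers

X = np.random.default_rng(1).random((20, 6))   # 20 samples, 6 features in [0, 1]
dbn_layers = pretrain_dbn(X, [8, 8, 4], epochs=5)
```

Each `(W, b_h)` pair returned here can serve as the initial weights of one hidden layer of the DN before supervised fine-tuning.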

II) The parameters of the DBN are used to initialize the DN, and the back-propagation algorithm is then applied to fine-tune it in a supervised manner according to the class labels of the training samples. When a set number of iterations has been reached or the error has fallen into the range 0.001 to 0.005, the parameter adjustment of this second part ends. At this point the training of the DN model is complete, and the model can be used to predict the class of unknown samples.
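
The supervised fine-tuning in step II) might look like the following sketch. It assumes sigmoid units, a mean squared error objective, full-batch gradient descent and a single added output unit; the function name `fine_tune`, the learning rate and the default iteration cap are hypothetical, and only the stopping rule (iteration limit or error threshold in the 0.001 to 0.005 range) comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(layers, X, y, lr=0.5, max_iter=2000, tol=0.005, seed=0):
    """Fine-tune pre-trained hidden layers plus a new output unit by backprop.

    layers: list of (W, b) pairs, for example taken from a pre-trained DBN.
    Stops after max_iter iterations or once the mean squared error <= tol.
    """
    rng = np.random.default_rng(seed)
    params = [(W.copy(), b.copy()) for W, b in layers]
    # Assumption: one sigmoid output unit is stacked on the last hidden layer.
    n_top = params[-1][0].shape[1]
    params.append((0.01 * rng.standard_normal((n_top, 1)), np.zeros(1)))
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    error = np.inf
    for _ in range(max_iter):
        # Forward pass, keeping the activation of every layer.
        acts = [X]
        for W, b in params:
            acts.append(sigmoid(acts[-1] @ W + b))
        error = float(np.mean((acts[-1] - y) ** 2))
        if error <= tol:
            break
        # Backward pass (the constant factor 2 is absorbed into lr).
        delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
        for i in range(len(params) - 1, -1, -1):
            W, b = params[i]
            grad_W = acts[i].T @ delta / len(X)
            grad_b = delta.mean(axis=0)
            if i > 0:
                # Propagate the error through the not-yet-updated weights.
                delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            params[i] = (W - lr * grad_W, b - lr * grad_b)
    return params, error

rng = np.random.default_rng(1)
X = rng.random((40, 6))
y = (X[:, 0] > 0.5).astype(float)
init = [(0.1 * rng.standard_normal((6, 5)), np.zeros(5))]
_, start_error = fine_tune(init, X, y, max_iter=1)
tuned, end_error = fine_tune(init, X, y, max_iter=300)
```

The returned error is the value measured at the start of the last iteration, so calling the function with `max_iter=1` reports the error of the initial, untuned network.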

Example

The present invention is further described below with reference to the accompanying drawings, taking a military-domain corpus as an example.

Referring to Fig. 1, samples are first extracted from the training corpus, features are extracted from the samples, and the feature vectors are used to train the model, namely the DN model. The resulting DN model then automatically classifies the test data. The correct set of domain concepts is finally obtained from the classification results by manual review.

In this example, as shown in Fig. 2, the training corpus is converted to the sample space. The present invention constructs the feature vector from the features listed above, and Table 1 lists the feature values of some of the training samples that the present invention extracted from the military-domain corpus.

Table 1. Features of selected training samples from the military domain

Model training uses the positive and negative sample set obtained in the first two steps, together with the corresponding feature vector set, to learn the relation between the feature vectors and the sample label data and so to train the deep network (DN) model. For each sample this model completes the mapping from feature vector to label; that is, the parameters of the DN model are obtained.

In this example, as shown in Fig. 3, the test sample "commandant" is selected, with the feature vector: 29 6 4.8078 2 208.9667 1.4144. In the test this feature vector is identified as a positive example, which illustrates that the DN model has good recognition capability on the sample set.
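
For readability, the sample vector can be paired with the feature names. The ordering TF, DF, IDF, LEN, TV, DC is an assumption taken from the order of the feature list; the source does not state it explicitly. Under that reading, a LEN of 2 presumably counts the characters of the original Chinese term rather than the letters of its English translation.

```python
# Assumed feature order, following the numbered feature list.
FEATURE_NAMES = ["TF", "DF", "IDF", "LEN", "TV", "DC"]

# Feature vector of the test sample "commandant" as given in the text.
commandant = [29, 6, 4.8078, 2, 208.9667, 1.4144]

labeled = dict(zip(FEATURE_NAMES, commandant))
for name, value in labeled.items():
    print(f"{name}: {value}")
```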

The DN model built by the present invention, which combines a DBN with a neural network, is also compared with the traditional KNN and SVM models. As shown in Fig. 5, the DBN+NN model with DBN pre-training achieves relatively good and stable precision, exceeding the KNN model and the SVM model by 13.05 and 23.09 percentage points respectively. On the F value, which reflects overall performance, the DBN+NN model built by the present invention obtains the highest value, exceeding the SVM model by 2.53 percentage points, while the F values of the basic NN2 model and the KNN model differ little.
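
The metrics compared in Fig. 5 can be computed as in the sketch below. It assumes that the F value is the balanced F1 measure (the harmonic mean of precision and recall) and that label 1 marks a domain concept; the source does not give the formula, and the function name `precision_recall_f` and the sample labels are hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F1) for binary labels, where 1 = domain concept."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)   # correctly accepted
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)   # wrongly accepted
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)   # wrongly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value

# Hypothetical classifier output compared against manual annotation.
precision, recall, f_value = precision_recall_f([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```

Because the F value is a harmonic mean, a model needs both reasonable precision and reasonable recall to score well on it, which is why it is used here as the overall performance measure.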

Claims

1. A domain concept extraction method based on Deep Learning, characterized in that the method comprises the following steps:

A) training stage

B) test phase

First, candidate extraction and feature extraction are performed in turn on the test corpus to construct the feature vector set; then the feature vector set is input to the deep network DN model, which automatically judges and identifies the feature vectors, thereby classifying the candidates of the test corpus; finally, the correct set of domain concepts is obtained from the classification results and manual review.

2. The method according to claim 1, characterized in that the feature vector set is constructed from the following features:

Word frequency (TF);

Document frequency (DF);

Inverse document frequency (IDF);

Word length (LEN);

Word frequency variance (TV);

Domain consensus (DC).

3. The method according to claim 1, characterized in that training to obtain the deep network DN model in step A) specifically comprises:

I) using only the feature vectors of the training data, a deep belief network DBN is constructed by unsupervised learning;

A feature vector is fed to the input layer and the first-layer restricted Boltzmann machine RBM is trained; the parameters of the first-layer RBM are then fixed, and its output is used as the input of the second-layer RBM, which is trained next; similarly, with the parameters of the first two RBM layers fixed, the output of the second-layer RBM is used to train the third-layer RBM; once all feature vectors have been learned, the training of the whole DBN is complete;

4. The method according to claim 1, characterized in that the classification of the candidates of the test corpus in step B) treats domain concept extraction as binary classification into "domain concept" and "non-domain concept"; from the output value of the DN model, the joint (co-occurrence) probability p(x, y) of the candidate feature x and the class y is obtained and used to measure the confidence that a candidate concept with feature x belongs to class y; x denotes the feature vector of the candidate concept, and y denotes one of the two classes "domain concept" and "non-domain concept"; the classifier obtained from the training corpus is used to determine automatically the class of each candidate concept in the test data set.