CN104035996B - Field concept abstracting method based on Deep Learning - Google Patents
- Publication number
- CN104035996B (application CN201410259300.1A)
- Authority
- CN
- China
- Prior art keywords
- training
- classification
- field concept
- models
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention discloses a domain concept extraction method based on deep learning. Samples are first extracted from a training corpus, and term frequency, document frequency, inverse document frequency, word length, term frequency variance, and domain consistency are chosen as the feature vector. A deep network model is then trained with deep learning techniques; the model can effectively represent the complex mapping between the multidimensional feature vectors of single-word domain concepts and their class labels. Finally, in the test stage, the deep network model built with deep learning is compared with an improved BP-NN model and with the mainstream KNN and SVM models. Experiments show that the deep network model trained with deep learning achieves the best experimental results.
Description
Technical field
The present invention relates to the technical fields of automatic domain concept extraction, domain concepts, artificial neural networks, deep learning, and deep belief networks, and specifically to a deep-learning-based extraction method that proposes features suited to single-word domain concepts.
Background art
A domain concept is a form in which domain knowledge is expressed; people use domain concepts to describe objects in a domain and to communicate domain information. For example, "short message" and "CRBT" (color ring back tone) are concepts of the mobile communications domain, while "data structure" and "computer network" are concepts of the computer domain. In a sense, a domain concept is an abstraction of things formed during human cognition; it is the form domain knowledge takes in text, and it reflects, to some extent, how the domain develops and changes. Domain concepts are generally used frequently within their own domain and rarely in other domains.
Depending on whether it consists of two or more words, a domain concept is either single-word or compound. Most existing research targets compound domain concepts, and little research addresses single-word domain concepts specifically. Existing single-word domain concept extraction methods generally suffer from low accuracy and a narrow choice of features: researchers often rely on only one or two features to separate domain concepts from non-domain concepts, so noise is discriminated poorly. In addition, feature weights and thresholds are not set in a principled way. Suitable values usually have to be chosen from the results of repeated experiments, which requires considerable manual intervention, and when the corpus size changes the weights and thresholds must be modified accordingly, so portability is poor. The extraction of single-word domain concepts therefore urgently needs improvement.
Neural networks are a mature class of machine learning methods. They offer a practical and effective way to learn real-valued or vector-valued functions from input data, and they are robust to noise in the data. Neural networks are therefore well suited to learning the mapping between the multidimensional feature vectors of single-word domain concepts and the corresponding classes. A neural network with multiple hidden layers has greater expressive power, and deep learning is used mainly to solve the problem of training neural networks with many hidden layers.
Summary of the invention
The purpose of the present invention is to provide a deep-learning-based domain concept extraction method that addresses the weak learning ability of traditional unsupervised methods and the poor quality of domain concept extraction. The method converts domain concept extraction into a binary classification problem, uses richer statistical features, and combines deep learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a traditional neural network model is then used for supervised fine-tuning. Compared with the KNN and SVM models, the deep network model trained in this way achieves the highest F value on the test data set.
The specific technical solution that achieves the object of the invention is as follows:
A domain concept extraction method based on deep learning, comprising the following steps:
a) Training stage
Positive and negative samples are first extracted from the training corpus and labeled. Features are then extracted from the positive and negative samples, using the training corpus together with a background corpus, to construct the feature vector set. Finally, the feature vector set and the corresponding labels are used to train a deep network (DN) model in the MATLAB deep learning toolbox environment.
b) Test stage
The goal is to use the DN model obtained in the training stage to check how well the test corpus is classified. Candidate extraction and feature extraction are first performed on the test corpus in turn to construct the feature vector set. The feature vector set is then input to the DN model, which automatically judges and recognizes the feature vectors, thereby classifying the candidates in the test corpus. Finally, the correct domain concept set is obtained from the classification results and manual review.
The feature vector set is constructed from the following features:
1) term frequency (TF);
2) document frequency (DF);
3) inverse document frequency (IDF);
4) word length (LEN);
5) term frequency variance (TV);
6) domain consistency (DC).
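As a rough illustration of how the six features might be computed for one candidate term, the following Python sketch works on tokenized documents. The text does not give formulas, so everything here is an assumption: the function name `term_features`, the smoothed base-2 IDF computed over the background corpus, the population variance of per-document counts for TV, and the entropy-based definition of DC are hypothetical choices, not the patented definitions.

```python
import math
from statistics import pvariance


def term_features(term, domain_docs, background_docs):
    """Return [TF, DF, IDF, LEN, TV, DC] for one candidate term.

    domain_docs and background_docs are lists of tokenized documents
    (each document is a list of strings). All formulas are assumptions.
    """
    # Occurrences of the term in each domain (training) document.
    counts = [doc.count(term) for doc in domain_docs]

    tf = sum(counts)                          # TF: total occurrences
    df = sum(1 for c in counts if c > 0)      # DF: documents containing the term

    # IDF: assumed to be measured against the background corpus,
    # with add-one smoothing so the value is never negative or undefined.
    bg_df = sum(1 for doc in background_docs if term in doc)
    idf = math.log2((1 + len(background_docs)) / (1 + bg_df))

    length = len(term)                        # LEN: characters in the term

    # TV: assumed to be the population variance of per-document counts.
    tv = pvariance(counts) if counts else 0.0

    # DC: assumed to be the entropy of the term's distribution
    # over the domain documents (higher = more evenly spread).
    if tf:
        dc = -sum((c / tf) * math.log2(c / tf) for c in counts if c > 0)
    else:
        dc = 0.0

    return [tf, df, idf, length, tv, dc]


if __name__ == "__main__":
    domain = [["radar", "missile", "radar"], ["missile"], ["radar"]]
    background = [["weather"], ["radar"], ["news"]]
    print(term_features("radar", domain, background))
```

The resulting six-element list has the same layout as the feature vectors used in the rest of this document, assuming the features are stored in the order in which they are listed.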
Training the deep network (DN) model in step a) specifically comprises:
i) Constructing a deep belief network (Deep Belief Nets, DBN) by unsupervised learning, using only the feature vectors of the training data.
A feature vector is passed to the input layer, and the first-layer restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) is trained. The parameters of the first-layer RBM are then fixed, its output is used as the input of the second-layer RBM, and the second-layer RBM is trained. Similarly, the parameters of the first two RBM layers are fixed, and the third-layer RBM is trained on the output of the second-layer RBM. Once all feature vectors have been learned, the training of the whole DBN is complete.
ii) Initializing the DN with the parameters of the DBN, then fine-tuning the DN parameters in a supervised manner with the back-propagation algorithm, according to the class labels of the training samples. The parameter adjustment of this second part ends when a certain number of iterations has been performed or the error has dropped into the range 0.001 to 0.005. This completes the training stage of the DN model.
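The greedy layer-by-layer pre-training described above can be sketched in NumPy as follows. This is a minimal sketch, not the MATLAB toolbox implementation mentioned in the text: it assumes features scaled to [0, 1], one-step contrastive divergence (CD-1) as the RBM training rule, and hypothetical names (`RBM`, `pretrain_dbn`), learning rate, and layer sizes, none of which are specified in the source.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class RBM:
    """Bernoulli restricted Boltzmann machine trained with CD-1 (assumed rule)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)


def pretrain_dbn(X, layer_sizes, epochs=20, seed=0):
    """Train one RBM per entry of layer_sizes, each on the output of the last."""
    rng = np.random.default_rng(seed)
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(data)
        rbms.append(rbm)
        # The trained layer is now fixed; its output feeds the next RBM.
        data = rbm.hidden_probs(data)
    return rbms


if __name__ == "__main__":
    X = np.random.default_rng(1).random((8, 6))   # 8 samples, 6 features
    stack = pretrain_dbn(X, [10, 8, 4])
    print([r.W.shape for r in stack])
```

For six-dimensional feature vectors, a call such as `pretrain_dbn(X, [10, 8, 4])` stacks three RBMs as in the description; the layer widths are illustrative only.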
The classification of candidates in the test corpus in step b) treats domain concept extraction as binary classification, with the classes "domain concept" and "non-domain concept". From the output value of the DN model, the co-occurrence probability p(x, y) of candidate feature x and class y is obtained and used to measure the confidence that a candidate concept with feature x belongs to class y. Here x is the feature vector of the candidate concept, and y is one of the two classes "domain concept" and "non-domain concept". The classifier obtained from the training corpus is used to determine the class of each candidate concept in the test data set automatically.
The invention provides a deep-learning-based domain concept extraction method, comprising the classification problem in domain concept extraction and a proposed deep-learning domain concept extraction algorithm. For the extraction of single-word domain concepts, the method recognizes domain terms better than a traditional neural network model, the classical KNN model, and the SVM model on the same experimental data set.
The invention combines deep learning with the domain concept extraction task: a deep belief network is built for unsupervised pre-training, and a traditional neural network model is then used for supervised fine-tuning. The deep network model trained in this way achieves higher accuracy on the test data set while maintaining a certain recall, and its overall recognition performance is the best.
With the invention, extraction results for single-word domain concepts can be obtained effectively using deep learning, which benefits research in information retrieval, machine translation, ontology learning, and related areas.
Brief description of the drawings
Fig. 1 is flow chart of the invention;
Fig. 2 is training flow chart of the invention;
Fig. 3 is test flow chart of the invention;
Fig. 4 is the deep network architecture diagram of the invention;
Fig. 5 is a comparison chart of experimental metrics for the different classification models.
Detailed description of embodiments
The invention is a deep-learning-based domain concept extraction method comprising classification in domain concept extraction and deep-learning-based domain concept extraction. Classification in domain concept extraction treats extraction as binary classification, with the two classes "domain concept" and "non-domain concept". Following the idea of machine learning, features are collected from training samples, a classifier is constructed, and the classifier automatically determines the class of candidate concepts in the test data set. Specifically, classification estimates the co-occurrence probability p(x, y) of candidate concept feature x and class y, which measures the confidence that a candidate concept with feature x belongs to class y. Here x is the feature vector of the candidate concept, and y is one of the two classes "domain concept" and "non-domain concept".
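One way to turn the network output into a class decision with a confidence value is sketched below. It assumes, hypothetically, a two-unit output layer whose raw scores are normalized with a softmax; the source does not say how p(x, y) is derived from the output value. Strictly, a softmax yields the conditional probability p(y | x), which for a fixed x ranks the two classes in the same order as the joint probability p(x, y) = p(y | x) p(x).

```python
import math

# Order of the two output units is an assumption for this sketch.
LABELS = ("domain concept", "non-domain concept")


def classify(scores):
    """Map two raw output scores to (label, confidence) via a softmax."""
    m = max(scores)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    print(classify([2.0, 0.0]))
```

A candidate whose confidence is close to 0.5 is a borderline case and is a natural one to prioritize in the manual review step.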
Deep-learning-based domain concept extraction (Deep Learning based Domain Concept Extraction Algorithm, DLDoC) is divided overall into a training stage and a test stage. As shown in Fig. 1, the training module first learns a deep network (Deep Nets, DN) model from the training data; the test module then uses the DN model trained in the previous step to classify and recognize the test data automatically. The classification results are reviewed manually to obtain the final correct domain concept set. The specific steps are as follows:
i) Training stage: the training stage builds the deep network model. As shown in Fig. 2, positive and negative samples are first extracted from the training corpus and labeled; features are then extracted from these samples, using the training corpus together with the background corpus, to construct the feature vector set; finally, the model is trained on the feature vector set and the corresponding label data. The whole training process can be understood as a mapping from the training corpus to the model that passes in turn through the sample space and the feature space.
ii) Test stage: the test stage uses the DN model obtained in the preceding training process to check the recognition effect on the test data set. As shown in Fig. 3, and similarly to training, candidate extraction and feature extraction are first performed on the test corpus in turn to construct the feature vector set; the feature vector set is then input to the DN model, which automatically judges and recognizes the feature vectors and thereby classifies the candidates; finally, the classification results are compared with manual annotations to compute the overall recognition effect.
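The overall recognition effect is reported in this document as accuracy, recall, and an F value. A minimal sketch of such an evaluation follows, assuming the standard precision, recall, and balanced F1 definitions over the set of extracted concepts and the set of manually annotated concepts; the function name and the example terms are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare extracted concepts with manually annotated ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    extracted = {"radar", "missile", "weather"}
    annotated = {"radar", "missile", "tank", "commander"}
    print(precision_recall_f1(extracted, annotated))
```

In this example two of the three extracted terms are correct and two of the four annotated terms are found, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7.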
Constructing the feature vector set:
With reference to the TF-IDF method used by most researchers, the invention chooses the following features:
1) term frequency (TF);
2) document frequency (DF);
3) inverse document frequency (IDF);
4) word length (LEN);
5) term frequency variance (TV);
6) domain consistency (DC).
The structure of the DN model is shown in Fig. 4:
i) A deep belief network (Deep Belief Nets, DBN) is constructed by unsupervised learning, using only the feature vectors of the training data. A feature vector is passed to the input layer, and the first-layer RBM is trained. The parameters of the first-layer RBM are then fixed, its output is used as the input of the second-layer RBM, and the second-layer RBM is trained. Similarly, the parameters of the first two RBM layers are fixed, and the third-layer RBM is trained on the output of the second-layer RBM. Once all feature vectors have been learned, the training of the whole DBN is complete.
ii) The DN is initialized with the parameters of the DBN and then fine-tuned in a supervised manner with the back-propagation algorithm, according to the class labels of the training samples. The parameter adjustment of this second part ends when a certain number of iterations has been performed or the error has dropped into the range 0.001 to 0.005. This completes the training of the DN model, which can then be used to predict the class of unknown samples.
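The supervised fine-tuning step and its stopping rule can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: sigmoid units, a mean-squared-error measure, full-batch gradient descent, and a tolerance of 0.005 taken from the upper end of the error range quoted above; the function name `fine_tune` and the learning rate are hypothetical. In the method described, the hidden-layer weights would come from DBN pre-training; here they are simply passed in.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fine_tune(weights, biases, X, y, lr=0.5, max_iters=5000, tol=0.005):
    """Back-propagation fine-tuning, updating weights and biases in place.

    weights/biases: one array per layer (pre-trained hidden layers plus an
    output layer). y: column vector of 0/1 labels.
    Stops after max_iters iterations or once the error is at most tol.
    Returns (iterations run, last measured mean squared error).
    """
    for it in range(1, max_iters + 1):
        # Forward pass, keeping every layer's activations.
        acts = [X]
        for W, b in zip(weights, biases):
            acts.append(sigmoid(acts[-1] @ W + b))

        err = acts[-1] - y
        mse = float(np.mean(err ** 2))
        if mse <= tol:
            break

        # Backward pass: gradient of half the mean squared error.
        delta = err * acts[-1] * (1.0 - acts[-1])
        for i in reversed(range(len(weights))):
            grad_W = acts[i].T @ delta / len(X)
            grad_b = delta.mean(axis=0)
            if i > 0:
                # Propagate the error before this layer's weights change.
                delta = (delta @ weights[i].T) * acts[i] * (1.0 - acts[i])
            weights[i] -= lr * grad_W
            biases[i] -= lr * grad_b
    return it, mse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [1.0]])
    weights = [rng.normal(0, 0.5, (2, 3)), rng.normal(0, 0.5, (3, 1))]
    biases = [np.zeros(3), np.zeros(1)]
    print(fine_tune(weights, biases, X, y))
```

Either stopping condition ends the loop, which mirrors the "certain number of iterations or error within range" rule in the text.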
Embodiment
The invention is further described below with reference to the drawings, taking military-domain material as an example.
Referring to Fig. 1, samples are first extracted from the training corpus, features are extracted from the samples, and feature vectors are selected to obtain the trained DN model; the DN model then classifies and recognizes the test data automatically. The correct domain concept set is finally obtained from the classification results by manual review.
In this embodiment, as shown in Fig. 2, the training corpus is converted to the sample space, and the invention constructs feature vectors from the features listed above. Table 1 lists the feature values of some of the training samples that the invention extracts from the military-domain material.
Table 1. Features of some military-domain training samples
Model training uses the positive and negative sample sets obtained in the first two steps, together with the corresponding feature vector sets, to learn the relationship between feature vectors and sample labels and to train the deep network (DN) model. For each sample, the model performs the mapping from feature vector to label; that is, the parameters of the DN model are obtained.
In this embodiment, as shown in Fig. 3, the test sample "commandant" has the feature vector 29 6 4.8078 2 208.9667 1.4144. In testing, this feature vector is identified as a positive example, which illustrates that the DN model recognizes the sample set well.
The invention also compares the DN model, built by combining a DBN with a neural network, against the traditional KNN and SVM models. As shown in Fig. 5, the DBN+NN model pre-trained with a DBN obtains relatively good and stable accuracy, exceeding the KNN and SVM models by 13.05 and 23.09 percentage points respectively. On the F value, which reflects overall performance, the DBN+NN model built by the invention obtains the highest value, 2.53 percentage points above the SVM model; the F values of the basic NN2 model and the KNN model differ little.
Claims (3)
1. A domain concept extraction method based on deep learning, characterized in that the method comprises the following steps:
a) Training stage
Positive and negative samples are first extracted from the training corpus and labeled; features are then extracted from the positive and negative samples, using the training corpus together with a background corpus, to construct a feature vector set; finally, the feature vector set and the corresponding labels are used to train a deep network (DN) model in the MATLAB deep learning toolbox environment, wherein the DN model is trained according to the following steps:
i) a deep belief network (DBN) is constructed by unsupervised learning, using only the feature vectors of the training data;
a feature vector is passed to the input layer and the first-layer restricted Boltzmann machine (RBM) is trained; the parameters of the first-layer RBM are then fixed, its output is used as the input of the second-layer RBM, and the second-layer RBM is trained; similarly, the parameters of the first two RBM layers are fixed and the third-layer RBM is trained on the output of the second-layer RBM; once all feature vectors have been learned, the training of the whole DBN is complete;
ii) the DN is initialized with the parameters of the DBN, and the DN parameters are then fine-tuned in a supervised manner with the back-propagation algorithm according to the class labels of the training samples; the parameter adjustment of this second part ends when a certain number of iterations has been performed or the error has dropped into the range 0.001 to 0.005, completing the training stage of the DN model;
b) Test stage
Candidate extraction and feature extraction are first performed on the test corpus in turn to construct a feature vector set; the feature vector set is then input to the DN model, which automatically judges and recognizes the feature vectors, thereby classifying the candidates of the test corpus; finally, the correct domain concept set is obtained from the classification results and manual review.
2. The method according to claim 1, characterized in that the feature vector set is constructed from the following features:
Term frequency (TF);
Document frequency (DF);
Inverse document frequency (IDF);
Word length (LEN);
Term frequency variance (TV);
Domain consistency (DC).
3. The method according to claim 1, characterized in that the classification of the candidates of the test corpus in step b) treats domain concept extraction as binary classification, dividing candidate concepts into the two classes domain concept and non-domain concept; from the output value of the DN model, the co-occurrence probability p(x, y) of candidate feature x and class y is obtained, and p(x, y) is used to measure the confidence that a candidate concept with feature x belongs to class y; x is the feature vector of the candidate concept, and y is one of the two classes domain concept and non-domain concept; the classifier obtained from the training corpus is used to determine the class of each candidate concept in the test data set automatically.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410259300.1A CN104035996B (en) | 2014-06-11 | 2014-06-11 | Field concept abstracting method based on Deep Learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104035996A CN104035996A (en) | 2014-09-10 |
CN104035996B true CN104035996B (en) | 2017-06-16 |
Family
ID=51466766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410259300.1A Expired - Fee Related CN104035996B (en) | 2014-06-11 | 2014-06-11 | Field concept abstracting method based on Deep Learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104035996B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055560A (en) * | 2016-05-18 | 2016-10-26 | 上海申腾信息技术有限公司 | Method for collecting data of word segmentation dictionary based on statistical machine learning method |
CN106228980B (en) * | 2016-07-21 | 2019-07-05 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
CN106686403B (en) * | 2016-12-07 | 2019-03-08 | 腾讯科技(深圳)有限公司 | A kind of video preview drawing generating method, device, server and system |
CN106599577A (en) * | 2016-12-13 | 2017-04-26 | 重庆邮电大学 | ListNet learning-to-rank method combining RBM with feature selection |
CN106650806B (en) * | 2016-12-16 | 2019-07-26 | 北京大学深圳研究生院 | A kind of cooperating type depth net model methodology for pedestrian detection |
CN106980873B (en) * | 2017-03-09 | 2020-07-07 | 南京理工大学 | Koi screening method and device based on deep learning |
CN107679859B (en) * | 2017-07-18 | 2020-08-25 | 中国银联股份有限公司 | Risk identification method and system based on migration deep learning |
CN108959375A (en) * | 2018-05-24 | 2018-12-07 | 南京网感至察信息科技有限公司 | A kind of rule-based Knowledge Extraction Method with deep learning |
CN109543046A (en) * | 2018-11-16 | 2019-03-29 | 重庆邮电大学 | A kind of robot data interoperability Methodologies for Building Domain Ontology based on deep learning |
CN109597946B (en) * | 2018-12-05 | 2022-04-12 | 国网江西省电力有限公司信息通信分公司 | Bad webpage intelligent detection method based on deep belief network algorithm |
CN109871896B (en) * | 2019-02-26 | 2022-03-25 | 北京达佳互联信息技术有限公司 | Data classification method and device, electronic equipment and storage medium |
CN114626520A (en) * | 2022-03-01 | 2022-06-14 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for training model |
CN115357691B (en) * | 2022-10-21 | 2023-04-07 | 成都数之联科技股份有限公司 | Semantic retrieval method, system, equipment and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
CN101739430A (en) * | 2008-11-21 | 2010-06-16 | 中国科学院计算技术研究所 | Method for training and classifying text emotion classifiers based on keyword |
CN103365997A (en) * | 2013-07-12 | 2013-10-23 | 华东师范大学 | Opinion mining method based on ensemble learning |
CN103793510A (en) * | 2014-01-29 | 2014-05-14 | 苏州融希信息科技有限公司 | Classifier construction method based on active learning |
Non-Patent Citations (1)
Title |
---|
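Opinion sentence recognition in Chinese microblogs with multiple classifiers based on evidence theory (基于证据理论的多分类器中文微博观点句识别); Guo Yunlong et al.; Computer Engineering (《计算机工程》); 2014-04-30; Vol. 40, No. 4; pp. 159-163, 169 * |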
Also Published As
Publication number | Publication date |
---|---|
CN104035996A (en) | 2014-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104035996B (en) | Field concept abstracting method based on Deep Learning | |
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN108984745B (en) | Neural network text classification method fusing multiple knowledge maps | |
Wang et al. | Research on Web text classification algorithm based on improved CNN and SVM | |
CN109271506A (en) | A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning | |
CN110704624B (en) | Geographic information service metadata text multi-level multi-label classification method | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN103955702A (en) | SAR image terrain classification method based on depth RBF network | |
CN106095872A (en) | Answer sort method and device for Intelligent Answer System | |
CN106779087A (en) | A kind of general-purpose machinery learning data analysis platform | |
CN110942091B (en) | Semi-supervised few-sample image classification method for searching reliable abnormal data center | |
CN107451278A (en) | Chinese Text Categorization based on more hidden layer extreme learning machines | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN102662931A (en) | Semantic role labeling method based on synergetic neural network | |
CN106294344A (en) | Video retrieval method and device | |
CN107947921A (en) | Based on recurrent neural network and the password of probability context-free grammar generation system | |
CN105609116B (en) | A kind of automatic identifying method in speech emotional dimension region | |
CN106570521A (en) | Multi-language scene character recognition method and recognition system | |
CN111046179A (en) | Text classification method for open network question in specific field | |
CN104077598B (en) | A kind of emotion identification method based on voice fuzzy cluster | |
CN105260746B (en) | A kind of integrated Multi-label learning system of expansible multilayer | |
CN104091181A (en) | Injurious insect image automatic recognition method and system based on deep restricted Boltzmann machine | |
CN110298434A (en) | A kind of integrated deepness belief network based on fuzzy division and FUZZY WEIGHTED | |
CN103020167A (en) | Chinese text classification method for computer | |
CN105701225A (en) | Cross-media search method based on unification association supergraph protocol |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170616 Termination date: 20210611 |