CN106156029A - Ensemble-learning-based multi-label classification method for imbalanced virtual asset data - Google Patents
- Publication number: CN106156029A
- Application number: CN201510130968.0A
- Authority: CN (China)
- Prior art keywords: data, label, tag, sample, virtual assets
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Classification area: Information Retrieval, DB Structures and FS Structures Therefor
Abstract
The invention discloses an ensemble-learning-based multi-label classification method for imbalanced virtual asset data, comprising the following steps: under the distributed storage architecture for virtual assets, the virtual asset data are first randomly sampled with replacement; a feedforward neural network is then used to learn the multi-label data, so that the correlations between labels are captured implicitly in the trained connection weights of the network; meanwhile, depending on the distribution of labels in the sampled data, SMOTE oversampling is applied selectively; finally, to improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Compared with the prior art, the invention takes Bagging, a classic ensemble learning algorithm, as its framework and, in view of the characteristics of imbalanced virtual asset data, integrates the feedforward neural network and the SMOTE sampling technique into that framework, which can effectively improve classification accuracy.
Description
Technical field
The present technology belongs to the field of network and information security and relates to an ensemble-learning-based multi-label classification method for imbalanced virtual asset data.
Background art
The rapid development of the Internet has provided a broad platform for the creation and trading of virtual assets and has promoted the prosperity of online transactions. However, both users and the providers of virtual asset trading face the problem that virtual asset data (including virtual asset commodity information, the related transaction data, and virtual asset operation logs) are numerous and disorganized. Classifying these data helps people manage virtual assets better and use them more efficiently.
At present, China has carried out technical research on eID-based management and preservation of virtual assets in cyberspace, realizing unified and standardized management of virtual assets. The virtual asset preservation system records the virtual asset commodities themselves and the various operation data related to them comprehensively and accurately. These data, however, are of many kinds: the information attached to different virtual assets differs, and the operation behaviour of users varies even more, so classifying these data faces many difficulties. In addition, the data volume differs considerably between classes. For example, abnormal transaction data are usually far fewer than normal transaction data, and abnormal transactions themselves take several forms, such as abnormal transaction time, abnormal transaction amount, and abnormal transaction frequency. Different anomalies may occur together, that is, one virtual asset record may belong to several classes or carry several labels at the same time. Given both multiple labels and imbalanced data volume between classes, classifying virtual asset data faces many challenges.
In a traditional classification problem, each sample belongs to exactly one class, or carries exactly one label; this is single-label learning. As noted above, however, many samples in virtual asset data belong to several classes at once. Such problems are multi-label learning problems. Formally, assume a data set $D = \{x_1, x_2, \ldots, x_n\}$ containing n samples, each with d attributes, and a label set $L = \{l_1, l_2, \ldots, l_m\}$ of m labels. The multi-label classification problem is then: given training data whose class labels are known, i.e. given $\{(x_i, L_i)\}$ with $L_i \subseteq L$, construct a classifier $f: D \to 2^{L}$ that correctly labels a test set whose attributes are known but whose labels are unknown. When all training and test data satisfy $|L_i| = 1$, the multi-label classification problem degenerates into a multi-class classification problem. In particular, when in addition m = 2, it degenerates into the traditional two-class classification problem.
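The definition above can be made concrete with a small sketch. It encodes each label set $L_i$ as a 0/1 row over the label list and checks the $|L_i| = 1$ condition. The label names and the helper `to_indicator_matrix` are hypothetical, chosen only to echo the anomaly types mentioned in the background; they are not taken from the patent.

```python
from typing import List, Set


def to_indicator_matrix(label_sets: List[Set[str]], labels: List[str]) -> List[List[int]]:
    """Encode each sample's label set L_i as a 0/1 row over the label list L."""
    return [[1 if label in s else 0 for label in labels] for s in label_sets]


# Hypothetical label set L (m = 3) and three samples' label sets L_i.
labels = ["time_anomaly", "amount_anomaly", "frequency_anomaly"]
label_sets = [{"time_anomaly"}, {"time_anomaly", "amount_anomaly"}, set()]

Y = to_indicator_matrix(label_sets, labels)

# The problem reduces to multi-class classification only if every |L_i| == 1.
is_single_label = all(len(s) == 1 for s in label_sets)
```

Here the second sample carries two labels, so `is_single_label` is false and the data are genuinely multi-label.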
In recent years, multi-label classification has received extensive attention and study. Its solutions fall mainly into algorithm adaptation based methods and problem transformation based methods [1].
Algorithm adaptation based methods modify existing single-label classification algorithms so that they can handle multi-label data. Their advantage is that, for a specific practical problem, an approach focused on a particular algorithm tends to outperform algorithm-independent approaches. This family mainly includes decision trees, support vector machines (SVMs), k-nearest neighbours (kNN), and neural networks. Clare et al. [2] used decision trees to solve the multi-label classification problem, extending the definition of entropy from single-label to multi-label classification. Elisseeff et al. [3] proposed a kernel-based SVM method for multi-label classification. Zhang et al. [4] proposed MLkNN on the basis of the traditional kNN method: the prior probability of each label is computed first; then, for an input x to be classified and each label l in the label set L, the probabilities that x has and does not have label l are computed, and these are combined to predict whether x should be given label l. On this basis, Zhang et al. [5] further proposed the multi-label lazy learning classification algorithm ML-kNN. Zhang et al. [6] defined a global optimization function for multi-label data so that artificial neural networks can handle such data; the basic idea is that if many samples carry two labels together, then when one of the two labels occurs the other is likely to occur as well. The drawback of these methods is that they require a fairly deep understanding of the algorithm and of the application domain, which is difficult for ordinary users, so their range of application is limited and they are hard to popularize.
Problem transformation based methods, by contrast, convert the multi-label classification problem into a group of single-label classification problems and then solve it with existing single-label methods. Because they can directly use the many mature existing methods and are easy to understand, they are widely applied. The typical method of this kind is Binary Relevance (BR), which treats the prediction of each label as an independent single-label classification problem, trains one independent classifier per label, and trains each classifier on the whole training data. BR ignores the relations between labels, however, and its results are still unsatisfactory. Shen et al. [7] improved BR by copying and weighted copying: each multi-label record in the original training set is split into several single-label records, each given a corresponding weight. Schapire and Singer [8] proposed BoosTexter, an AdaBoost-based method for multi-label text classification that gives larger weights to samples that are harder to classify in each round. McCallum [9] proposed a Bayesian method for multi-label classification trained with expectation maximization (EM). Ueda and Saito [10] proposed two generative models for multi-label classification, PMM1 and PMM2. Read [11] proposed the LP (Label Powerset) method, which binary-encodes every label combination in the training data to form new labels; the method thereby converts multi-label data into single-label data, but its algorithmic complexity is high. Schapire and Singer [12] proposed AdaBoost.MH for multi-label data, based on the single-label method AdaBoost.M1. The algorithm generates q new single-label training records for each multi-label training record, which increases the amount of training data and hence the training time. All of these methods involve only a single algorithm. Tsoumakas et al. [13], after summarizing existing algorithms, proposed a two-layer multi-label classification model: the first layer uses BR, decision trees, or SVMs with K-fold cross training, and the second layer classifies further with BR or SVM, achieving good results.
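The two transformation ideas described above, Binary Relevance and Label Powerset, can be sketched as data transformations. This is a minimal illustration under the assumption that a sample is a pair of a feature vector and a set of labels; the function names are hypothetical and no classifier is trained.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Sample = Tuple[List[float], Set[str]]


def binary_relevance(data: List[Sample], labels: List[str]) -> Dict[str, List[Tuple[List[float], int]]]:
    """Build one binary data set per label: the target is 1 if the sample carries that label."""
    return {label: [(x, int(label in ls)) for x, ls in data] for label in labels}


def label_powerset(data: List[Sample]) -> Tuple[List[Tuple[List[float], int]], Dict[FrozenSet[str], int]]:
    """Map every distinct label combination to one new single-label class id."""
    class_ids: Dict[FrozenSet[str], int] = {}
    transformed = []
    for x, ls in data:
        key = frozenset(ls)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # next unused class id
        transformed.append((x, class_ids[key]))
    return transformed, class_ids


data = [([0.1], {"a"}), ([0.2], {"a", "b"}), ([0.3], {"a"})]
br = binary_relevance(data, ["a", "b"])
lp, ids = label_powerset(data)
```

Binary Relevance yields one independent problem per label, which is why it loses label correlations; Label Powerset keeps the combinations but the number of classes can grow with every new combination seen.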
In the studies above, when a multi-label problem is decomposed into a group of single-label problems, the labels are simply assumed to be independent of one another: each training run works on single-label data, and the correlations between labels are seldom considered. In real virtual asset data, correlations between labels are widespread and varied. Moreover, most existing algorithms assume that the classes have roughly equal numbers of samples, whereas in many data sets, virtual asset data included, imbalance between classes (labels) is common. Finally, some studies train on multi-label data with only one or a few classifiers, which limits generalization performance.
[1] Tsoumakas G, Katakis I. Multi-label classification: An overview[J]. International Journal of Data Warehousing and Mining (IJDWM), 2007, 3(3): 1-13.
[2] Clare A, King R D. Knowledge discovery in multi-label phenotype data[M]//Principles of data mining and knowledge discovery. Springer Berlin Heidelberg, 2001: 42-53.
[3] Elisseeff A, Weston J. A kernel method for multi-labelled classification[C]//Advances in neural information processing systems, 2001: 681-687.
[4] Zhang M L, Zhou Z H. A k-nearest neighbor based algorithm for multi-label classification[C]//Granular Computing, 2005 IEEE International Conference on. IEEE, 2005, 2: 718-721.
[5] Zhang M L, Zhou Z H. ML-KNN: A lazy learning approach to multi-label learning[J]. Pattern recognition, 2007, 40(7): 2038-2048.
[6] Zhang M L, Zhou Z H. Multilabel neural networks with applications to functional genomics and text categorization[J]. Knowledge and Data Engineering, IEEE Transactions on, 2006, 18(10): 1338-1351.
[7] Shen X, Boutell M, Luo J, et al. Multilabel machine learning and its application to semantic scene classification[C]//Electronic Imaging 2004. International Society for Optics and Photonics, 2003: 188-199.
[8] Schapire R E, Singer Y. BoosTexter: a boosting-based system for text categorization[J]. Machine Learning, 2000, 39(2-3): 135-168.
[9] McCallum A. Multi-label text classification with a mixture model trained by EM[C]//AAAI’99 Workshop on Text Learning, 1999: 1-7.
[10] Ueda N, Saito K. Parametric mixture models for multi-label text[C]//Advances in neural information processing systems, 2002: 721-728.
[11] Read J. A pruned problem transformation method for multi-label classification[C]//Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), 2008: 143-150.
[12] Schapire R E, Singer Y. Improved boosting algorithms using confidence-rated predictions[J]. Machine learning, 1999, 37(3): 297-336.
[13] Tsoumakas G, Dimou A, Spyromitros E, et al. Correlation-based pruning of stacked binary relevance models for multi-label learning[C]//Proceedings of the 1st International Workshop on Learning from Multi-Label Data, 2009: 101-116.
Summary of the invention
To address the technical problems above, the invention proposes an ensemble-learning-based multi-label classification method for imbalanced virtual asset data. It is suitable for classifying the numerous and disorganized virtual asset data on the Internet, and is particularly suited to multi-label virtual asset data that are imbalanced between classes.
The technical solution of the invention includes: a description of the virtual asset data storage architecture, the processing of multi-label imbalanced virtual asset data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, comprising the organization and management of massive multi-structured data, query processing over massive multi-structured data, and service publishing and programming interfaces.
The underlying system architecture is deployed on a conventional distributed computing environment, and a distributed file system provides transparent access to the file data on each node of that environment. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file system and data; this unified management of files and data is carried out by the data organization and data management modules. The subsystem also covers the deployment and configuration management of the different data and files in the underlying distributed computing environment.
The query processing subsystem for massive multi-structured data is oriented to retrieval applications over massive personal identity and attribute information. It supports efficient query processing of multi-structured data and includes modules such as the complex data model and the hybrid data operation mode. The invention mainly targets the log analysis and mining module in this subsystem, and aims to use data mining techniques to detect abnormal behaviour in virtual asset transactions quickly and efficiently.
The service publishing, customization, and programming interface subsystem is the external interface of the system. It defines programmatic interfaces to the data in a service-oriented manner, supporting SQL queries on structured data and API and SQL-like queries on unstructured data. It also lets users customize the personal information query service interface by way of service interface customization. The invention can likewise use the data access interfaces provided by the system to query and analyse virtual asset transaction data. When applying the invention in practice, one may mine and analyse the logs, query and analyse data through the data interface, or combine the two, choosing the most suitable approach for the problem at hand.
2. Processing of multi-label imbalanced virtual asset data and construction of the classifier
In multi-label classification, the kinds of association between labels are varied. Finding all combination rules is computationally expensive, the resulting classifier generalizes poorly, and its effect in practical use cannot be guaranteed. It is therefore crucial to use a classifier that does not need to state the label association rules explicitly yet learns well both the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the invention uses a neural network for training. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to address the imbalance of virtual asset data, in each round of ensemble learning the data are sampled according to the imbalance ratio between the classes. The specific steps are as follows:
(1) Feedforward neural network
The invention uses a typical multi-layer feed-forward neural network, consisting of an input layer, a hidden layer, and an output layer, each containing a number of neurons. The input layer receives the external input data, and the output layer outputs the final result. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, and neurons within the same layer are generally not connected to each other. The network is trained with the objective of minimizing the classification error. The invention uses the typical gradient descent algorithm to update the connection weights and biases.
The invention takes minimizing the classification error as its objective; the global error is computed by formula 1, where m is the number of samples in the training set and $E_i$ is the error of the i-th training sample, computed as shown in formula 2. In formula 2, k is the number of labels, and the error is taken between the actual output of the neural network and the desired output.
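For orientation, one common form consistent with these definitions is the sum-of-squares error. The symbols $E$, $o_j^{i}$ (actual output of output neuron j for sample i) and $d_j^{i}$ (desired output) are assumptions introduced for illustration; the patent's own formulas 1 and 2 may differ, for example by a constant factor.

```latex
% Assumed sum-of-squares form, not quoted from the patent
E = \sum_{i=1}^{m} E_i, \qquad
E_i = \sum_{j=1}^{k} \left( o_j^{i} - d_j^{i} \right)^{2}
```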
The specific parameters are set as follows:
The number of input-layer neurons is the number of attributes of the virtual asset data, i.e. each input-layer neuron corresponds to one virtual asset data attribute;
The number of output-layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden-layer neurons is determined with the Baum-Haussler rule [1], computed as shown in formula 3.
$N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})$ (3)
where $N_{hidden}$ is the number of hidden-layer neurons, $N_{train}$ is the number of training samples, $E_{tolerance}$ is the upper bound on the error acceptable for the neural network, and $N_{inputs}$ and $N_{outputs}$ are the numbers of input and output neurons respectively.
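Formula 3 is a simple bound and can be evaluated directly. The sketch below is illustrative: the function name and the example numbers are hypothetical, and flooring the bound and enforcing at least one hidden neuron are assumptions, since the text only gives the inequality.

```python
import math


def max_hidden_neurons(n_train: int, error_tolerance: float, n_inputs: int, n_outputs: int) -> int:
    """Largest integer N_hidden satisfying formula 3 (assumed: at least one neuron)."""
    bound = (n_train * error_tolerance) / (n_inputs + n_outputs)
    return max(1, math.floor(bound))


# Hypothetical setting: 10000 training samples, 10% tolerated error,
# 20 attributes (input neurons) and 5 labels (output neurons).
n_hidden = max_hidden_neurons(10000, 0.1, 20, 5)
```

With these numbers the bound is 1000 / 25, so at most 40 hidden neurons are allowed.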
The activation function is the conventional Sigmoid function, whose form is given by formula 4:
$f(x) = 1 / (1 + e^{-x})$ (4)
Learning terminates either when the maximum number of learning iterations is reached or when the error difference between two successive iterations falls below a set threshold; the condition can be chosen flexibly according to the actual situation.
The output of the neural network is continuous, whereas the classes of virtual asset data are discrete. The output is therefore converted into the discrete value 0 or 1 by a threshold, as shown in formula 5: the result is 1 if out > c and 0 otherwise, where out is the raw output of the neural network and c is a threshold set manually on the basis of experiments. Only when the raw output exceeds the set threshold is the test sample given the corresponding label.
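The activation and thresholding steps can be sketched as follows. The function names are hypothetical and the default threshold of 0.5 is an assumption for illustration; the text only says that c is set from experiments. The sigmoid is written in a numerically stable form to avoid overflow for large negative inputs.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Sigmoid 1 / (1 + e^-x), evaluated without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(outputs: List[float], c: float = 0.5) -> List[int]:
    """Assign a label (1) only when the raw output strictly exceeds the threshold c."""
    return [1 if out > c else 0 for out in outputs]


raw = [sigmoid(v) for v in (-2.0, 0.0, 3.0)]
decisions = binarize(raw)  # c = 0.5 is an assumed threshold
```

Note that an output exactly equal to c does not receive the label, matching the "more than the threshold" wording.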
(2) SMOTE sampling method
SMOTE (Synthetic Minority Over-sampling Technique) [2] is a classic method for balancing data that reduces the imbalance by synthesizing new minority-class samples. SMOTE relies on the similarity of existing minority-class instances in feature space: for an instance $x_i$, its k nearest neighbours are selected. One of these k neighbours is then chosen at random, say $x_{knn}$, and the newly synthesized instance is $x_{syn} = x_i + (x_{knn} - x_i) \times \alpha$, where α is a random number between 0 and 1. This way of synthesizing samples avoids the over-fitting that random oversampling may cause and also moves the decision boundary towards the majority class, thereby improving the classification accuracy of the minority class; it is widely used.
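The synthesis rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming Euclidean distance and neighbours drawn only from the minority class; the function names are hypothetical and this is not a full SMOTE implementation.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def smote_sample(minority: List[Vector], i: int, k: int, rng: random.Random) -> Vector:
    """Synthesize x_syn = x_i + (x_knn - x_i) * alpha from one of the k nearest minority neighbours."""
    x_i = minority[i]
    others = [x for j, x in enumerate(minority) if j != i]
    if not others:
        raise ValueError("SMOTE needs at least two minority samples")
    neighbours = sorted(others, key=lambda x: squared_distance(x_i, x))[:k]
    x_knn = rng.choice(neighbours)  # pick one of the k nearest neighbours at random
    alpha = rng.random()            # random number in [0, 1)
    return [a + (b - a) * alpha for a, b in zip(x_i, x_knn)]


rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
synthetic = smote_sample(minority, i=0, k=1, rng=rng)
```

The synthetic point always lies on the segment between `x_i` and the chosen neighbour, which is why a label with only one sample cannot be oversampled this way.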
(3) Ensemble learning on multi-label imbalanced data
In machine learning, it is often difficult for a single classifier to achieve both high accuracy and strong generalization ability. Ensemble learning trains a series of classifiers and combines their results by some rule, thereby obtaining better learning performance and generalization ability than a single classifier. The invention takes Bagging [3], a classic ensemble learning algorithm, as its framework and, in view of the characteristics of imbalanced virtual asset data, integrates the feedforward neural network and the SMOTE sampling technique into it. The specific steps of the algorithm are as follows:
Step1: Given the training sample set, draw k samples at random with replacement to obtain a training subset S' containing k samples.
Step2: Count the frequency of occurrence $f_i$ of each label $L'_i$ in the training subset, and compute in turn the ratio $f_i/f_{max}$ ($i = 1, \ldots, k$) between each label frequency and the maximum frequency. If $f_i/f_{max}$ does not exceed the minimum ratio threshold between label frequencies, the samples carrying label $L'_i$ are oversampled. If $f_i = 1$, i.e. only one sample carries label $L'_i$, SMOTE cannot be used to synthesize new samples, so a simple copy strategy is used, giving a copy set. If $f_i > 1$, SMOTE is applied to the samples carrying label $L'_i$, giving a sampled set. Finally, the original training samples are merged with the copied and sampled data, giving a training subset S'' whose class labels are approximately balanced.
Step3: Normalize the set S'' as shown in formula 6, then train the feedforward neural network described in (1) on the normalized data set.
Step4: Repeat Step1 to Step3 to obtain N trained neural network models.
Step5: Feed each test sample $x_t$ of the test set (r is the number of test samples) into all N trained neural network models and collect their outputs, giving an output matrix C of size m × N.
Step6: Initialize the result label set Ω as empty and traverse the output matrix C by rows. If row i satisfies the decision condition, add label $l_i$ to Ω; otherwise do not. Once C has been fully traversed, Ω is the class label set of sample $x_t$.
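Step1 and the decision logic of Step2 can be sketched as follows. The sketch assumes a sample is a pair of a feature vector and a label set, and it takes the oversampling condition as $f_i/f_{max} \le Threshold$, as stated in the detailed description. The function names and the strategy names "keep", "copy" and "smote" are hypothetical; the sketch only plans the strategy per label and does not synthesize samples.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple

Sample = Tuple[List[float], Set[str]]


def bootstrap(data: List[Sample], k: int, rng: random.Random) -> List[Sample]:
    """Step1: draw k samples at random with replacement."""
    return [rng.choice(data) for _ in range(k)]


def oversampling_plan(subset: List[Sample], threshold: float) -> Dict[str, str]:
    """Step2: pick a strategy per label from its frequency relative to the most frequent label."""
    freq = Counter(label for _, ls in subset for label in ls)
    if not freq:
        return {}
    f_max = max(freq.values())
    plan = {}
    for label, f in freq.items():
        if f / f_max > threshold:
            plan[label] = "keep"   # frequent enough, no oversampling
        elif f == 1:
            plan[label] = "copy"   # a single sample cannot be interpolated
        else:
            plan[label] = "smote"  # synthesize new minority samples
    return plan


subset = [([0.0], {"a"})] * 10 + [([1.0], {"b"})] * 2 + [([2.0], {"c"})]
plan = oversampling_plan(subset, threshold=0.5)
resampled = bootstrap(subset, 5, random.Random(1))
```

In a full pipeline the plan would drive the copy and SMOTE steps before normalization and training, and the whole sequence would be repeated N times to build the ensemble.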
Compared with the prior art, the present patent application combines the ensemble learning method Bagging, the multi-layer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applies them to the classification of multi-label imbalanced virtual asset data, which can effectively improve classification accuracy.
[1] Baum E B, Haussler D. What size net gives valid generalization?[J]. Neural computation, 1989, 1(1): 151-160.
[2] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of artificial intelligence research, 2002, 16(1): 321-357.
[3] http://en.wikipedia.org/wiki/Bootstrap_aggregating.
Brief description of the drawings
Fig. 1 is the architecture diagram of the massive multi-structured virtual asset data management system.
Fig. 2 is the algorithm flow chart.
Detailed description of the invention
The technical solution is further illustrated below through a specific embodiment:
The technical solution includes: a description of the virtual asset storage architecture, the sampling of imbalanced transaction data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, shown in Fig. 1. The underlying system architecture is deployed on a conventional distributed computing environment, and a distributed file system provides transparent access to the file data on each node of that environment. The distributed computing nodes comprise 170 high-performance servers (two Intel Xeon E5640, 2.66GHz; 16G DDR3 memory; two PCI-Express; redundant power supplies and fans), each with one built-in 1TB disk. To improve network stability and bandwidth, two sets of networks are configured, and the network system is connected through ten 48-port gigabit switches. In addition, to strengthen disaster-tolerant backup, the system also includes 8 disk arrays, 800 1TB hard disks, 48 disk enclosures, 32 RAID cards, and 8 SAN switches. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file system and data; this unified management of files and data is carried out by the data organization and data management modules.
The query processing subsystem for massive multi-structured data is oriented to retrieval applications over massive personal identity and attribute information. It supports efficient query processing of multi-structured data and includes modules such as the complex data model and the hybrid data operation mode. The invention mainly targets the log analysis and mining module in this subsystem, and aims to use data mining techniques to detect abnormal behaviour in virtual asset transactions quickly and efficiently.
The service publishing, customization, and programming interface subsystem is the external interface of the system. It defines programmatic interfaces to the data in a service-oriented manner, supporting SQL queries on structured data and API and SQL-like queries on unstructured data. It also lets users customize the personal information query service interface by way of service interface customization. The invention can likewise use the data access interfaces provided by the system to query and analyse virtual asset transaction data. When applying the invention in practice, one may mine and analyse the logs, query and analyse data through the data interface, or combine the two, choosing the most suitable approach for the problem at hand.
2. Sampling of imbalanced transaction data and construction of the classifier
In multi-label classification, the kinds of association between labels are varied. Finding all combination rules is computationally expensive, the resulting classifier generalizes poorly, and its effect in practical use cannot be guaranteed. It is therefore crucial to use a classifier that does not need to state the label association rules explicitly yet learns well both the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the invention uses a neural network for training. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to address the imbalance of virtual asset data, in each round of ensemble learning the data are sampled according to the imbalance ratio between the classes. The specific steps are as follows:
1) Feedforward neural network
The invention uses a typical multi-layer feed-forward neural network, consisting of an input layer, a hidden layer, and an output layer, each containing a number of neurons. The input layer receives the external input data, and the output layer outputs the final result. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, and neurons within the same layer are generally not connected to each other. The network is trained with the objective of minimizing the classification error. The invention uses the typical gradient descent algorithm to update the connection weights and biases.
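The structure described above, with full connections between consecutive layers and none within a layer, can be sketched as a forward pass plus a generic gradient descent update. This is a minimal illustration with one hidden layer and sigmoid activations; the function names, the all-zero weights, and the learning rate are hypothetical, and the gradients themselves (backpropagation) are not computed here.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Sigmoid written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def layer(inputs: List[float], weights: Matrix, biases: List[float]) -> List[float]:
    """Each neuron sees every input from the previous layer (one weight row per neuron)."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x: List[float], w_hidden: Matrix, b_hidden: List[float],
            w_out: Matrix, b_out: List[float]) -> List[float]:
    """Input layer -> hidden layer -> output layer (one output neuron per label)."""
    return layer(layer(x, w_hidden, b_hidden), w_out, b_out)


def gradient_step(values: List[float], grads: List[float], learning_rate: float) -> List[float]:
    """Gradient descent update for a list of weights or biases."""
    return [v - learning_rate * g for v, g in zip(values, grads)]


# Two attributes, two hidden neurons, one label; all-zero parameters for illustration.
out = forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [[0.0, 0.0]], [0.0])
updated = gradient_step([1.0], [0.5], learning_rate=0.1)
```

With all-zero parameters every neuron outputs 0.5, which makes the forward pass easy to check by hand.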
The invention takes minimizing the classification error as its objective; the global error is computed by formula 1, where m is the number of samples in the training set and $E_i$ is the error of the i-th training sample, computed as shown in formula 2. In formula 2, k is the number of labels, and the error is taken between the actual output of the neural network and the desired output.
In this embodiment, the training parameters are set as follows:
The number of input-layer neurons is the number of attributes of the virtual asset data, i.e. each input-layer neuron corresponds to one virtual asset data attribute;
The number of output-layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden-layer neurons is determined with the Baum-Haussler rule [1], computed as shown in formula 3.
$N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})$ (3)
where $N_{hidden}$ is the number of hidden-layer neurons, $N_{train}$ is the number of training samples, $E_{tolerance}$ is the upper bound on the error acceptable for the neural network, and $N_{inputs}$ and $N_{outputs}$ are the numbers of input and output neurons respectively.
The activation function is the conventional Sigmoid function, whose form is given by formula 4:
$f(x) = 1 / (1 + e^{-x})$ (4)
Learning terminates either when the maximum number of learning iterations is reached or when the error difference between two successive iterations falls below a set threshold; the condition can be chosen flexibly according to the actual situation.
The output of the neural network is continuous, whereas the classes of virtual asset data are discrete. The output is therefore converted into the discrete value 0 or 1 by a threshold, as shown in formula 5: the result is 1 if out > c and 0 otherwise, where out is the raw output of the neural network and c is a threshold set manually on the basis of experiments. Only when the raw output exceeds the set threshold is the test sample given the corresponding label.
2) SMOTE method
SMOTE relies on the similarity of existing minority-class instances in feature space: for an instance $x_i$, its k nearest neighbours are selected. One of these k neighbours is then chosen at random, say $x_{knn}$, and the newly synthesized instance is $x_{syn} = x_i + (x_{knn} - x_i) \times \alpha$, where α is a random number between 0 and 1. This way of synthesizing samples avoids the over-fitting that random oversampling may cause and also moves the decision boundary towards the majority class, thereby improving the classification accuracy of the minority class; it is widely used.
3) the integrated study of multi-tag unbalanced data
In machine-learning process, the precision of single grader and generalization ability are often difficult to simultaneously very high.And collect
Become study to be learnt by using a series of grader, and use certain rule the result of each grader
Carry out integrating thus obtain more preferable results of learning and generalization ability than single grader.The present invention is with integrated
Classic algorithm Bagging in habit is framework, according to the feature of uneven fictitious assets data, and will feedforward god
It is fused in integrated study framework through network and SMOTE Sampling techniques.Concrete operation step is as follows:
Step1: given training sample setWherein D={x1,x2,L,xn}
Comprising n sample, each sample has d attribute, L={l1,l2,L,lmComprise m label.By k time
Putting back to sampling at random, extraction from training sample set obtains a training subset comprising k sample
Step 2: Count the frequency of occurrence of each label L'_i in the training subset, denoted fre = {f_1, f_2, …, f_k}. Let f_max = max f_i (i = 1, …, k), and compute in turn the ratio f_i/f_max (i = 1, …, k) between each label's frequency and the maximum frequency. If f_i/f_max ≤ Threshold, where Threshold is the minimum ratio threshold between label frequencies, the samples containing label L'_i are over-sampled. If f_i = 1, that is, only one sample contains label L'_i, the SMOTE method described in (2) cannot synthesize new samples, so a simple copy strategy is used instead, giving the copy set D'_1. If f_i > 1, SMOTE is used to over-sample the samples containing label L'_i, giving the over-sampled set D'_2. Finally, the original training samples are merged with the copied and sampled data, giving a training subset with approximately balanced class labels, S'' = {(x''_1, L''_1), (x''_2, L''_2), …, (x''_k', L''_k')}.
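The selection rule in Step 2 can be sketched as follows. This is an illustrative sketch only: the function name `plan_oversampling`, the string tags "copy" and "smote", and the example threshold of 0.5 are assumptions and do not come from the patent text.

```python
from collections import Counter


def plan_oversampling(label_sets, threshold):
    """Return {label: "copy" | "smote"} for the labels that need over-sampling."""
    # f_i: number of samples in the subset that carry label i.
    freq = Counter(label for labels in label_sets for label in labels)
    f_max = max(freq.values())
    plan = {}
    for label, f_i in freq.items():
        if f_i / f_max <= threshold:
            # A single sample has no neighbor to interpolate with, so it is copied.
            plan[label] = "copy" if f_i == 1 else "smote"
    return plan


# Label sets of six samples in a hypothetical training subset.
subset_labels = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}, {"a", "c"}, {"a"}]
plan = plan_oversampling(subset_labels, threshold=0.5)
```

Here label "a" appears in all six samples, "c" in two, and "b" in one, so "b" falls under the copy strategy and "c" under SMOTE, while "a" is left alone.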
Step 3: Normalize the set S'' as shown in Equation 6, then train the feedforward neural network described in (1) on the normalized data set.
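As a rough illustration of the normalization in Step 3, the sketch below assumes that Equation 6 is ordinary min-max scaling applied per attribute, which is consistent with the min and max terms named in claim 3 but is not confirmed by this passage. The function name `min_max_normalize` and the handling of constant attributes are likewise assumptions.

```python
def min_max_normalize(samples):
    """Scale every attribute to [0, 1] using the subset's own minimum and maximum."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in samples:
        normalized.append([
            # A constant attribute has max == min; map it to 0.0 to avoid dividing by zero.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


scaled = min_max_normalize([[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]])
```

The minimum and maximum are taken from the balanced subset S'' itself, so each round of the ensemble normalizes with its own statistics.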
Step 4: Repeat Steps 1 to 3 to obtain N trained neural network models {NN_1, NN_2, …, NN_N}.
Step 5: For a test sample x_t in the test set, where r is the number of test samples, input x_t into each of the N trained neural network models and collect the outputs, giving an output matrix C of size m × N, where c_ij = 0 or 1, i = 1, …, m, j = 1, …, N. Each row of C holds the N classifiers' individual judgments on whether x_t carries the corresponding label.
Step 6: Initialize the result label set Ω as empty, Ω = { }. Traverse the output matrix C row by row; if more than half of the N classifiers in row i output 1, that is, Σ_{j=1}^{N} c_ij > N/2, add label l_i to the label set, Ω = Ω ∪ {l_i}; otherwise do not add it. After C has been traversed, Ω is the final class label set of sample x_t.
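Steps 5 and 6 can be sketched as a row-wise majority vote over the output matrix. The sketch assumes the rule stated in claim 3, namely that a label is kept when the sum of its row exceeds half the number of classifiers; the function name `majority_vote` and the example matrix are hypothetical.

```python
def majority_vote(C, labels):
    """Return the labels whose row in C gets votes from more than half of the classifiers."""
    omega = set()
    for row, label in zip(C, labels):
        n_classifiers = len(row)
        if sum(row) > n_classifiers / 2:
            omega.add(label)
    return omega


# Rows are labels l_1..l_3 (m = 3); columns are classifiers NN_1..NN_5 (N = 5).
C = [
    [1, 1, 0, 1, 0],  # 3 of 5 votes: kept
    [0, 1, 0, 0, 0],  # 1 of 5 votes: dropped
    [1, 1, 1, 1, 1],  # 5 of 5 votes: kept
]
omega = majority_vote(C, ["l1", "l2", "l3"])
```

Because the comparison is strict, a label with exactly half of the votes (possible when N is even) is not added.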
Compared with the prior art, the invention combines the ensemble learning method Bagging, a multilayer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applies them to the classification of multi-label imbalanced virtual asset data, which can effectively improve classification accuracy.
The invention has been described above by way of example. Clearly, its implementation is not limited to the manner described: any improvement made using the technical solution of the invention, or any direct application of the concept and technical solution of the invention to other settings without improvement, falls within the scope of protection of the invention.
Claims (3)
1. A multi-label imbalanced virtual asset data classification method based on ensemble learning, characterized in that it comprises the following steps: describing the virtual asset data storage architecture, and processing the multi-label imbalanced virtual asset data and constructing the classifier;
wherein the steps of processing the multi-label imbalanced virtual asset data and constructing the classifier include: training with a neural network in combination with an ensemble learning method, using the neural network as the weak classifier in each round of learning; and, in each round of ensemble learning, sampling the data according to the imbalance ratio between the different classes of data.
2. The multi-label imbalanced virtual asset data classification method based on ensemble learning according to claim 1, characterized in that the steps of processing the multi-label imbalanced virtual asset data and constructing the classifier include:
Step one, feedforward neural network;
Step two, SMOTE sampling method;
Step three, ensemble learning on multi-label imbalanced data.
3. The classifier construction according to claim 2, characterized in that the step of ensemble learning on multi-label imbalanced data comprises the following steps:
1) given the training sample set S, perform repeated random sampling with replacement, each draw taking a sample from the training sample set, to form a training subset S';
2) count the frequency of occurrence of each label in the training subset, and compute in turn the ratio between each label's frequency and the maximum frequency:
if this ratio exceeds the minimum ratio threshold between label frequencies, over-sample the samples containing that label; if the frequency of occurrence of a label is 1, that is, only one sample contains the label, use a simple copy strategy to obtain a copy set; if the frequency of occurrence is greater than 1, use the SMOTE method to sample the samples containing the label, obtaining a sampled set S';
finally, merge the original training samples with the copied and sampled data to obtain a training subset S'' with approximately balanced class labels;
3) normalize the set S'' by the formula shown below, then train the feedforward neural network on the normalized data set;
where x*''_i is the i-th sample of subset S'' after normalization, and x''_i is the i-th sample of subset S''; min and max are respectively the minimum and maximum values of the samples in this subset;
4) repeat steps 1) to 3) to obtain the trained neural network models;
5) input all test samples into each of the neural network models obtained in step 4) and collect the outputs to obtain an output matrix;
6) initialize the result label set as empty and traverse the output matrix row by row; by the majority voting principle, sum the output results of all classifiers, and if the sum exceeds half, add the label to the label set, otherwise do not add it; after the whole matrix has been traversed, the final class label set of the sample is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510130968.0A CN106156029A (en) | 2015-03-24 | 2015-03-24 | The uneven fictitious assets data classification method of multi-tag based on integrated study |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106156029A (en) | 2016-11-23 |
Family
ID=57339484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510130968.0A Pending CN106156029A (en) | 2015-03-24 | 2015-03-24 | The uneven fictitious assets data classification method of multi-tag based on integrated study |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156029A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090097741A1 (en) * | 2006-03-30 | 2009-04-16 | Mantao Xu | Smote algorithm with locally linear embedding |
CN102945280A (en) * | 2012-11-15 | 2013-02-27 | 翟云 | Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method |
CN103500205A (en) * | 2013-09-29 | 2014-01-08 | 广西师范大学 | Non-uniform big data classifying method |
CN104102700A (en) * | 2014-07-04 | 2014-10-15 | 华南理工大学 | Categorizing method oriented to Internet unbalanced application flow |
CN104091073A (en) * | 2014-07-11 | 2014-10-08 | 中国人民解放军国防科学技术大学 | Sampling method for unbalanced transaction data of fictitious assets |
Non-Patent Citations (2)
Title |
---|
HU LI, PENG ZOU ET AL.: "A Combination Method for Multi-Class Imbalanced Data Classification", 2013 10TH WEB INFORMATION SYSTEM AND APPLICATION CONFERENCE * |
HU LI, PENG ZOU ET AL.: "Ensemble Multi-Label Learning Based on Neural Network", ICIMCS'13 PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106973057B (en) * | 2017-03-31 | 2018-12-14 | 浙江大学 | A kind of classification method suitable for intrusion detection |
CN106973057A (en) * | 2017-03-31 | 2017-07-21 | 浙江大学 | A kind of sorting technique suitable for intrusion detection |
CN107180155B (en) * | 2017-04-17 | 2019-08-16 | 中国科学院计算技术研究所 | A kind of disease forecasting system based on Manufacturing resource model |
CN107180155A (en) * | 2017-04-17 | 2017-09-19 | 中国科学院计算技术研究所 | A kind of disease forecasting method and system based on Manufacturing resource model |
CN109272003A (en) * | 2017-07-17 | 2019-01-25 | 华东师范大学 | A kind of method and apparatus for eliminating unknown error in deep learning model |
CN107870321A (en) * | 2017-11-03 | 2018-04-03 | 电子科技大学 | Radar range profile's target identification method based on pseudo label study |
CN107870321B (en) * | 2017-11-03 | 2020-12-29 | 电子科技大学 | Radar one-dimensional range profile target identification method based on pseudo-label learning |
CN108153657A (en) * | 2017-12-22 | 2018-06-12 | 北京交通大学 | The method of large-scale data center server application Partition of role |
CN110147804A (en) * | 2018-05-25 | 2019-08-20 | 腾讯科技(深圳)有限公司 | A kind of unbalanced data processing method, terminal and computer readable storage medium |
CN110147804B (en) * | 2018-05-25 | 2023-07-14 | 腾讯科技(深圳)有限公司 | Unbalanced data processing method, terminal and computer readable storage medium |
CN109190698B (en) * | 2018-08-29 | 2022-02-11 | 西南大学 | Classification and identification system and method for network digital virtual assets |
CN109190698A (en) * | 2018-08-29 | 2019-01-11 | 西南大学 | A kind of classifying and identifying system and method for network digital fictitious assets |
CN109033471A (en) * | 2018-09-05 | 2018-12-18 | 中国信息安全测评中心 | A kind of information assets recognition methods and device |
CN109033471B (en) * | 2018-09-05 | 2022-11-08 | 中国信息安全测评中心 | Information asset identification method and device |
CN111105238A (en) * | 2019-11-07 | 2020-05-05 | 中国建设银行股份有限公司 | Transaction risk control method and device |
CN110968693A (en) * | 2019-11-08 | 2020-04-07 | 华北电力大学 | Multi-label text classification calculation method based on ensemble learning |
CN112530595A (en) * | 2020-12-21 | 2021-03-19 | 无锡市第二人民医院 | Cardiovascular disease classification method and device based on multi-branch chain type neural network |
CN113255831A (en) * | 2021-06-23 | 2021-08-13 | 长沙海信智能系统研究院有限公司 | Sample processing method, device, equipment and computer storage medium |
CN113657446A (en) * | 2021-07-13 | 2021-11-16 | 广东外语外贸大学 | Processing method, system and storage medium of multi-label emotion classification model |
CN118468151A (en) * | 2024-06-28 | 2024-08-09 | 深圳市广通工程顾问有限公司 | Automatic management method and system for classification of network digital virtual assets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province; Applicant after: National University of Defense Technology; Address before: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province; Applicant before: NATIONAL University OF DEFENSE TECHNOLOGY |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161123 |