CN106156029A - Multi-label imbalanced virtual asset data classification method based on ensemble learning - Google Patents

Multi-label imbalanced virtual asset data classification method based on ensemble learning Download PDF

Info

Publication number
CN106156029A
CN106156029A CN201510130968.0A CN201510130968A
Authority
CN
China
Prior art keywords
data
label
tag
sample
virtual assets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510130968.0A
Other languages
Chinese (zh)
Inventor
李虎
贾焰
韩伟红
李树栋
李爱平
周斌
杨树强
黄九鸣
全拥
邓璐
朱伟辉
傅翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201510130968.0A priority Critical patent/CN106156029A/en
Publication of CN106156029A publication Critical patent/CN106156029A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an ensemble-learning-based classification method for multi-label imbalanced virtual asset data, comprising the following steps: under the distributed storage architecture for virtual assets, random sampling with replacement is first performed on the virtual asset data; a feedforward neural network is then used to learn the multi-label data, so that the correlations between labels are captured in the trained connection weights of the network; meanwhile, SMOTE sampling is applied selectively according to the label distribution in the sampled data; finally, to improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Compared with the prior art, the present invention takes the classic ensemble algorithm Bagging as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework, which can effectively improve classification accuracy.

Description

Multi-label imbalanced virtual asset data classification method based on ensemble learning
Technical field
This technology belongs to the field of network and information security, and relates to a multi-label imbalanced virtual asset data classification method based on ensemble learning.
Background technology
The rapid development of the Internet has provided a broad platform for the generation and trading of virtual assets and has promoted the prosperity and development of online trading. However, both users conducting transactions and providers of virtual assets face the problem that virtual asset data (including virtual asset commodity information, related virtual asset transaction data, virtual asset operation logs, etc.) are numerous and disorganized. Classifying these virtual asset data can help people manage virtual assets better and effectively improve their usage efficiency.
At present, China has carried out research on eID-based cyberspace virtual asset management and escrow technology to achieve unified and standardized management of virtual assets. A virtual asset security system comprehensively and accurately records the virtual asset commodities themselves and the various operations performed on them. However, these data come in a wide variety of types, the information of different virtual assets differs, and the operation behavior patterns of users vary greatly, so classifying these virtual asset data faces many difficulties. In addition, the amount of virtual asset data differs considerably between categories; for example, abnormal transaction data are usually far fewer than normal transaction data, and abnormal transactions come in multiple possible forms, such as abnormal transaction time, abnormal transaction amount, and abnormal transaction frequency. Different anomalies may coexist, i.e. one virtual asset data record may belong to multiple categories, or be marked with multiple labels, at the same time. Given the multi-label nature of the data and the imbalance in data volume between categories, classifying virtual asset data faces many challenges.
In traditional classification problems, each sample belongs to only one category or carries only one label; this is single-label learning. However, as stated above, many samples in virtual asset data belong to multiple categories at the same time. Such problems can be attributed to multi-label learning. Its formal definition is as follows: assume the data set D = {x_1, x_2, …, x_n} contains n samples, each sample has d attributes, and the samples have m labels L = {l_1, l_2, …, l_m}. The multi-label classification problem is then regarded as follows: given training data with known category labels, i.e. {(x_1, L_1), (x_2, L_2), …, (x_n, L_n)} with L_i ⊆ L, construct a classifier f: D → 2^L such that a test set whose attributes are known but whose labels are unknown can be labeled correctly. When both the training data and the test data satisfy |L_i| = 1, the multi-label classification problem degenerates into a multi-class classification problem. In particular, when m = 2, the multi-label classification problem degenerates into the traditional two-class classification problem.
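As a toy illustration of the definition above (the values and label assignments are invented for illustration and are not part of the invention), the following Python sketch sets up a data set with n = 4 samples, d = 2 attributes, and m = 3 labels, where one sample carries two labels at once:

```python
# Toy data set matching the definition above (illustrative values only).
D = {                                   # n = 4 samples, each with d = 2 attributes
    "x1": (0.3, 120.0),
    "x2": (0.8, 45.0),
    "x3": (0.1, 990.0),
    "x4": (0.7, 60.0),
}
L = {"l1", "l2", "l3"}                  # m = 3 labels

label_sets = {                          # each L_i is a subset of L; x3 carries two labels
    "x1": {"l1"},
    "x2": {"l2"},
    "x3": {"l1", "l3"},
    "x4": {"l2"},
}

# When every |L_i| = 1 the task degenerates to ordinary multi-class classification.
print(all(len(Li) == 1 for Li in label_sets.values()))   # False: x3 has two labels
```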
In recent years, the multi-label classification problem has received extensive attention and research. Its solutions can mainly be divided into algorithm adaptation based methods (Algorithm Adaptation based methods) and problem transformation based methods (Problem Transformation based methods) [1].
Algorithm adaptation based methods mainly modify existing single-label learning algorithms so that they can handle multi-label data. The advantage of such methods is that, for a specific practical problem, a method focused on a particular algorithm outperforms algorithm-independent methods. This class of methods mainly includes decision trees, support vector machines (Support Vector Machines, SVMs), nearest neighbor methods (K-Nearest Neighbor, KNN), and neural networks. Clare et al. [2] use decision trees to solve the multi-label classification problem, extending the definition of entropy from single-label learning to multi-label classification. Elisseeff et al. [3] propose a kernel-based SVM method for multi-label classification. For the traditional kNN method, Zhang et al. [4] propose MLkNN: the prior probability of each label is first computed; when a data point x to be classified is input, for each label l in the label set L the probabilities that x has label l and does not have label l are computed, and then whether x should be marked with label l is predicted by combining them. On this basis, Zhang et al. [5] further propose a multi-label lazy learning classification algorithm, ML-kNN. Zhang et al. [6] define a global optimization function for multi-label data so that artificial neural networks can handle multi-label data; the basic idea is that if many samples carry two labels simultaneously, then when one of the two labels appears, the other is likely to appear as well. However, such methods require a deep understanding of the corresponding algorithm and application-domain knowledge, which is difficult for ordinary users; their scope of application is limited and they are hard to popularize.
Problem transformation based methods, on the other hand, convert the multi-label classification problem into a group of single-label classification problems, so that existing single-label learning methods can be used to solve it. Because such methods can directly use various existing mature methods and are easy to understand, they have been widely applied. A typical multi-label classification method based on problem transformation is binary relevance (Binary Relevance, BR), which treats the prediction of each label as an independent single-label classification problem, trains an independent classifier for each label, and then trains each classifier with all the training data. However, this method ignores the correlations between labels, and its performance is still unsatisfactory. Shen et al. [7] improve BR through copy and weighted-copy methods, splitting the multi-label data in the original training set into multiple single-label data and assigning corresponding weights. Schapire and Singer [8] propose the AdaBoost-based BoosTexter method for multi-label text classification, which in each round gives larger weights to the samples that are harder to classify. McCallum [9] proposes a Bayesian method for multi-label classification, learned with expectation maximization (Expectation Maximization, EM). Ueda and Saito [10] propose two generative models for multi-label classification, PMM1 and PMM2. Read [11] proposes the LP (Label Powerset) method, which binary-encodes every label combination in the training data to form new labels; this method converts multi-label data into single-label data by encoding, but its algorithmic complexity is high. Schapire and Singer [12] propose the AdaBoost.MH algorithm for multi-label data based on the single-label method AdaBoost.M1; this algorithm generates q new single-label training instances for each multi-label training instance, which increases the amount of training data and therefore the model training time. All of these methods involve only a single algorithm. Tsoumakas et al. [13], after summarizing existing algorithms, propose a two-layer multi-label classification model that uses BR, decision trees, or SVM with K-fold cross-training in the first layer and classifies further with the BR or SVM algorithm in the second layer, achieving good results.
In the above studies, when the multi-label problem is decomposed into a group of single-label problems, the labels are simply assumed to be independent of each other, i.e. training is performed on single-label data each time, and the correlations between labels are seldom considered. In real virtual asset data, however, correlations between labels are common and take many forms. In addition, most existing algorithms are based on the assumption that the sample sizes of different categories are roughly equal, but in many data sets, including virtual asset data, imbalance between categories (labels) is common. Furthermore, some studies simply use a single classifier or a small number of classifiers to train on multi-label data, so the generalization performance suffers to some extent.
[1]Tsoumakas G and Katakis I.Multi-label classification:An overview[J]. International Journal of Data Warehousing and Mining(IJDWM),2007,3(3):1-13.
[2]Clare A and King R D.Knowledge discovery in multi-label phenotype data[M]//Principles of data mining and knowledge discovery.Springer Berlin Heidelberg,2001:42-53.
[3]Elisseeff A and Weston J.A kernel method for multi-labelled classification[C]//Advances in neural information processing systems,2001: 681-687.
[4]Zhang M L and Zhou Z H.A k-nearest neighbor based algorithm for multi-label classification[C]//Granular Computing,2005IEEE International Conference on.IEEE,2005,2:718-721.
[5]Zhang M L and Zhou Z H.ML-KNN:A lazy learning approach to multi-label learning[J].Pattern recognition,2007,40(7):2038-2048.
[6]Zhang M L and Zhou Z H.Multilabel neural networks with applications to functional genomics and text categorization[J].Knowledge and Data Engineering, IEEE Transactions on,2006,18(10):1338-1351.
[7]Shen X,Boutell M,Luo J,et al.Multilabel machine learning and its application to semantic scene classification[C]//Electronic Imaging 2004. International Society for Optics and Photonics,2003:188-199.
[8]Schapire R.E.and Singer Y.BoosTexter:a boosting-based system for text categorization[J].Machine Learning,2000,39(2-3):135-168.
[9]McCallum A.Multi-label text classification with a mixture model trained by EM[C]//AAAI’99 Workshop on Text Learning,1999:1-7
[10]Ueda N.and Saito.K.Parametric mixture models for multi-label text[C]//Advances in neural information processing systems,2002:721-728.
[11]Read J.A pruned problem transformation method for multi-label classification[C]//Proc.2008 New Zealand Computer Science Research Student Conference(NZCSRS 2008),2008:143-150.
[12]Schapire R.E.and Singer Y.Improved boosting algorithms using confidence-rated predictions[J].Machine learning,1999,37(3):297-336.
[13]Tsoumakas G,Dimou A,Spyromitros E,et al.Correlation-based pruning of stacked binary relevance models for multi-label learning[C]//Proceedings of the 1st International Workshop on Learning from Multi-Label Data,2009:101-116.
Content of the invention
In view of the above technical problems, the present invention proposes a multi-label imbalanced virtual asset data classification method based on ensemble learning to classify virtual asset data. It is applicable to classifying the numerous and disorganized virtual asset data on the Internet, and is particularly suited to classifying multi-label virtual asset data with imbalance between categories.
The technical solution of the present invention includes: the description of the virtual asset data storage architecture, the processing of multi-label imbalanced virtual asset data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, including components such as the organization and management of massive multi-structured data, the query processing of massive multi-structured data, service publication, and the programming interface.
The bottom layer of the system is deployed on a traditional distributed computing environment, and a distributed file system provides transparent access to the file data on each node in the distributed computing environment. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file systems or data, where the unified management of files or data is carried out by the data organization and data management modules. In addition, it also includes the deployment and configuration management of different data/files in the underlying distributed computing environment.
The query processing subsystem for massive multi-structured data is oriented toward massive personal identity/attribute information retrieval applications and supports efficient query processing of multi-structured data, including modules such as the complex data model and the mixed data operation mode. The present invention mainly targets the log analysis and mining module therein, and aims to use data mining technology to rapidly and efficiently detect abnormal behaviors in virtual asset transactions.
The service publication, customization, and programming interface subsystem is the external interface of the system. It defines programming interfaces over the data in a service-oriented manner, supports SQL queries on structured data and API and SQL-like queries on unstructured data, and allows users to customize the personal information query service interface by means of service interface customization. The present invention can also use the data access interface provided by the system to query and analyze virtual asset transaction data. When the present invention is applied in practice, log mining and analysis can be performed, data query and analysis can be performed through the data interface, or the two approaches can be combined; the most suitable mode can be chosen according to the practical problem at hand.
2. Processing of multi-label imbalanced virtual asset data and construction of the classifier
In multi-label classification problems, the types of association between labels are various; finding all combination rules is computationally expensive, the generalization of the learned classifier is poor, and the effect in practical popularization and application cannot be guaranteed. Therefore, it is crucial to use a classifier that does not need to explain the label association rules explicitly but is good at learning the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the present invention uses a neural network for training and learning. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to solve the imbalance problem of virtual asset data, in each round of ensemble learning the data are sampled according to the degree of imbalance between the data of different categories. The specific steps are as follows:
(1) Feedforward neural network
The present invention uses a typical multi-layer feedforward neural network (Multi-Layer Feed-Forward Neural Network), composed of an input layer, a hidden layer, and an output layer, each layer containing several neurons. The input layer is responsible for receiving external input data, and the output layer produces the final output. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, while neurons within the same layer are generally not connected. During neural network training, the goal is to minimize the classification error. The present invention uses the typical gradient descent (Gradient Descent) algorithm to update and learn the connection weights and biases.
The present invention takes minimizing the classification error as its goal; the global error is computed by formula 1.
E = \sum_{i=1}^{m} E_i    (1)
where m is the number of samples in the training set and E_i is the error of the i-th training sample, computed as shown in formula 2.
E_i = \sum_{j=1}^{k} (c_j^i - d_j^i)^2    (2)
where k is the number of labels, c_j^i is the actual output of the neural network, and d_j^i is the desired output.
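As a quick numeric check of formulas 1 and 2 (the outputs and targets below are made-up values, purely for illustration):

```python
# Per-sample squared error (formula 2) and global error (formula 1) for m = 2 samples
# with k = 3 labels each; c are network outputs, d are the desired 0/1 targets.
c = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
d = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

E_i = [sum((cj - dj) ** 2 for cj, dj in zip(ci, di)) for ci, di in zip(c, d)]
E = sum(E_i)
print(E_i, E)   # approximately [0.14, 0.21] and 0.35
```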
The specific parameters are set as follows:
The number of input layer neurons is the number of attributes of the virtual asset data, i.e. each input layer neuron corresponds to one attribute of the virtual asset data;
The number of output layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden layer neurons is determined using the Baum-Haussler rule [1], computed as shown in formula 3.
N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})    (3)
where N_{hidden} is the number of hidden layer neurons, N_{train} is the number of training samples, E_{tolerance} is the acceptable upper bound on the error of the neural network, and N_{inputs} and N_{outputs} denote the numbers of input and output neurons, respectively.
The activation function is the traditional Sigmoid function, whose form is shown in formula 4.
g(x) = \frac{1}{1 + e^{-x}}    (4)
The termination condition of learning can be reaching a maximum number of learning iterations, or the error between two adjacent learning iterations falling below a set threshold; it can be set flexibly according to the actual situation.
The output of the neural network is a continuous value, while the categories of virtual asset data are discrete. Therefore, we convert the output into the discrete value 0 or 1 by setting a threshold. The calculation is shown in formula 5.
f(out) = \begin{cases} 1, & out \ge c \\ 0, & out < c \end{cases}    (5)
where out is the original output of the neural network and c is a threshold set empirically through experiments. Only when the original output reaches the set threshold is the test sample marked with the corresponding label.
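For illustration only, the following is a minimal NumPy sketch of the network structure described above: the Baum-Haussler sizing of formula 3, the Sigmoid activation of formula 4, a single-hidden-layer forward pass, and the thresholding of formula 5. The function names, the random weights, the tolerance value, and the threshold c = 0.5 are illustrative assumptions, not values prescribed by the invention; training by gradient descent is omitted.

```python
import numpy as np

def hidden_layer_size(n_train, n_inputs, n_outputs, e_tolerance=0.1):
    """Baum-Haussler upper bound on the number of hidden neurons (formula 3)."""
    return max(1, int((n_train * e_tolerance) / (n_inputs + n_outputs)))

def sigmoid(x):
    """Activation function g(x) = 1 / (1 + exp(-x)) (formula 4)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: attribute vector in, one continuous score per label out."""
    h = sigmoid(x @ w_hidden + b_hidden)
    return sigmoid(h @ w_out + b_out)

def to_labels(out, c=0.5):
    """Threshold the continuous outputs into 0/1 label indicators (formula 5)."""
    return (out >= c).astype(int)

# Toy setup: d = 6 attributes, m = 3 labels, randomly initialised (untrained) weights.
rng = np.random.default_rng(0)
d, m = 6, 3
n_hidden = hidden_layer_size(n_train=200, n_inputs=d, n_outputs=m)
w_h, b_h = rng.normal(size=(d, n_hidden)), np.zeros(n_hidden)
w_o, b_o = rng.normal(size=(n_hidden, m)), np.zeros(m)

x = rng.random(d)                                   # one normalised data record
print(to_labels(forward(x, w_h, b_h, w_o, b_o)))    # e.g. array([1, 0, 1])
```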
(2) SMOTE sampling method
SMOTE (Synthetic Minority Over-sampling Technique) [2] is a classic data balancing method that reduces the imbalance between data by synthesizing new minority-class samples. SMOTE is based on the similarity of existing minority-class instances in feature space. For an instance x_i, its k nearest instances are selected each time; one of these k neighbors is then chosen at random, say x_knn, and a new instance is synthesized as x_syn = x_i + (x_knn - x_i) × α, where α is a random number between 0 and 1. This synthesis method both avoids the overfitting problem that may exist in random over-sampling and moves the decision boundary toward the majority class, thereby improving the classification precision of the minority class; it is widely used.
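A minimal from-scratch sketch of the SMOTE synthesis step just described, assuming the minority-class samples are rows of a NumPy array; the function name, the k = 5 default, and the toy data are assumptions for illustration, and this is not the reference implementation of [2]:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Synthesise n_new samples: pick a random minority sample x_i, one of its k
    nearest minority neighbours x_knn, and return x_i + alpha * (x_knn - x_i)."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x_i = minority[i]
        dist = np.linalg.norm(minority - x_i, axis=1)   # distances to all carriers
        neighbours = np.argsort(dist)[1:k + 1]          # k nearest, excluding x_i
        x_knn = minority[rng.choice(neighbours)]
        alpha = rng.random()                            # random number in (0, 1)
        synthetic.append(x_i + alpha * (x_knn - x_i))
    return np.array(synthetic)

minority = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]])
print(smote(minority, n_new=3, k=2, rng=np.random.default_rng(1)))
```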
(3) Ensemble learning for multi-label imbalanced data
In machine learning, it is often difficult for a single classifier to achieve both high precision and high generalization ability. Ensemble learning learns with a series of classifiers and integrates the results of the individual classifiers according to certain rules, thereby obtaining better learning results and generalization ability than a single classifier. The present invention takes the classic ensemble algorithm Bagging [3] as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework. The specific algorithm steps are as follows:
Step 1: Given the training sample set, perform random sampling with replacement k times to extract from the training sample set a training subset S' containing k samples.
Step 2: Count the occurrence frequency f_i of each label L'_i in the training subset, and compute in turn the ratio f_i / f_max (i = 1, …, k) between each label frequency and the maximum frequency. If f_i / f_max does not exceed the minimum proportion threshold between label frequencies, the samples containing label L'_i are over-sampled: if f_i = 1, i.e. the number of samples containing label L'_i is 1, the SMOTE method cannot be used to synthesize new samples, so a simple replication strategy is used, obtaining a replication set; if f_i > 1, the SMOTE method is used to sample the samples containing label L'_i, obtaining a sampling set. Finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' in which the class labels are approximately balanced.
Step 3: Normalize the set S'' as shown in formula 6, and then train the feedforward neural network described in (1) on the normalized data set.
x_i''^{*} = (x_i'' - \min) / (\max - \min)    (6)
Step 4: Repeat Step 1 to Step 3 to obtain N trained neural network models.
Step 5: For each sample x_t in the test set {x^t_1, x^t_2, …, x^t_r}, where r is the number of test samples, input it separately into all N trained neural network models and collect their outputs, obtaining an output matrix C of size m × N.
Step 6: Initialize the result label set Ω to the empty set. Traverse the output matrix C row by row; if \sum_{j=1}^{N} c_{ij} > N/2, i.e. more than half of the N classifiers output 1 for label l_i, add the label l_i to the label set Ω, otherwise do not add it. After the whole matrix C has been traversed, the class label set Ω of the sample x_t is obtained.
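To make the whole Bagging + SMOTE + neural-network loop concrete, the sketch below strings Steps 1-6 together on toy data. It is a hedged illustration, not the invention's implementation: scikit-learn's MLPClassifier stands in for the hand-written feedforward network of (1), the per-label rebalancing is a simplified replication/interpolation stand-in for the Step 2 logic, and the threshold 0.5, the ensemble size N = 5, and the data generation are invented values.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def balance_subset(X, Y, threshold=0.5):
    """Step 2 (simplified): labels whose frequency ratio f_i / f_max is at most
    `threshold` get extra samples, by replication when only one carrier exists,
    otherwise by SMOTE-style interpolation between two carriers of that label."""
    freq = Y.sum(axis=0)
    f_max = freq.max()
    X_parts, Y_parts = [X], [Y]
    for i, f in enumerate(freq):
        if f == 0 or f / f_max > threshold:
            continue                                   # label frequent enough
        idx = np.flatnonzero(Y[:, i])
        n_extra = int(f_max * threshold) - int(f)
        for _ in range(max(n_extra, 0)):
            j = rng.choice(idx)
            if len(idx) == 1:                          # single carrier: replicate
                x_syn = X[j].copy()
            else:                                      # SMOTE-style synthesis
                x_syn = X[j] + rng.random() * (X[rng.choice(idx)] - X[j])
            X_parts.append(x_syn[None, :])
            Y_parts.append(Y[j][None, :])
    return np.vstack(X_parts), np.vstack(Y_parts)

# Toy multi-label data: 200 samples, 6 attributes, 3 labels (the third label is rare).
X = rng.random((200, 6))
Y = np.stack([X[:, 0] > 0.3, X[:, 1] > 0.5, X[:, 2] > 0.9], axis=1).astype(int)

N, models = 5, []
for _ in range(N):                                     # Steps 1-4
    boot = rng.integers(0, len(X), size=len(X))        # bootstrap with replacement
    Xb, Yb = balance_subset(X[boot], Y[boot])
    Xb = (Xb - Xb.min(axis=0)) / (Xb.max(axis=0) - Xb.min(axis=0) + 1e-12)  # formula 6
    net = MLPClassifier(hidden_layer_sizes=(8,), activation='logistic', max_iter=800)
    models.append(net.fit(Xb, Yb))

# Steps 5-6: collect the N 0/1 predictions per label and keep labels with a majority.
x_test = X[:3]                                         # already lies in [0, 1] here
votes = np.sum([m.predict(x_test) for m in models], axis=0)
print((votes > N / 2).astype(int))                     # predicted label sets
```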
Compared with the prior art, the present patent application combines the ensemble learning method Bagging, the multi-layer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applies them to the classification of multi-label imbalanced virtual asset data, which can effectively improve classification accuracy.
[1]Baum E B,Haussler D.What size net gives valid generalization?[J].Neural computation,1989,1(1):151-160.
[2]Chawla N V,Bowyer K W,Hall L O,et al.SMOTE:synthetic minority over-sampling technique[J].Journal of artificial intelligence research,2002,16(1): 321-357.
[3]http://en.wikipedia.org/wiki/Bootstrap_aggregating.
Brief description of the drawings
Fig. 1 Architecture diagram of the massive multi-structured virtual asset data management system
Fig. 2 Algorithm flow chart
Detailed description of the invention
The technical solution of the present invention is further illustrated below through specific embodiments:
The technical solution of the present invention includes: the description of the virtual asset storage architecture, the sampling of imbalanced transaction data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, whose overall structure is shown in Fig. 1. The bottom layer of the system is deployed on a traditional distributed computing environment, and a distributed file system provides transparent access to the file data on each node in the distributed computing environment. The distributed computing nodes comprise 170 high-performance servers (two Intel Xeon E5640, 2.66 GHz; 16 GB DDR3 memory; two PCI-Express cards; redundant power supplies and fans), each server with one built-in 1 TB disk. To improve the stability and bandwidth of the network, two sets of networks are configured, and the network system is formed by connecting ten 48-port gigabit switches. In addition, to strengthen disaster recovery and backup capability, the system also includes 8 disk arrays, 800 1 TB hard disks, 48 disk enclosures, 32 RAID cards, and 8 SAN switches. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file systems or data, where the unified management of files or data is carried out by the data organization and data management modules.
The query processing subsystem for massive multi-structured data is oriented toward massive personal identity/attribute information retrieval applications and supports efficient query processing of multi-structured data, including modules such as the complex data model and the mixed data operation mode. The present invention mainly targets the log analysis and mining module therein, and aims to use data mining technology to rapidly and efficiently detect abnormal behaviors in virtual asset transactions.
The service publication, customization, and programming interface subsystem is the external interface of the system. It defines programming interfaces over the data in a service-oriented manner, supports SQL queries on structured data and API and SQL-like queries on unstructured data, and allows users to customize the personal information query service interface by means of service interface customization. The present invention can also use the data access interface provided by the system to query and analyze virtual asset transaction data. When the present invention is applied in practice, log mining and analysis can be performed, data query and analysis can be performed through the data interface, or the two approaches can be combined; the most suitable mode is chosen according to the practical problem at hand.
2. Sampling of imbalanced transaction data and construction of the classifier
In multi-label classification problems, the types of association between labels are various; finding all combination rules is computationally expensive, the generalization of the learned classifier is poor, and the effect in practical popularization and application cannot be guaranteed. Therefore, it is critical to use a classifier that does not need to explain the label association rules explicitly but can learn well the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the present invention uses a neural network for training and learning. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to solve the imbalance problem of virtual asset data, in each round of ensemble learning the data are sampled according to the degree of imbalance between the data of different categories. The specific steps are as follows:
1) Feedforward neural network
The present invention uses a typical multi-layer feedforward neural network (Multi-Layer Feed-Forward Neural Network), composed of an input layer, a hidden layer, and an output layer, each layer containing several neurons. The input layer is responsible for receiving external input data, and the output layer produces the final output. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, while neurons within the same layer are generally not connected. During neural network training, the goal is to minimize the classification error. The present invention uses the typical gradient descent (Gradient Descent) algorithm to update and learn the connection weights and biases.
The present invention takes minimizing the classification error as its goal; the global error is computed by formula 1.
E = \sum_{i=1}^{m} E_i    (1)
where m is the number of samples in the training set and E_i is the error of the i-th training sample, computed as shown in formula 2.
E_i = \sum_{j=1}^{k} (c_j^i - d_j^i)^2    (2)
where k is the number of labels, c_j^i is the actual output of the neural network, and d_j^i is the desired output.
In the training of this embodiment, the parameters are set as follows:
The number of input layer neurons is the number of attributes of the virtual asset data, i.e. each input layer neuron corresponds to one attribute of the virtual asset data;
The number of output layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden layer neurons is determined using the Baum-Haussler rule [1], computed as shown in formula 3.
N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})    (3)
where N_{hidden} is the number of hidden layer neurons, N_{train} is the number of training samples, E_{tolerance} is the acceptable upper bound on the error of the neural network, and N_{inputs} and N_{outputs} denote the numbers of input and output neurons, respectively.
The activation function is the traditional Sigmoid function, whose form is shown in formula 4.
g(x) = \frac{1}{1 + e^{-x}}    (4)
The termination condition of learning can be reaching a maximum number of learning iterations, or the error between two adjacent learning iterations falling below a set threshold; it can be set flexibly according to the actual situation.
The output of the neural network is a continuous value, while the categories of virtual asset data are discrete. Therefore, we convert the output into the discrete value 0 or 1 by setting a threshold. The calculation is shown in formula 5.
f(out) = \begin{cases} 1, & out \ge c \\ 0, & out < c \end{cases}    (5)
where out is the original output of the neural network and c is a threshold set empirically through experiments. Only when the original output reaches the set threshold is the test sample marked with the corresponding label.
2) SMOTE method
The SMOTE method is based on the similarity of existing minority-class instances in feature space. For an instance x_i, its k nearest instances are selected each time; one of these k neighbors is then chosen at random, say x_knn, and a new instance is synthesized as x_syn = x_i + (x_knn - x_i) × α, where α is a random number between 0 and 1. This synthesis method both avoids the overfitting problem that may exist in random over-sampling and moves the decision boundary toward the majority class, thereby improving the classification precision of the minority class; it is widely used.
3) Ensemble learning for multi-label imbalanced data
In machine learning, it is often difficult for a single classifier to achieve both high precision and high generalization ability. Ensemble learning learns with a series of classifiers and integrates the results of the individual classifiers according to certain rules, thereby obtaining better learning results and generalization ability than a single classifier. The present invention takes the classic ensemble algorithm Bagging as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework. The concrete operation steps are as follows:
Step 1: Given the training sample set S = {(x_1, L_1), (x_2, L_2), …, (x_n, L_n)}, where D = {x_1, x_2, …, x_n} contains n samples, each sample has d attributes, and L = {l_1, l_2, …, l_m} contains m labels, perform random sampling with replacement k times to extract from the training sample set a training subset S' containing k samples.
Step 2: Count the occurrence frequency of each label L'_i in the training subset, denoted fre = {f_1, f_2, …, f_k}. Let f_max = max f_i (i = 1, …, k), and compute in turn the ratio f_i / f_max (i = 1, …, k) between each label frequency and the maximum frequency. If f_i / f_max ≤ Threshold, where Threshold is the minimum proportion threshold between label frequencies, the samples containing label L'_i are over-sampled. If f_i = 1, i.e. the number of samples containing label L'_i is 1, the SMOTE method described in (2) cannot be used to synthesize new samples, so a simple replication strategy is used, obtaining the replication set D'_1. If f_i > 1, the SMOTE method is used to over-sample the samples containing label L'_i, obtaining the over-sampling set D'_2. Finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' = {(x''_1, L''_1), (x''_2, L''_2), …, (x''_{k'}, L''_{k'})}, with x''_i ∈ D ∪ D'_1 ∪ D'_2 and L''_i ⊆ L, in which the class labels are approximately balanced.
Step 3: Normalize the set S'' as shown in formula 6, and then train the feedforward neural network described in 1) on the normalized data set.
x_i''^{*} = (x_i'' - \min) / (\max - \min)    (6)
Step 4: Repeat Step 1 to Step 3 to obtain N trained neural network models {NN_1, NN_2, …, NN_N}.
Step 5: For each sample x_t in the test set {x^t_1, x^t_2, …, x^t_r}, where r is the number of test samples, input it separately into all N trained neural network models and collect their outputs, obtaining the output matrix of size m × N
C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1N} \\ c_{21} & c_{22} & \cdots & c_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mN} \end{pmatrix},
where c_{ij} = 0 or 1, i = 1, …, m, j = 1, …, N. Each row of C represents the judgments of all N classifiers on whether the test sample x_t carries the corresponding label.
Step 6: Initialize the result label set Ω to the empty set, Ω = { }. Traverse the output matrix C row by row; if \sum_{j=1}^{N} c_{ij} > N/2, add the label l_i to the label set Ω, i.e. Ω = Ω ∪ {l_i}, otherwise do not add it. After the whole matrix C has been traversed, the class label set Ω of sample x_t is finally obtained.
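A small sketch of Steps 5 and 6 for a single test sample, assuming the 0/1 outputs of the N models are already available; the matrix values below are invented for illustration:

```python
import numpy as np

# Output matrix C for one test sample x_t: m = 4 labels (rows) x N = 5 models (columns).
C = np.array([
    [1, 1, 0, 1, 1],   # label l1: 4 of 5 models say "yes"
    [0, 0, 1, 0, 0],   # label l2: 1 of 5
    [1, 0, 1, 1, 0],   # label l3: 3 of 5
    [0, 0, 0, 0, 1],   # label l4: 1 of 5
])
N = C.shape[1]
label_names = ["l1", "l2", "l3", "l4"]

# Majority vote per row: keep label l_i when sum_j c_ij > N / 2 (Step 6).
omega = {label_names[i] for i in range(C.shape[0]) if C[i].sum() > N / 2}
print(omega)           # {'l1', 'l3'}
```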
Compared with the prior art, combining the ensemble learning method Bagging, the multi-layer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applying them to the classification of multi-label imbalanced virtual asset data, can effectively improve classification accuracy.
The above is an exemplary description of the present invention. Obviously, the implementation of the present invention is not limited by the above manner; as long as the various improvements made using the technical solution of the present invention, or the direct application of the design and technical solution of the present invention to other occasions without improvement, are adopted, they all fall within the scope of protection of the present invention.

Claims (3)

1. A multi-label imbalanced virtual asset data classification method based on ensemble learning, characterized in that it comprises the following steps: describing the virtual asset data storage architecture, and processing the multi-label imbalanced virtual asset data and constructing the classifier;
wherein the processing of the multi-label imbalanced virtual asset data and the construction of the classifier comprise: using a neural network for training and learning, combined with an ensemble learning method, with the neural network serving as the weak classifier in each round of learning; meanwhile, in each round of ensemble learning, the data are sampled according to the degree of imbalance between the data of different categories.
2. The multi-label imbalanced virtual asset data classification method based on ensemble learning according to claim 1, characterized in that the steps of processing the multi-label imbalanced virtual asset data and constructing the classifier comprise:
Step 1, feedforward neural network;
Step 2, SMOTE sampling method;
Step 3, ensemble learning of the multi-label imbalanced data.
3. The classifier construction according to claim 2, characterized in that the step of ensemble learning of the multi-label imbalanced data comprises the following steps:
1) after the training sample set S is given, random sampling with replacement is performed multiple times, and each time samples are drawn from the training sample set to form a training subset S';
2) the occurrence frequency of each label in the training subset is counted, and the ratio between each label frequency and the maximum frequency is computed in turn:
if this ratio does not exceed the minimum proportion threshold between label frequencies, the samples containing that label are over-sampled: if the occurrence frequency of a label is 1, i.e. the number of samples containing the label is 1, a simple replication strategy is used, obtaining a replication set; if the occurrence frequency is greater than 1, the SMOTE method is used to sample the samples containing the label, obtaining a sampling set;
finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' in which the class labels are approximately balanced;
3) the set S'' is normalized by the formula shown below, and the feedforward neural network is then trained on the normalized data set;
x_i''^{*} = (x_i'' - \min) / (\max - \min)
where x_i''^{*} is the i-th sample of the subset S'' after normalization, x_i'' is the i-th sample of the subset S'', and min and max are respectively the minimum and maximum values of the samples in this subset;
4) steps 1) to 3) are repeated to obtain the trained neural network models;
5) all test samples are separately input into the neural network models obtained in step 4), and their outputs are collected to obtain an output matrix;
6) a result label set is established and initialized to the empty set; the output matrix is traversed row by row, and by the majority voting principle the outputs of all classifiers are summed up; if the result exceeds half, the label is added to the label set, otherwise it is not added; after the whole matrix has been traversed, the final class label set of the sample is obtained.
CN201510130968.0A 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning Pending CN106156029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510130968.0A CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510130968.0A CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN106156029A true CN106156029A (en) 2016-11-23

Family

ID=57339484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510130968.0A Pending CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN106156029A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090097741A1 (en) * 2006-03-30 2009-04-16 Mantao Xu Smote algorithm with locally linear embedding
CN102945280A (en) * 2012-11-15 2013-02-27 翟云 Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method
CN103500205A (en) * 2013-09-29 2014-01-08 广西师范大学 Non-uniform big data classifying method
CN104102700A (en) * 2014-07-04 2014-10-15 华南理工大学 Categorizing method oriented to Internet unbalanced application flow
CN104091073A (en) * 2014-07-11 2014-10-08 中国人民解放军国防科学技术大学 Sampling method for unbalanced transaction data of fictitious assets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HU LI, PENG ZOU ET. AL: ""A Combination Method for Multi-Class Imbalanced Data Classification"", 《2013 10TH WEB INFORMATION SYSTERM AND APPLICATION CONFERENCE》 *
HU LI, PENG ZOU ET. AL: ""Ensemble Multi-Label Learning Based on Neural Network"", 《ICIMCS’13 PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107180155B (en) * 2017-04-17 2019-08-16 中国科学院计算技术研究所 A kind of disease forecasting system based on Manufacturing resource model
CN107180155A (en) * 2017-04-17 2017-09-19 中国科学院计算技术研究所 A kind of disease forecasting method and system based on Manufacturing resource model
CN109272003A (en) * 2017-07-17 2019-01-25 华东师范大学 A kind of method and apparatus for eliminating unknown error in deep learning model
CN107870321A (en) * 2017-11-03 2018-04-03 电子科技大学 Radar range profile's target identification method based on pseudo label study
CN107870321B (en) * 2017-11-03 2020-12-29 电子科技大学 Radar one-dimensional range profile target identification method based on pseudo-label learning
CN108153657A (en) * 2017-12-22 2018-06-12 北京交通大学 The method of large-scale data center server application Partition of role
CN110147804A (en) * 2018-05-25 2019-08-20 腾讯科技(深圳)有限公司 A kind of unbalanced data processing method, terminal and computer readable storage medium
CN110147804B (en) * 2018-05-25 2023-07-14 腾讯科技(深圳)有限公司 Unbalanced data processing method, terminal and computer readable storage medium
CN109190698B (en) * 2018-08-29 2022-02-11 西南大学 Classification and identification system and method for network digital virtual assets
CN109190698A (en) * 2018-08-29 2019-01-11 西南大学 A kind of classifying and identifying system and method for network digital fictitious assets
CN109033471A (en) * 2018-09-05 2018-12-18 中国信息安全测评中心 A kind of information assets recognition methods and device
CN109033471B (en) * 2018-09-05 2022-11-08 中国信息安全测评中心 Information asset identification method and device
CN111105238A (en) * 2019-11-07 2020-05-05 中国建设银行股份有限公司 Transaction risk control method and device
CN110968693A (en) * 2019-11-08 2020-04-07 华北电力大学 Multi-label text classification calculation method based on ensemble learning
CN112530595A (en) * 2020-12-21 2021-03-19 无锡市第二人民医院 Cardiovascular disease classification method and device based on multi-branch chain type neural network
CN113255831A (en) * 2021-06-23 2021-08-13 长沙海信智能系统研究院有限公司 Sample processing method, device, equipment and computer storage medium
CN113657446A (en) * 2021-07-13 2021-11-16 广东外语外贸大学 Processing method, system and storage medium of multi-label emotion classification model
CN118468151A (en) * 2024-06-28 2024-08-09 深圳市广通工程顾问有限公司 Automatic management method and system for classification of network digital virtual assets

Similar Documents

Publication Publication Date Title
CN106156029A (en) Multi-label imbalanced virtual asset data classification method based on ensemble learning
Gao et al. Discriminative learning of relaxed hierarchy for large-scale visual recognition
Shen et al. Multi-level discriminative dictionary learning with application to large scale image classification
Le et al. Probabilistic latent document network embedding
Tsai et al. Evolutionary instance selection for text classification
Rashedi et al. A hierarchical clusterer ensemble method based on boosting theory
Czarnowski et al. An approach to data reduction for learning from big datasets: Integrating stacking, rotation, and agent population learning techniques
Sun et al. Boosting ant colony optimization via solution prediction and machine learning
CN103412878B (en) Document theme partitioning method based on domain knowledge map community structure
Yang et al. Local label descriptor for example based semantic image labeling
Xue Semi‐supervised convolutional generative adversarial network for hyperspectral image classification
Li et al. Feature subset selection: a correlation‐based SVM filter approach
Hao et al. Class-wise dictionary learning for hyperspectral image classification
Nareshpalsingh et al. Multi-label classification methods: A comparative study
Xu et al. Remotely sensed image classification by complex network eigenvalue and connected degree
Shi et al. Multi-label classification based on multi-objective optimization
Qin et al. A novel factor analysis-based metric learning method for kinship verification
Ganji et al. Lagrangian constrained community detection
Lin et al. The distributed system for inverted multi-index visual retrieval
Deng et al. Differences help recognition: a probabilistic interpretation
Farhangi et al. Informative visual words construction to improve bag of words image representation
Siddiqua et al. Semantics-enhanced supervised deep autoencoder for depth image-based 3D model retrieval
Yu et al. Bag of Tricks and a Strong Baseline for FGVC.
Dimitrovski et al. Fast and efficient visual codebook construction for multi-label annotation using predictive clustering trees
Choi et al. Scene classification via hypergraph-based semantic attributes subnetworks identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province

Applicant after: National University of Defense Technology

Address before: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province

Applicant before: NATIONAL University OF DEFENSE TECHNOLOGY

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161123