CN106156029A - Multi-label imbalanced virtual asset data classification method based on ensemble learning - Google Patents

Multi-label imbalanced virtual asset data classification method based on ensemble learning Download PDF

Info

Publication number
CN106156029A
CN106156029A CN201510130968.0A CN201510130968A
Authority
CN
China
Prior art keywords
data
label
tag
sample
virtual assets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510130968.0A
Other languages
Chinese (zh)
Inventor
李虎
贾焰
韩伟红
李树栋
李爱平
周斌
杨树强
黄九鸣
全拥
邓璐
朱伟辉
傅翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201510130968.0A priority Critical patent/CN106156029A/en
Publication of CN106156029A publication Critical patent/CN106156029A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an ensemble-learning-based classification method for multi-label imbalanced virtual asset data, comprising the following steps: under the distributed storage architecture for virtual assets, random sampling with replacement is first performed on the virtual asset data; a feedforward neural network is then used to learn the multi-label data, so that the correlations between labels are captured in the trained connection weights of the network; meanwhile, SMOTE sampling is applied selectively according to the label distribution in the sampled data; finally, to improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Compared with the prior art, the present invention takes the classic ensemble algorithm Bagging as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework, which can effectively improve classification accuracy.

Description

Multi-label imbalanced virtual asset data classification method based on ensemble learning
Technical field
This technology belongs to the field of network and information security, and relates to a multi-label imbalanced virtual asset data classification method based on ensemble learning.
Background technology
The rapid development of the Internet has provided a broad platform for the generation and trading of virtual assets and has promoted the prosperity and development of online trading. However, both users conducting transactions and providers of virtual assets face the problem that virtual asset data (including virtual asset commodity information, related virtual asset transaction data, virtual asset operation logs, etc.) are numerous and disorganized. Classifying these virtual asset data can help people manage virtual assets better and effectively improve their usage efficiency.
At present, China has carried out research on eID-based cyberspace virtual asset management and escrow technology to achieve unified and standardized management of virtual assets. A virtual asset security system comprehensively and accurately records the virtual asset commodities themselves and the various operations performed on them. However, these data come in a wide variety of types, the information of different virtual assets differs, and the operation behavior patterns of users vary greatly, so classifying these virtual asset data faces many difficulties. In addition, the amount of virtual asset data differs considerably between categories; for example, abnormal transaction data are usually far fewer than normal transaction data, and abnormal transactions come in multiple possible forms, such as abnormal transaction time, abnormal transaction amount, and abnormal transaction frequency. Different anomalies may coexist, i.e. one virtual asset data record may belong to multiple categories, or be marked with multiple labels, at the same time. Given the multi-label nature of the data and the imbalance in data volume between categories, classifying virtual asset data faces many challenges.
In traditional classification problems, each sample belongs to only one category or carries only one label; this is single-label learning. However, as stated above, many samples in virtual asset data belong to multiple categories at the same time. Such problems can be attributed to multi-label learning. Its formal definition is as follows: assume the data set D = {x_1, x_2, …, x_n} contains n samples, each sample has d attributes, and the samples have m labels L = {l_1, l_2, …, l_m}. The multi-label classification problem is then regarded as follows: given training data with known category labels, i.e. {(x_1, L_1), (x_2, L_2), …, (x_n, L_n)} with L_i ⊆ L, construct a classifier f: D → 2^L such that a test set whose attributes are known but whose labels are unknown can be labeled correctly. When both the training data and the test data satisfy |L_i| = 1, the multi-label classification problem degenerates into a multi-class classification problem. In particular, when m = 2, the multi-label classification problem degenerates into the traditional two-class classification problem.
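As a toy illustration of the definition above (the values and label assignments are invented for illustration and are not part of the invention), the following Python sketch sets up a data set with n = 4 samples, d = 2 attributes, and m = 3 labels, where one sample carries two labels at once:

```python
# Toy data set matching the definition above (illustrative values only).
D = {                                   # n = 4 samples, each with d = 2 attributes
    "x1": (0.3, 120.0),
    "x2": (0.8, 45.0),
    "x3": (0.1, 990.0),
    "x4": (0.7, 60.0),
}
L = {"l1", "l2", "l3"}                  # m = 3 labels

label_sets = {                          # each L_i is a subset of L; x3 carries two labels
    "x1": {"l1"},
    "x2": {"l2"},
    "x3": {"l1", "l3"},
    "x4": {"l2"},
}

# When every |L_i| = 1 the task degenerates to ordinary multi-class classification.
print(all(len(Li) == 1 for Li in label_sets.values()))   # False: x3 has two labels
```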
In recent years, the multi-label classification problem has received extensive attention and research. Its solutions can mainly be divided into algorithm adaptation based methods (Algorithm Adaptation based methods) and problem transformation based methods (Problem Transformation based methods) [1].
Algorithm adaptation based methods mainly modify existing single-label learning algorithms so that they can handle multi-label data. The advantage of such methods is that, for a specific practical problem, a method focused on a particular algorithm outperforms algorithm-independent methods. This class of methods mainly includes decision trees, support vector machines (Support Vector Machines, SVMs), nearest neighbor methods (K-Nearest Neighbor, KNN), and neural networks. Clare et al. [2] use decision trees to solve the multi-label classification problem, extending the definition of entropy from single-label learning to multi-label classification. Elisseeff et al. [3] propose a kernel-based SVM method for multi-label classification. For the traditional kNN method, Zhang et al. [4] propose MLkNN: the prior probability of each label is first computed; when a data point x to be classified is input, for each label l in the label set L the probabilities that x has label l and does not have label l are computed, and then whether x should be marked with label l is predicted by combining them. On this basis, Zhang et al. [5] further propose a multi-label lazy learning classification algorithm, ML-kNN. Zhang et al. [6] define a global optimization function for multi-label data so that artificial neural networks can handle multi-label data; the basic idea is that if many samples carry two labels simultaneously, then when one of the two labels appears, the other is likely to appear as well. However, such methods require a deep understanding of the corresponding algorithm and application-domain knowledge, which is difficult for ordinary users; their scope of application is limited and they are hard to popularize.
Problem transformation based methods, on the other hand, convert the multi-label classification problem into a group of single-label classification problems, so that existing single-label learning methods can be used to solve it. Because such methods can directly use various existing mature methods and are easy to understand, they have been widely applied. A typical multi-label classification method based on problem transformation is binary relevance (Binary Relevance, BR), which treats the prediction of each label as an independent single-label classification problem, trains an independent classifier for each label, and then trains each classifier with all the training data. However, this method ignores the correlations between labels, and its performance is still unsatisfactory. Shen et al. [7] improve BR through copy and weighted-copy methods, splitting the multi-label data in the original training set into multiple single-label data and assigning corresponding weights. Schapire and Singer [8] propose the AdaBoost-based BoosTexter method for multi-label text classification, which in each round gives larger weights to the samples that are harder to classify. McCallum [9] proposes a Bayesian method for multi-label classification, learned with expectation maximization (Expectation Maximization, EM). Ueda and Saito [10] propose two generative models for multi-label classification, PMM1 and PMM2. Read [11] proposes the LP (Label Powerset) method, which binary-encodes every label combination in the training data to form new labels; this method converts multi-label data into single-label data by encoding, but its algorithmic complexity is high. Schapire and Singer [12] propose the AdaBoost.MH algorithm for multi-label data based on the single-label method AdaBoost.M1; this algorithm generates q new single-label training instances for each multi-label training instance, which increases the amount of training data and therefore the model training time. All of these methods involve only a single algorithm. Tsoumakas et al. [13], after summarizing existing algorithms, propose a two-layer multi-label classification model that uses BR, decision trees, or SVM with K-fold cross-training in the first layer and classifies further with the BR or SVM algorithm in the second layer, achieving good results.
In the above studies, when the multi-label problem is decomposed into a group of single-label problems, the labels are simply assumed to be independent of each other, i.e. training is performed on single-label data each time, and the correlations between labels are seldom considered. In real virtual asset data, however, correlations between labels are common and take many forms. In addition, most existing algorithms are based on the assumption that the sample sizes of different categories are roughly equal, but in many data sets, including virtual asset data, imbalance between categories (labels) is common. Furthermore, some studies simply use a single classifier or a small number of classifiers to train on multi-label data, so the generalization performance suffers to some extent.
[1]Tsoumakas G and Katakis I.Multi-label classification:An overview[J]. International Journal of Data Warehousing and Mining(IJDWM),2007,3(3):1-13.
[2]Clare A and King R D.Knowledge discovery in multi-label phenotype data[M]//Principles of data mining and knowledge discovery.Springer Berlin Heidelberg,2001:42-53.
[3]Elisseeff A and Weston J.A kernel method for multi-labelled classification[C]//Advances in neural information processing systems,2001: 681-687.
[4]Zhang M L and Zhou Z H.A k-nearest neighbor based algorithm for multi-label classification[C]//Granular Computing,2005IEEE International Conference on.IEEE,2005,2:718-721.
[5]Zhang M L and Zhou Z H.ML-KNN:A lazy learning approach to multi-label learning[J].Pattern recognition,2007,40(7):2038-2048.
[6]Zhang M L and Zhou Z H.Multilabel neural networks with applications to functional genomics and text categorization[J].Knowledge and Data Engineering, IEEE Transactions on,2006,18(10):1338-1351.
[7]Shen X,Boutell M,Luo J,et al.Multilabel machine learning and its application to semantic scene classification[C]//Electronic Imaging 2004. International Society for Optics and Photonics,2003:188-199.
[8]Schapire R.E.and Singer Y.BoosTexter:a boosting-based system for text categorization[J].Machine Learning,2000,39(2-3):135-168.
[9]McCallum A.Multi-label text classification with a mixture model trained by EM[C]//AAAI’99 Workshop on Text Learning,1999:1-7
[10]Ueda N.and Saito.K.Parametric mixture models for multi-label text[C]//Advances in neural information processing systems,2002:721-728.
[11]Read J.A pruned problem transformation method for multi-label classification[C]//Proc.2008 New Zealand Computer Science Research Student Conference(NZCSRS 2008),2008:143-150.
[12]Schapire R.E.and Singer Y.Improved boosting algorithms using confidence-rated predictions[J].Machine learning,1999,37(3):297-336.
[13]Tsoumakas G,Dimou A,Spyromitros E,et al.Correlation-based pruning of stacked binary relevance models for multi-label learning[C]//Proceedings of the 1st International Workshop on Learning from Multi-Label Data,2009:101-116.
Content of the invention
In view of the above technical problems, the present invention proposes a multi-label imbalanced virtual asset data classification method based on ensemble learning to classify virtual asset data. It is applicable to classifying the numerous and disorganized virtual asset data on the Internet, and is particularly suited to classifying multi-label virtual asset data with imbalance between categories.
The technical solution of the present invention includes: the description of the virtual asset data storage architecture, the processing of multi-label imbalanced virtual asset data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, including components such as the organization and management of massive multi-structured data, the query processing of massive multi-structured data, service publication, and the programming interface.
The bottom layer of the system is deployed on a traditional distributed computing environment, and a distributed file system provides transparent access to the file data on each node in the distributed computing environment. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file systems or data, where the unified management of files or data is carried out by the data organization and data management modules. In addition, it also includes the deployment and configuration management of different data/files in the underlying distributed computing environment.
The query processing subsystem for massive multi-structured data is oriented toward massive personal identity/attribute information retrieval applications and supports efficient query processing of multi-structured data, including modules such as the complex data model and the mixed data operation mode. The present invention mainly targets the log analysis and mining module therein, and aims to use data mining technology to rapidly and efficiently detect abnormal behaviors in virtual asset transactions.
The service publication, customization, and programming interface subsystem is the external interface of the system. It defines programming interfaces over the data in a service-oriented manner, supports SQL queries on structured data and API and SQL-like queries on unstructured data, and allows users to customize the personal information query service interface by means of service interface customization. The present invention can also use the data access interface provided by the system to query and analyze virtual asset transaction data. When the present invention is applied in practice, log mining and analysis can be performed, data query and analysis can be performed through the data interface, or the two approaches can be combined; the most suitable mode can be chosen according to the practical problem at hand.
2. Processing of multi-label imbalanced virtual asset data and construction of the classifier
In multi-label classification problems, the types of association between labels are various; finding all combination rules is computationally expensive, the generalization of the learned classifier is poor, and the effect in practical popularization and application cannot be guaranteed. Therefore, it is crucial to use a classifier that does not need to explain the label association rules explicitly but is good at learning the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the present invention uses a neural network for training and learning. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to solve the imbalance problem of virtual asset data, in each round of ensemble learning the data are sampled according to the degree of imbalance between the data of different categories. The specific steps are as follows:
(1) Feedforward neural network
The present invention uses a typical multi-layer feedforward neural network (Multi-Layer Feed-Forward Neural Network), composed of an input layer, a hidden layer, and an output layer, each layer containing several neurons. The input layer is responsible for receiving external input data, and the output layer produces the final output. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, while neurons within the same layer are generally not connected. During neural network training, the goal is to minimize the classification error. The present invention uses the typical gradient descent (Gradient Descent) algorithm to update and learn the connection weights and biases.
The present invention takes minimizing the classification error as its goal; the global error is computed by formula 1.
E = \sum_{i=1}^{m} E_i    (1)
where m is the number of samples in the training set and E_i is the error of the i-th training sample, computed as shown in formula 2.
E_i = \sum_{j=1}^{k} (c_j^i - d_j^i)^2    (2)
where k is the number of labels, c_j^i is the actual output of the neural network, and d_j^i is the desired output.
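As a quick numeric check of formulas 1 and 2 (the outputs and targets below are made-up values, purely for illustration):

```python
# Per-sample squared error (formula 2) and global error (formula 1) for m = 2 samples
# with k = 3 labels each; c are network outputs, d are the desired 0/1 targets.
c = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
d = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

E_i = [sum((cj - dj) ** 2 for cj, dj in zip(ci, di)) for ci, di in zip(c, d)]
E = sum(E_i)
print(E_i, E)   # approximately [0.14, 0.21] and 0.35
```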
The specific parameters are set as follows:
The number of input layer neurons is the number of attributes of the virtual asset data, i.e. each input layer neuron corresponds to one attribute of the virtual asset data;
The number of output layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden layer neurons is determined using the Baum-Haussler rule [1], computed as shown in formula 3.
N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})    (3)
where N_{hidden} is the number of hidden layer neurons, N_{train} is the number of training samples, E_{tolerance} is the acceptable upper bound on the error of the neural network, and N_{inputs} and N_{outputs} denote the numbers of input and output neurons, respectively.
The activation function is the traditional Sigmoid function, whose form is shown in formula 4.
g(x) = \frac{1}{1 + e^{-x}}    (4)
The termination condition of learning can be reaching a maximum number of learning iterations, or the error between two adjacent learning iterations falling below a set threshold; it can be set flexibly according to the actual situation.
The output of the neural network is a continuous value, while the categories of virtual asset data are discrete. Therefore, we convert the output into the discrete value 0 or 1 by setting a threshold. The calculation is shown in formula 5.
f(out) = \begin{cases} 1, & out \ge c \\ 0, & out < c \end{cases}    (5)
where out is the original output of the neural network and c is a threshold set empirically through experiments. Only when the original output reaches the set threshold is the test sample marked with the corresponding label.
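For illustration only, the following is a minimal NumPy sketch of the network structure described above: the Baum-Haussler sizing of formula 3, the Sigmoid activation of formula 4, a single-hidden-layer forward pass, and the thresholding of formula 5. The function names, the random weights, the tolerance value, and the threshold c = 0.5 are illustrative assumptions, not values prescribed by the invention; training by gradient descent is omitted.

```python
import numpy as np

def hidden_layer_size(n_train, n_inputs, n_outputs, e_tolerance=0.1):
    """Baum-Haussler upper bound on the number of hidden neurons (formula 3)."""
    return max(1, int((n_train * e_tolerance) / (n_inputs + n_outputs)))

def sigmoid(x):
    """Activation function g(x) = 1 / (1 + exp(-x)) (formula 4)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: attribute vector in, one continuous score per label out."""
    h = sigmoid(x @ w_hidden + b_hidden)
    return sigmoid(h @ w_out + b_out)

def to_labels(out, c=0.5):
    """Threshold the continuous outputs into 0/1 label indicators (formula 5)."""
    return (out >= c).astype(int)

# Toy setup: d = 6 attributes, m = 3 labels, randomly initialised (untrained) weights.
rng = np.random.default_rng(0)
d, m = 6, 3
n_hidden = hidden_layer_size(n_train=200, n_inputs=d, n_outputs=m)
w_h, b_h = rng.normal(size=(d, n_hidden)), np.zeros(n_hidden)
w_o, b_o = rng.normal(size=(n_hidden, m)), np.zeros(m)

x = rng.random(d)                                   # one normalised data record
print(to_labels(forward(x, w_h, b_h, w_o, b_o)))    # e.g. array([1, 0, 1])
```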
(2) SMOTE sampling method
SMOTE (Synthetic Minority Over-sampling Technique) [2] is a classic data balancing method that reduces the imbalance between data by synthesizing new minority-class samples. SMOTE is based on the similarity of existing minority-class instances in feature space. For an instance x_i, its k nearest instances are selected each time; one of these k neighbors is then chosen at random, say x_knn, and a new instance is synthesized as x_syn = x_i + (x_knn - x_i) × α, where α is a random number between 0 and 1. This synthesis method both avoids the overfitting problem that may exist in random over-sampling and moves the decision boundary toward the majority class, thereby improving the classification precision of the minority class; it is widely used.
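A minimal from-scratch sketch of the SMOTE synthesis step just described, assuming the minority-class samples are rows of a NumPy array; the function name, the k = 5 default, and the toy data are assumptions for illustration, and this is not the reference implementation of [2]:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Synthesise n_new samples: pick a random minority sample x_i, one of its k
    nearest minority neighbours x_knn, and return x_i + alpha * (x_knn - x_i)."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x_i = minority[i]
        dist = np.linalg.norm(minority - x_i, axis=1)   # distances to all carriers
        neighbours = np.argsort(dist)[1:k + 1]          # k nearest, excluding x_i
        x_knn = minority[rng.choice(neighbours)]
        alpha = rng.random()                            # random number in (0, 1)
        synthetic.append(x_i + alpha * (x_knn - x_i))
    return np.array(synthetic)

minority = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]])
print(smote(minority, n_new=3, k=2, rng=np.random.default_rng(1)))
```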
(3) Ensemble learning for multi-label imbalanced data
In machine learning, it is often difficult for a single classifier to achieve both high precision and high generalization ability. Ensemble learning learns with a series of classifiers and integrates the results of the individual classifiers according to certain rules, thereby obtaining better learning results and generalization ability than a single classifier. The present invention takes the classic ensemble algorithm Bagging [3] as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework. The specific algorithm steps are as follows:
Step 1: Given the training sample set, perform random sampling with replacement k times to extract from the training sample set a training subset S' containing k samples.
Step 2: Count the occurrence frequency f_i of each label L'_i in the training subset, and compute in turn the ratio f_i / f_max (i = 1, …, k) between each label frequency and the maximum frequency. If f_i / f_max does not exceed the minimum proportion threshold between label frequencies, the samples containing label L'_i are over-sampled: if f_i = 1, i.e. the number of samples containing label L'_i is 1, the SMOTE method cannot be used to synthesize new samples, so a simple replication strategy is used, obtaining a replication set; if f_i > 1, the SMOTE method is used to sample the samples containing label L'_i, obtaining a sampling set. Finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' in which the class labels are approximately balanced.
Step 3: Normalize the set S'' as shown in formula 6, and then train the feedforward neural network described in (1) on the normalized data set.
x_i''^{*} = (x_i'' - \min) / (\max - \min)    (6)
Step 4: Repeat Step 1 to Step 3 to obtain N trained neural network models.
Step 5: For each sample x_t in the test set {x^t_1, x^t_2, …, x^t_r}, where r is the number of test samples, input it separately into all N trained neural network models and collect their outputs, obtaining an output matrix C of size m × N.
Step 6: Initialize the result label set Ω to the empty set. Traverse the output matrix C row by row; if \sum_{j=1}^{N} c_{ij} > N/2, i.e. more than half of the N classifiers output 1 for label l_i, add the label l_i to the label set Ω, otherwise do not add it. After the whole matrix C has been traversed, the class label set Ω of the sample x_t is obtained.
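To make the whole Bagging + SMOTE + neural-network loop concrete, the sketch below strings Steps 1-6 together on toy data. It is a hedged illustration, not the invention's implementation: scikit-learn's MLPClassifier stands in for the hand-written feedforward network of (1), the per-label rebalancing is a simplified replication/interpolation stand-in for the Step 2 logic, and the threshold 0.5, the ensemble size N = 5, and the data generation are invented values.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def balance_subset(X, Y, threshold=0.5):
    """Step 2 (simplified): labels whose frequency ratio f_i / f_max is at most
    `threshold` get extra samples, by replication when only one carrier exists,
    otherwise by SMOTE-style interpolation between two carriers of that label."""
    freq = Y.sum(axis=0)
    f_max = freq.max()
    X_parts, Y_parts = [X], [Y]
    for i, f in enumerate(freq):
        if f == 0 or f / f_max > threshold:
            continue                                   # label frequent enough
        idx = np.flatnonzero(Y[:, i])
        n_extra = int(f_max * threshold) - int(f)
        for _ in range(max(n_extra, 0)):
            j = rng.choice(idx)
            if len(idx) == 1:                          # single carrier: replicate
                x_syn = X[j].copy()
            else:                                      # SMOTE-style synthesis
                x_syn = X[j] + rng.random() * (X[rng.choice(idx)] - X[j])
            X_parts.append(x_syn[None, :])
            Y_parts.append(Y[j][None, :])
    return np.vstack(X_parts), np.vstack(Y_parts)

# Toy multi-label data: 200 samples, 6 attributes, 3 labels (the third label is rare).
X = rng.random((200, 6))
Y = np.stack([X[:, 0] > 0.3, X[:, 1] > 0.5, X[:, 2] > 0.9], axis=1).astype(int)

N, models = 5, []
for _ in range(N):                                     # Steps 1-4
    boot = rng.integers(0, len(X), size=len(X))        # bootstrap with replacement
    Xb, Yb = balance_subset(X[boot], Y[boot])
    Xb = (Xb - Xb.min(axis=0)) / (Xb.max(axis=0) - Xb.min(axis=0) + 1e-12)  # formula 6
    net = MLPClassifier(hidden_layer_sizes=(8,), activation='logistic', max_iter=800)
    models.append(net.fit(Xb, Yb))

# Steps 5-6: collect the N 0/1 predictions per label and keep labels with a majority.
x_test = X[:3]                                         # already lies in [0, 1] here
votes = np.sum([m.predict(x_test) for m in models], axis=0)
print((votes > N / 2).astype(int))                     # predicted label sets
```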
Compared with the prior art, the present patent application combines the ensemble learning method Bagging, the multi-layer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applies them to the classification of multi-label imbalanced virtual asset data, which can effectively improve classification accuracy.
[1]Baum E B,Haussler D.What size net gives valid generalization?[J].Neural computation,1989,1(1):151-160.
[2]Chawla N V,Bowyer K W,Hall L O,et al.SMOTE:synthetic minority over-sampling technique[J].Journal of artificial intelligence research,2002,16(1): 321-357.
[3]http://en.wikipedia.org/wiki/Bootstrap_aggregating.
Brief description of the drawings
Fig. 1 Architecture diagram of the massive multi-structured virtual asset data management system
Fig. 2 Algorithm flow chart
Detailed description of the invention
The technical solution of the present invention is further illustrated below through specific embodiments:
The technical solution of the present invention includes: the description of the virtual asset storage architecture, the sampling of imbalanced transaction data, and the construction of the classifier.
1. Description of the virtual asset storage architecture
Virtual asset storage uses a distributed architecture, whose overall structure is shown in Fig. 1. The bottom layer of the system is deployed on a traditional distributed computing environment, and a distributed file system provides transparent access to the file data on each node in the distributed computing environment. The distributed computing nodes comprise 170 high-performance servers (two Intel Xeon E5640, 2.66 GHz; 16 GB DDR3 memory; two PCI-Express cards; redundant power supplies and fans), each server with one built-in 1 TB disk. To improve the stability and bandwidth of the network, two sets of networks are configured, and the network system is formed by connecting ten 48-port gigabit switches. In addition, to strengthen disaster recovery and backup capability, the system also includes 8 disk arrays, 800 1 TB hard disks, 48 disk enclosures, 32 RAID cards, and 8 SAN switches. On top of the distributed file system, the organization and management subsystem for massive multi-structured data is responsible for the unified management of the distributed file systems or data, where the unified management of files or data is carried out by the data organization and data management modules.
The query processing subsystem for massive multi-structured data is oriented toward massive personal identity/attribute information retrieval applications and supports efficient query processing of multi-structured data, including modules such as the complex data model and the mixed data operation mode. The present invention mainly targets the log analysis and mining module therein, and aims to use data mining technology to rapidly and efficiently detect abnormal behaviors in virtual asset transactions.
The service publication, customization, and programming interface subsystem is the external interface of the system. It defines programming interfaces over the data in a service-oriented manner, supports SQL queries on structured data and API and SQL-like queries on unstructured data, and allows users to customize the personal information query service interface by means of service interface customization. The present invention can also use the data access interface provided by the system to query and analyze virtual asset transaction data. When the present invention is applied in practice, log mining and analysis can be performed, data query and analysis can be performed through the data interface, or the two approaches can be combined; the most suitable mode is chosen according to the practical problem at hand.
2. Sampling of imbalanced transaction data and construction of the classifier
In multi-label classification problems, the types of association between labels are various; finding all combination rules is computationally expensive, the generalization of the learned classifier is poor, and the effect in practical popularization and application cannot be guaranteed. Therefore, it is critical to use a classifier that does not need to explain the label association rules explicitly but can learn well the correspondence between samples and labels and the correlations between different labels. After comprehensive analysis, the present invention uses a neural network for training and learning. To improve the generalization performance of the classifier, an ensemble learning method is used, with a neural network serving as the weak classifier in each round of learning. Meanwhile, to solve the imbalance problem of virtual asset data, in each round of ensemble learning the data are sampled according to the degree of imbalance between the data of different categories. The specific steps are as follows:
1) Feedforward neural network
The present invention uses a typical multi-layer feedforward neural network (Multi-Layer Feed-Forward Neural Network), composed of an input layer, a hidden layer, and an output layer, each layer containing several neurons. The input layer is responsible for receiving external input data, and the output layer produces the final output. The hidden layer mainly serves as the memory of the neural network. In a feedforward neural network, the neurons of each layer are connected to all neurons of the next layer, while neurons within the same layer are generally not connected. During neural network training, the goal is to minimize the classification error. The present invention uses the typical gradient descent (Gradient Descent) algorithm to update and learn the connection weights and biases.
The present invention takes minimizing the classification error as its goal; the global error is computed by formula 1.
E = \sum_{i=1}^{m} E_i    (1)
where m is the number of samples in the training set and E_i is the error of the i-th training sample, computed as shown in formula 2.
E_i = \sum_{j=1}^{k} (c_j^i - d_j^i)^2    (2)
where k is the number of labels, c_j^i is the actual output of the neural network, and d_j^i is the desired output.
In the training of this embodiment, the parameters are set as follows:
The number of input layer neurons is the number of attributes of the virtual asset data, i.e. each input layer neuron corresponds to one attribute of the virtual asset data;
The number of output layer neurons is the number of virtual asset class labels, i.e. each output neuron corresponds to one label;
The number of hidden layer neurons is determined using the Baum-Haussler rule [1], computed as shown in formula 3.
N_{hidden} \le (N_{train} \cdot E_{tolerance}) / (N_{inputs} + N_{outputs})    (3)
where N_{hidden} is the number of hidden layer neurons, N_{train} is the number of training samples, E_{tolerance} is the acceptable upper bound on the error of the neural network, and N_{inputs} and N_{outputs} denote the numbers of input and output neurons, respectively.
The activation function is the traditional Sigmoid function, whose form is shown in formula 4.
g(x) = \frac{1}{1 + e^{-x}}    (4)
The termination condition of learning can be reaching a maximum number of learning iterations, or the error between two adjacent learning iterations falling below a set threshold; it can be set flexibly according to the actual situation.
The output of the neural network is a continuous value, while the categories of virtual asset data are discrete. Therefore, we convert the output into the discrete value 0 or 1 by setting a threshold. The calculation is shown in formula 5.
f(out) = \begin{cases} 1, & out \ge c \\ 0, & out < c \end{cases}    (5)
where out is the original output of the neural network and c is a threshold set empirically through experiments. Only when the original output reaches the set threshold is the test sample marked with the corresponding label.
2) SMOTE method
The SMOTE method is based on the similarity of existing minority-class instances in feature space. For an instance x_i, its k nearest instances are selected each time; one of these k neighbors is then chosen at random, say x_knn, and a new instance is synthesized as x_syn = x_i + (x_knn - x_i) × α, where α is a random number between 0 and 1. This synthesis method both avoids the overfitting problem that may exist in random over-sampling and moves the decision boundary toward the majority class, thereby improving the classification precision of the minority class; it is widely used.
3) Ensemble learning for multi-label imbalanced data
In machine learning, it is often difficult for a single classifier to achieve both high precision and high generalization ability. Ensemble learning learns with a series of classifiers and integrates the results of the individual classifiers according to certain rules, thereby obtaining better learning results and generalization ability than a single classifier. The present invention takes the classic ensemble algorithm Bagging as its framework and, according to the characteristics of imbalanced virtual asset data, fuses the feedforward neural network and the SMOTE sampling technique into the ensemble framework. The concrete operation steps are as follows:
Step 1: Given the training sample set S = {(x_1, L_1), (x_2, L_2), …, (x_n, L_n)}, where D = {x_1, x_2, …, x_n} contains n samples, each sample has d attributes, and L = {l_1, l_2, …, l_m} contains m labels, perform random sampling with replacement k times to extract from the training sample set a training subset S' containing k samples.
Step 2: Count the occurrence frequency of each label L'_i in the training subset, denoted fre = {f_1, f_2, …, f_k}. Let f_max = max f_i (i = 1, …, k), and compute in turn the ratio f_i / f_max (i = 1, …, k) between each label frequency and the maximum frequency. If f_i / f_max ≤ Threshold, where Threshold is the minimum proportion threshold between label frequencies, the samples containing label L'_i are over-sampled. If f_i = 1, i.e. the number of samples containing label L'_i is 1, the SMOTE method described in (2) cannot be used to synthesize new samples, so a simple replication strategy is used, obtaining the replication set D'_1. If f_i > 1, the SMOTE method is used to over-sample the samples containing label L'_i, obtaining the over-sampling set D'_2. Finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' = {(x''_1, L''_1), (x''_2, L''_2), …, (x''_{k'}, L''_{k'})}, with x''_i ∈ D ∪ D'_1 ∪ D'_2 and L''_i ⊆ L, in which the class labels are approximately balanced.
Step 3: Normalize the set S'' as shown in formula 6, and then train the feedforward neural network described in 1) on the normalized data set.
x_i''^{*} = (x_i'' - \min) / (\max - \min)    (6)
Step 4: Repeat Step 1 to Step 3 to obtain N trained neural network models {NN_1, NN_2, …, NN_N}.
Step 5: For each sample x_t in the test set {x^t_1, x^t_2, …, x^t_r}, where r is the number of test samples, input it separately into all N trained neural network models and collect their outputs, obtaining the output matrix of size m × N
C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1N} \\ c_{21} & c_{22} & \cdots & c_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mN} \end{pmatrix},
where c_{ij} = 0 or 1, i = 1, …, m, j = 1, …, N. Each row of C represents the judgments of all N classifiers on whether the test sample x_t carries the corresponding label.
Step 6: Initialize the result label set Ω to the empty set, Ω = { }. Traverse the output matrix C row by row; if \sum_{j=1}^{N} c_{ij} > N/2, add the label l_i to the label set Ω, i.e. Ω = Ω ∪ {l_i}, otherwise do not add it. After the whole matrix C has been traversed, the class label set Ω of sample x_t is finally obtained.
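A small sketch of Steps 5 and 6 for a single test sample, assuming the 0/1 outputs of the N models are already available; the matrix values below are invented for illustration:

```python
import numpy as np

# Output matrix C for one test sample x_t: m = 4 labels (rows) x N = 5 models (columns).
C = np.array([
    [1, 1, 0, 1, 1],   # label l1: 4 of 5 models say "yes"
    [0, 0, 1, 0, 0],   # label l2: 1 of 5
    [1, 0, 1, 1, 0],   # label l3: 3 of 5
    [0, 0, 0, 0, 1],   # label l4: 1 of 5
])
N = C.shape[1]
label_names = ["l1", "l2", "l3", "l4"]

# Majority vote per row: keep label l_i when sum_j c_ij > N / 2 (Step 6).
omega = {label_names[i] for i in range(C.shape[0]) if C[i].sum() > N / 2}
print(omega)           # {'l1', 'l3'}
```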
Compared with the prior art, combining the ensemble learning method Bagging, the multi-layer feedforward neural network, and the imbalanced-data sampling technique SMOTE, and applying them to the classification of multi-label imbalanced virtual asset data, can effectively improve classification accuracy.
The above is an exemplary description of the present invention. Obviously, the implementation of the present invention is not limited by the above manner; as long as the various improvements made using the technical solution of the present invention, or the direct application of the design and technical solution of the present invention to other occasions without improvement, are adopted, they all fall within the scope of protection of the present invention.

Claims (3)

1. A multi-label imbalanced virtual asset data classification method based on ensemble learning, characterized in that it comprises the following steps: describing the virtual asset data storage architecture, and processing the multi-label imbalanced virtual asset data and constructing the classifier;
wherein the processing of the multi-label imbalanced virtual asset data and the construction of the classifier comprise: using a neural network for training and learning, combined with an ensemble learning method, with the neural network serving as the weak classifier in each round of learning; meanwhile, in each round of ensemble learning, the data are sampled according to the degree of imbalance between the data of different categories.
2. The multi-label imbalanced virtual asset data classification method based on ensemble learning according to claim 1, characterized in that the steps of processing the multi-label imbalanced virtual asset data and constructing the classifier comprise:
Step 1, feedforward neural network;
Step 2, SMOTE sampling method;
Step 3, ensemble learning of the multi-label imbalanced data.
3. The classifier construction according to claim 2, characterized in that the step of ensemble learning of the multi-label imbalanced data comprises the following steps:
1) after the training sample set S is given, random sampling with replacement is performed multiple times, and each time samples are drawn from the training sample set to form a training subset S';
2) the occurrence frequency of each label in the training subset is counted, and the ratio between each label frequency and the maximum frequency is computed in turn:
if this ratio does not exceed the minimum proportion threshold between label frequencies, the samples containing that label are over-sampled: if the occurrence frequency of a label is 1, i.e. the number of samples containing the label is 1, a simple replication strategy is used, obtaining a replication set; if the occurrence frequency is greater than 1, the SMOTE method is used to sample the samples containing the label, obtaining a sampling set;
finally, the original training samples and the data obtained by replication and sampling are merged to obtain a training subset S'' in which the class labels are approximately balanced;
3) the set S'' is normalized by the formula shown below, and the feedforward neural network is then trained on the normalized data set;
x_i''^{*} = (x_i'' - \min) / (\max - \min)
where x_i''^{*} is the i-th sample of the subset S'' after normalization, x_i'' is the i-th sample of the subset S'', and min and max are respectively the minimum and maximum values of the samples in this subset;
4) steps 1) to 3) are repeated to obtain the trained neural network models;
5) all test samples are separately input into the neural network models obtained in step 4), and their outputs are collected to obtain an output matrix;
6) a result label set is established and initialized to the empty set; the output matrix is traversed row by row, and by the majority voting principle the outputs of all classifiers are summed up; if the result exceeds half, the label is added to the label set, otherwise it is not added; after the whole matrix has been traversed, the final class label set of the sample is obtained.
CN201510130968.0A 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning Pending CN106156029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510130968.0A CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510130968.0A CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN106156029A true CN106156029A (en) 2016-11-23

Family

ID=57339484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510130968.0A Pending CN106156029A (en) 2015-03-24 2015-03-24 Multi-label imbalanced virtual asset data classification method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN106156029A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090097741A1 (en) * 2006-03-30 2009-04-16 Mantao Xu Smote algorithm with locally linear embedding
CN102945280A (en) * 2012-11-15 2013-02-27 翟云 Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method
CN103500205A (en) * 2013-09-29 2014-01-08 广西师范大学 Non-uniform big data classifying method
CN104102700A (en) * 2014-07-04 2014-10-15 华南理工大学 Categorizing method oriented to Internet unbalanced application flow
CN104091073A (en) * 2014-07-11 2014-10-08 中国人民解放军国防科学技术大学 Sampling method for unbalanced transaction data of fictitious assets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HU LI, PENG ZOU ET. AL: ""A Combination Method for Multi-Class Imbalanced Data Classification"", 《2013 10TH WEB INFORMATION SYSTERM AND APPLICATION CONFERENCE》 *
HU LI, PENG ZOU ET. AL: ""Ensemble Multi-Label Learning Based on Neural Network"", 《ICIMCS’13 PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107180155B (en) * 2017-04-17 2019-08-16 中国科学院计算技术研究所 A kind of disease forecasting system based on Manufacturing resource model
CN107180155A (en) * 2017-04-17 2017-09-19 中国科学院计算技术研究所 A kind of disease forecasting method and system based on Manufacturing resource model
CN109272003A (en) * 2017-07-17 2019-01-25 华东师范大学 A kind of method and apparatus for eliminating unknown error in deep learning model
CN107870321A (en) * 2017-11-03 2018-04-03 电子科技大学 Radar range profile's target identification method based on pseudo label study
CN107870321B (en) * 2017-11-03 2020-12-29 电子科技大学 Radar one-dimensional range profile target identification method based on pseudo-label learning
CN108153657A (en) * 2017-12-22 2018-06-12 北京交通大学 The method of large-scale data center server application Partition of role
CN110147804A (en) * 2018-05-25 2019-08-20 腾讯科技(深圳)有限公司 A kind of unbalanced data processing method, terminal and computer readable storage medium
CN110147804B (en) * 2018-05-25 2023-07-14 腾讯科技(深圳)有限公司 Unbalanced data processing method, terminal and computer readable storage medium
CN109190698B (en) * 2018-08-29 2022-02-11 西南大学 Classification and identification system and method for network digital virtual assets
CN109190698A (en) * 2018-08-29 2019-01-11 西南大学 A kind of classifying and identifying system and method for network digital fictitious assets
CN109033471A (en) * 2018-09-05 2018-12-18 中国信息安全测评中心 A kind of information assets recognition methods and device
CN109033471B (en) * 2018-09-05 2022-11-08 中国信息安全测评中心 Information asset identification method and device
CN111105238A (en) * 2019-11-07 2020-05-05 中国建设银行股份有限公司 Transaction risk control method and device
CN110968693A (en) * 2019-11-08 2020-04-07 华北电力大学 Multi-label text classification calculation method based on ensemble learning
CN112530595A (en) * 2020-12-21 2021-03-19 无锡市第二人民医院 Cardiovascular disease classification method and device based on multi-branch chain type neural network
CN113255831A (en) * 2021-06-23 2021-08-13 长沙海信智能系统研究院有限公司 Sample processing method, device, equipment and computer storage medium
CN113657446A (en) * 2021-07-13 2021-11-16 广东外语外贸大学 Processing method, system and storage medium of multi-label emotion classification model
CN118468151A (en) * 2024-06-28 2024-08-09 深圳市广通工程顾问有限公司 Automatic management method and system for classification of network digital virtual assets

Similar Documents

Publication Publication Date Title
CN106156029A (en) Multi-label imbalanced virtual asset data classification method based on ensemble learning
Gao et al. Discriminative learning of relaxed hierarchy for large-scale visual recognition
Shen et al. Multi-level discriminative dictionary learning with application to large scale image classification
Le et al. Probabilistic latent document network embedding
Tsai et al. Evolutionary instance selection for text classification
Rashedi et al. A hierarchical clusterer ensemble method based on boosting theory
Czarnowski et al. An approach to data reduction for learning from big datasets: Integrating stacking, rotation, and agent population learning techniques
Sun et al. Boosting ant colony optimization via solution prediction and machine learning
CN103412878B (en) Document theme partitioning method based on domain knowledge map community structure
Yang et al. Local label descriptor for example based semantic image labeling
Xue Semi‐supervised convolutional generative adversarial network for hyperspectral image classification
Li et al. Feature subset selection: a correlation‐based SVM filter approach
Hao et al. Class-wise dictionary learning for hyperspectral image classification
Nareshpalsingh et al. Multi-label classification methods: A comparative study
Xu et al. Remotely sensed image classification by complex network eigenvalue and connected degree
Shi et al. Multi-label classification based on multi-objective optimization
Qin et al. A novel factor analysis-based metric learning method for kinship verification
Ganji et al. Lagrangian constrained community detection
Lin et al. The distributed system for inverted multi-index visual retrieval
Deng et al. Differences help recognition: a probabilistic interpretation
Farhangi et al. Informative visual words construction to improve bag of words image representation
Siddiqua et al. Semantics-enhanced supervised deep autoencoder for depth image-based 3D model retrieval
Yu et al. Bag of Tricks and a Strong Baseline for FGVC.
Dimitrovski et al. Fast and efficient visual codebook construction for multi-label annotation using predictive clustering trees
Choi et al. Scene classification via hypergraph-based semantic attributes subnetworks identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province

Applicant after: National University of Defense Technology

Address before: 410073 No. 47 Yanwachi Main Street, Kaifu District, Changsha City, Hunan Province

Applicant before: NATIONAL University OF DEFENSE TECHNOLOGY

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161123