CN108960264A - Training method and device for a classification model - Google Patents

Training method and device for a classification model

Info

Publication number
CN108960264A
CN108960264A
Authority
CN
China
Prior art keywords
data
target
subset
complexity
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710361782.5A
Other languages
Chinese (zh)
Inventor
刘炯宙
夏命榛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201710361782.5A
Publication of CN108960264A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks

Abstract

This application discloses a training method and device for a classification model, for improving data analysis efficiency. The training method includes: receiving sample data for training the classification model, the sample data including multiple sample features; determining a target feature subset from the sample data, and determining high-dimensional sparse features of the target feature subset using a high-dimensional sparsification transformation method; determining a target data complexity corresponding to the high-dimensional sparse features of the target feature subset, the data complexity including multiple dimensions that characterize data features; determining, according to an established mapping between data complexity and classification algorithms, the target classification algorithm corresponding to the target data complexity, and determining, according to an established mapping between data complexity and the hyperparameter set of the target classification algorithm, the target parameters corresponding to the target data complexity; and training the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.

Description

Training method and device for a classification model
Technical field
This application relates to the field of data processing, and in particular to a training method and device for a classification model.
Background Art
With the arrival of the big data era, the volume of information data keeps expanding, and the market demand for efficient, robust, and accurate analysis of massive data grows continuously, for example in churn (off-network) prediction in telecommunications, medical diagnosis, credit classification for admission, image pattern recognition, and network data classification. Against this background, machine learning has been widely applied, and classification methods in machine learning are the most widely used.
However, the use of classification methods faces numerous problems, among which feature selection, feature transformation, model selection, and parameter tuning are the most difficult: they require repeated attempts, modification, and iteration, so the data analysis cycle is long and the cost is high. Because any single link, such as feature selection, model selection, or parameter tuning, may affect the final result, the system as a whole must be highly robust during data analysis, so that a slight problem in one link does not severely degrade the final result.
Precisely because so many factors influence data analysis, the cost of locating and debugging problems in analysis results is very high. Especially in big-data scenarios, each round of analysis often requires a great deal of computation time, making the whole analysis cycle too long and data analysis efficiency low.
Summary of the invention
This application provides a training method and device for a classification model, for improving data analysis efficiency.
A first aspect of the application provides a training method for a classification model, where the classification model is used to classify data.
To facilitate extracting relevant features from sample data, the input sample data for training the classification model is first received; the sample data includes multiple sample features.
A target feature subset is then determined from the sample data, filtering out the features that need to be used, so as to reduce the amount of computation on the data. The target feature subset is the set of features in the sample data whose relevance and redundancy both meet a target condition.
High-dimensional sparse features of the target feature subset are determined using a high-dimensional sparsification transformation method; the high-dimensional sparse features are linear features. For example, the target feature subset is sparsified using a kernel-function method to obtain its high-dimensional sparse features, improving the precision of data analysis.
Next, the target data complexity corresponding to the high-dimensional sparse features of the target feature subset is determined; the data complexity includes multiple dimensions that characterize data features. Data complexity can be used to measure the high-dimensional sparse features of the feature subset.
Then the target classification algorithm corresponding to the target data complexity is determined according to the established mapping between data complexity and classification algorithms, and the target parameters corresponding to the target data complexity are determined according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm, thereby optimizing the algorithm choice and reducing the parameter space. The mapping between data complexity and classification algorithms and the mapping between data complexity and the hyperparameter sets of classification algorithms can be obtained through prior learning and training.
Finally, the target classification algorithm is trained according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model. Using this classification model improves data analysis efficiency.
In one implementation of the first aspect, determining a target feature subset from the sample data includes:
determining a feature subset of maximum relevance and minimum redundancy from the sample data; this maximum-relevance, minimum-redundancy feature subset is the target feature subset. By extracting a feature subset that satisfies maximum relevance and minimum redundancy, weakly related data can be filtered out, reducing the amount of computation.
In one implementation of the first aspect, determining the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification transformation method includes:
first balancing the target feature subset and then adding random noise;
then splitting the balanced, noise-augmented target feature subset into a first subset and a second subset;
training a feature sparse-coding algorithm with the first subset, to obtain a feature sparse-coding generalization model;
and finally inputting the second subset and applying the feature sparse-coding generalization model to the data in the second subset obtained by the split, thereby determining the high-dimensional sparse features corresponding to the second subset.
In one implementation of the first aspect, before determining the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, the method further includes:
training the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms. This implementation obtains the two mappings through prior learning and training.
In one implementation of the first aspect, training the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms, includes:
obtaining multiple input classification algorithms and multiple groups of training data;
determining the classification algorithm corresponding to each group of training data in the multiple groups and the hyperparameter set corresponding to each of the multiple classification algorithms; by training the multiple classification algorithms on the different groups of training data, statistical information about each group of training data and its suitability for each classification algorithm is obtained. The statistical information includes the category of each classification algorithm and the parameter value range of each classification algorithm, i.e., its hyperparameter set.
obtaining multiple data complexities, each being the data complexity of one group of training data in the multiple groups; by characterizing each group of training data from multiple dimensions with a data complexity, the data complexity of each group of training data is obtained.
establishing the mapping between the multiple data complexities and the multiple classification algorithms; according to the precision of the obtained data metrics, at least one classification algorithm with higher metric precision is chosen, and a mapping between this data complexity and the at least one classification algorithm is established. For the multiple data complexities of the multiple groups of training data, the mapping between the data complexities and the multiple classification algorithms is established in this manner.
establishing the mapping between the multiple data complexities and the hyperparameter set of each classification algorithm. According to the precision of the obtained data metrics, a group of parameters with higher metric precision is chosen from the hyperparameter set as the target parameters, and a mapping between this data complexity and the target parameters in the hyperparameter set of the classification algorithm is established. For the multiple data complexities of the multiple groups of training data, the mappings between the data complexities and the hyperparameter set of each classification algorithm are established in this manner.
In one implementation of the first aspect, the data complexity includes at least two of twelve dimensions that characterize data features: linear discriminant ratio, class value-range overlap ratio, maximum single-feature efficiency, linear-classifier error rate, minimized sum of linear-classification errors, fraction of samples on the linear classification face, intra-class sample aggregation density, inter-class sample aggregation density, sample-data nonlinearity, heterogeneous-sample variation, minimum enclosing hyper-sphere count per class, and per-dimension value sparsity rate. At least two of the twelve dimensions can be chosen as the target data complexity characterizing the high-dimensional sparse features of the target feature subset.
A second aspect of the application provides a training device for a classification model, where the classification model is used to classify data. The device includes:
a transceiver unit, configured to receive sample data for training the classification model, the sample data including multiple sample features;
a processing unit, configured to determine a target feature subset from the sample data, the target feature subset being the set of features in the sample data whose relevance and redundancy both meet a target condition;
determine high-dimensional sparse features of the target feature subset using a high-dimensional sparsification transformation method, the high-dimensional sparse features being linear features;
determine the target data complexity corresponding to the high-dimensional sparse features of the target feature subset, the data complexity including multiple dimensions that characterize data features;
determine the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, and determine the target parameters corresponding to the target data complexity according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm;
and train the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.
In one implementation of the second aspect, the processing unit is configured to determine the target feature subset from the sample data by:
determining a feature subset of maximum relevance and minimum redundancy from the sample data; this maximum-relevance, minimum-redundancy feature subset is the target feature subset.
In one implementation of the second aspect, the processing unit is configured to determine the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification transformation method by:
balancing the target feature subset and then adding random noise;
splitting the balanced, noise-augmented target feature subset into a first subset and a second subset;
training a feature sparse-coding algorithm with the first subset, to obtain a feature sparse-coding generalization model;
and inputting the second subset and determining the high-dimensional sparse features corresponding to the second subset according to the feature sparse-coding generalization model.
In one implementation of the second aspect, the processing unit is further configured to:
train the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms.
In one implementation of the second aspect, the processing unit is configured to train the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms, by:
obtaining multiple input classification algorithms and multiple groups of training data;
determining the classification algorithm corresponding to each group of training data in the multiple groups and the hyperparameter set corresponding to each of the multiple classification algorithms;
obtaining multiple data complexities, each being the data complexity of one group of training data in the multiple groups;
establishing the mapping between the multiple data complexities and the multiple classification algorithms;
and establishing the mapping between the multiple data complexities and the hyperparameter set of each classification algorithm.
In one implementation of the second aspect, the data complexity includes at least two of twelve dimensions that characterize data features: linear discriminant ratio, class value-range overlap ratio, maximum single-feature efficiency, linear-classifier error rate, minimized sum of linear-classification errors, fraction of samples on the linear classification face, intra-class sample aggregation density, inter-class sample aggregation density, sample-data nonlinearity, heterogeneous-sample variation, minimum enclosing hyper-sphere count per class, and per-dimension value sparsity rate.
A third aspect of the application provides a computing device, including a memory, a transceiver, a processor, and a bus system;
the memory is configured to store programs and instructions;
the transceiver is configured to receive or send information under the control of the processor;
the processor is configured to execute the programs in the memory;
the bus system is configured to connect the memory, the transceiver, and the processor so that the memory, transceiver, and processor can communicate;
and the processor is configured to call the program instructions in the memory to execute the training method for a classification model provided in the first aspect or any implementation of the first aspect of the application.
A fourth aspect of the application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
A fifth aspect of the application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
As can be seen from the above technical solutions, the application has the following advantages: a target feature subset is determined from the sample data to reduce the amount of data computation; high-dimensional sparse features of the target feature subset are determined using a high-dimensional sparsification transformation method to improve the precision of data analysis; the target data complexity corresponding to the high-dimensional sparse features of the target feature subset is then determined; the target classification algorithm corresponding to the target data complexity is determined according to the established mapping between data complexity and classification algorithms, and the target parameters corresponding to the target data complexity are determined according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm, thereby optimizing the algorithm choice and reducing the parameter space; finally, the target classification algorithm is trained according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model. Using this classification model improves data analysis efficiency.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of the training device for a classification model provided in this application;
Fig. 2 is a schematic structural diagram of the computing device provided in this application;
Fig. 3 is a flow diagram of the training method for a classification model provided in this application;
Fig. 4 is another flow diagram of the training method for a classification model provided in this application;
Fig. 5 is another flow diagram of the training method for a classification model provided in this application;
Fig. 6 is another flow diagram of the training method for a classification model provided in this application;
Fig. 7 is another schematic structural diagram of the training device for a classification model provided in this application.
Detailed Description of Embodiments
To help those skilled in the art better understand the solution of this application, the embodiments of the application are described below with reference to the accompanying drawings.
The terms "first", "second", "third", "fourth", and so on (if any) in the description, claims, and drawings of this application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
Classification can be regarded as a mapping process from a data set to a group of predetermined, non-overlapping categories. The generation of the mapping and the application of the mapping are the main research content of classification methods in data mining. The mapping here is the commonly mentioned classification function or classification model (classifier), and the application of the mapping corresponds to the process of using the classifier to assign the data items in a data set to one of the given categories.
Mathematical definition of classification: given a data set $D = \{t_1, t_2, \ldots, t_n\}$ and a group of classes $C = \{C_1, C_2, \ldots, C_m\}$, the classification problem is to determine a mapping $f: D \to C$ such that each tuple $t_i$ is assigned to one class. A class $C_j$ contains exactly the data tuples mapped to it, i.e., $C_j = \{t_i \mid f(t_i) = C_j,\ 1 \le i \le n,\ t_i \in D\}$.
It is generally considered that, in classification problems, feature selection, feature transformation, model selection, and parameter tuning are the four most difficult and most time-consuming steps.
Therefore, it is necessary to design and develop a classification learner that reduces the difficulty of data analysis and improves data analysis efficiency. Based on this, this application provides an ensemble-learning-based training device for a classification model applied to data classification. As shown in Fig. 1, the training device may be a physical host, a physical host cluster, or a virtual machine configured for training the classification model of this application; whatever the form of the training device, as long as it can be regarded as a computing engine, it can carry out the training process of the classification model in this application. The device includes three core parts. The first part is feature selection, which selects features from the input data, for example extracting from the input sample data the set of features whose relevance and redundancy both meet the target condition. The second part is feature sparse coding, which mainly uses a nonlinear (e.g., kernel-function) method to sparsify the feature subset of the data, obtaining the high-dimensional sparse features of the feature subset. The third part is model selection and hyperparameter-space compression: since model versions are numerous and the hyperparameter space of each model is huge, this part uses the established mapping between data complexity and classification algorithms and the mapping between data complexity and the hyperparameter spaces of classification algorithms to select a suitable classification algorithm for the input sample data and to reduce the hyperparameter space required for training the classification algorithm. The selected classification algorithm is then trained with the reduced hyperparameter space and the high-dimensional sparse features of the feature subset, to obtain the classification model and improve the efficiency of data analysis. A sketch of this overall flow follows.
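Viewed end to end, the three parts compose into a pipeline. The following scikit-learn sketch is a runnable stand-in under assumed component choices (mutual-information feature selection, an RBF kernel feature map as the nonlinear sparsifying transform, and a fixed classifier in place of the complexity-driven model selection); it illustrates the structure only, not the patent's actual selection and mapping logic.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.kernel_approximation import RBFSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Assumed stand-ins for the three parts described above.
engine = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),   # part 1: feature selection
    RBFSampler(gamma=0.5, random_state=0),    # part 2: nonlinear (kernel) transform
    RandomForestClassifier(random_state=0),   # part 3: the selected algorithm
)
# Usage: engine.fit(X_train, y_train); engine.predict(X_new)
```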
The training device for a classification model in Fig. 1 can be implemented by the computing device 200 in Fig. 2. The schematic structural diagram of the computing device 200 is shown in Fig. 2; it includes a processor 202, a memory 204, and a transceiver 206, and may also include a bus 208.
The communication connections among the processor 202, the memory 204, and the transceiver 206 can be realized by the bus 208, or by other means such as wireless transmission.
The memory 204 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 204 may also include a combination of the above types of memory. When the technical solution provided by this application is implemented in software, the program code for realizing the training method for a classification model provided in Fig. 3 of this application is saved in the memory 204 and executed by the processor 202.
The computing device 200 communicates with other devices through the transceiver 206.
The processor 202 may be a central processing unit (CPU).
In this application, the transceiver 206 is configured to receive sample data for training the classification model, the sample data including multiple sample features;
the processor 202 is configured to determine a target feature subset from the sample data, the target feature subset being the set of features in the sample data whose relevance and redundancy both meet a target condition;
determine high-dimensional sparse features of the target feature subset using a high-dimensional sparsification transformation method, the high-dimensional sparse features being linear features;
determine the target data complexity corresponding to the high-dimensional sparse features of the target feature subset, the data complexity including multiple dimensions that characterize data features;
determine the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, and determine the target parameters corresponding to the target data complexity according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm;
and train the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.
The processor 202 in this application reduces the amount of data computation by determining the target feature subset from the sample data; improves the precision of data analysis by determining the high-dimensional sparse features of the target feature subset with the high-dimensional sparsification transformation method; then determines the target data complexity corresponding to the high-dimensional sparse features of the target feature subset; determines the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, and determines the target parameters corresponding to the target data complexity according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm, thereby optimizing the algorithm choice and reducing the parameter space; and finally trains the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model. Using this classification model improves data analysis efficiency.
Optionally, the processor 202 is configured to determine the target feature subset from the sample data by:
determining a feature subset of maximum relevance and minimum redundancy from the sample data; this maximum-relevance, minimum-redundancy feature subset is the target feature subset.
Optionally, the processor 202 is configured to determine the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification transformation method by:
balancing the target feature subset and then adding random noise;
splitting the balanced, noise-augmented target feature subset into a first subset and a second subset;
training a feature sparse-coding algorithm with the first subset, to obtain a feature sparse-coding generalization model;
and inputting the second subset and determining the high-dimensional sparse features corresponding to the second subset according to the feature sparse-coding generalization model.
Optionally, the processor 202 is further configured to:
train the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms.
Optionally, the processor 202 is configured to train the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms, by:
obtaining multiple input classification algorithms and multiple groups of training data;
determining the classification algorithm corresponding to each group of training data in the multiple groups and the hyperparameter set corresponding to each of the multiple classification algorithms;
obtaining multiple data complexities, each being the data complexity of one group of training data in the multiple groups;
establishing the mapping between the multiple data complexities and the multiple classification algorithms;
and establishing the mapping between the multiple data complexities and the hyperparameter set of each classification algorithm.
Optionally, the data complexity includes at least two of twelve dimensions that characterize data features: linear discriminant ratio, class value-range overlap ratio, maximum single-feature efficiency, linear-classifier error rate, minimized sum of linear-classification errors, fraction of samples on the linear classification face, intra-class sample aggregation density, inter-class sample aggregation density, sample-data nonlinearity, heterogeneous-sample variation, minimum enclosing hyper-sphere count per class, and per-dimension value sparsity rate.
This application also provides a training method for a classification model; the computing device 200 in Fig. 2 executes this method when running. Its flow diagram is shown in Fig. 3.
301. Receive the sample data for training the classification model.
The sample data includes multiple sample features. In this step, the input sample data for training the classification model is received, so that relevant features can be extracted from the sample data.
302. Determine a target feature subset from the sample data.
It should be noted that the input sample data includes multiple sample features; for example, the features measuring a person include height, weight, age, and so on. Not all features need to be used, so this step extracts a target feature subset, filtering out the features that need to be used, to reduce the amount of computation.
The target feature subset is the set of features in the sample data whose relevance and redundancy both meet a target condition. Specifically, the target feature subset is the feature subset of the sample data that satisfies maximum relevance and minimum redundancy.
It should be noted that a larger mutual information indicates higher relevance. However, the modeling effect of the top-k features selected by mutual information alone is not necessarily optimal, because individually considering the mutual information between each feature and the target ignores the associations among features and easily introduces redundant features. The relevance and redundancy of the feature subset therefore need to be considered together. Based on the mutual-information principle, a unified objective function for maximum relevance and minimum redundancy can be derived, and the feature subset satisfying maximum relevance and minimum redundancy is determined from the sample data according to that function.
The mutual information is computed as
$$I(x; y) = \sum_{y}\sum_{x} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)},$$
where $I(x; y)$ is the mutual information between feature $x$ and feature $y$, and $p(x, y)$ is the joint probability of $x$ and $y$. Feature relevance is computed as
$$D(S, c) = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c),$$
where $S$ is the selected feature subset, $c$ is the classification target, $D$ is the relevance between the features and the classification target, $|S|$ is the number of features in the subset, $I$ is the mutual information, and $x_i$ is the $i$-th feature. Feature redundancy is computed as
$$R(S) = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i; x_j),$$
where $R$ is the redundancy among features and $x_j$ is the $j$-th feature. From the calculation of mutual information, feature relevance, and feature redundancy, the unified objective function for maximum relevance and minimum redundancy follows in its incremental form:
$$\max_{x_j \in X \setminus S_{m-1}} \left[\, I(x_j; c) - \frac{1}{m-1}\sum_{x_i \in S_{m-1}} I(x_j; x_i) \right],$$
where $m-1$ is the number of features already selected as satisfying maximum relevance and minimum redundancy, $X$ is the set of all features, and $S_{m-1}$ is the set of the $m-1$ features already selected.
Through this unified maximum-relevance, minimum-redundancy objective function, the feature subset of maximum relevance and minimum redundancy can be determined from the sample data. In connection with the feature selection part shown in Fig. 1, the maximum-relevance, minimum-redundancy feature subset obtained in this step can serve as the feature subset for the subsequent feature sparse-coding processing.
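As a concrete illustration, the incremental selection implied by the objective above can be sketched as follows. This is a minimal sketch assuming discrete-valued features, using scikit-learn's `mutual_info_score`; it is not the patent's implementation, and `mrmr_select` is a hypothetical helper name.

```python
from sklearn.metrics import mutual_info_score

def mrmr_select(features, target, k):
    """Greedy max-relevance, min-redundancy selection.

    features: dict mapping feature name -> discrete value sequence.
    target:   discrete class label sequence.
    """
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_info_score(features[name], target)
            redundancy = (sum(mutual_info_score(features[name], features[s])
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy  # the incremental mRMR objective
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```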
303. Determine the high-dimensional sparse features of the target feature subset using a high-dimensional sparsification transformation method.
The high-dimensional sparsification transformation method includes, but is not limited to, nonlinear (e.g., kernel-function) methods. For example, the target feature subset is sparsified using a kernel-function method to obtain its high-dimensional sparse features.
Optionally, determining the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification transformation method includes:
balancing the target feature subset and then adding random noise;
splitting the balanced, noise-augmented target feature subset into a first subset and a second subset;
training a feature sparse-coding algorithm with the first subset, to obtain a feature sparse-coding generalization model; the feature sparse-coding generalization model is a nonlinear feature transformation model obtained with a nonlinear method;
and inputting the second subset and determining the high-dimensional sparse features corresponding to the second subset according to the feature sparse-coding generalization model.
It should be noted, in connection with Fig. 4, that a specific implementation of this process can be as follows: the target feature subset extracted from the sample data is first balanced, and random noise is then added. Available balancing methods include stochastic sampling (e.g., the Gibbs sampling algorithm) and stratified sampling; available random noise includes Gaussian white noise, uniform noise, and the like.
The balanced, noise-augmented target feature subset is then split; the first subset obtained by the split, i.e., sample 1, is used to train the feature sparse-coding algorithm, to obtain the feature sparse-coding generalization model.
Finally, the feature sparse-coding generalization model is applied to the data in the second subset obtained by the split, yielding the high-dimensional sparse features of the second subset, i.e., sample 2. A sketch of this preprocessing workflow is given below.
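The following is a minimal NumPy sketch of the balancing, noise-injection, and split steps, with assumed choices of uniform resampling to the majority-class size and Gaussian white noise; the encoder itself is trained in the two ways described next.

```python
import numpy as np

def prepare_for_sparse_coding(X, y, noise_std=0.01, train_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)

    # Balance classes by resampling each class to the majority size
    # (a simple stand-in for the stochastic/stratified sampling named above).
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), target, replace=True)
                          for c in classes])
    Xb, yb = X[idx], y[idx]

    # Add random noise (here: Gaussian white noise).
    Xb = Xb + rng.normal(0.0, noise_std, size=Xb.shape)

    # Split into subset 1 (trains the encoder) and subset 2 (to be encoded).
    perm = rng.permutation(len(Xb))
    cut = int(train_frac * len(Xb))
    return (Xb[perm[:cut]], yb[perm[:cut]]), (Xb[perm[cut:]], yb[perm[cut:]])
```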
The nonlinear sparsification of the target feature subset, i.e., obtaining its high-dimensional sparse features, can be implemented, for example, in the following ways:
One possible implementation uses a three-layer neural network (the number of layers can be configured flexibly) to apply a nonlinear transformation to the target feature subset. The main steps are as follows:
First, the connection coefficients between layers are obtained by training on the first subset obtained by the split, using the backpropagation algorithm; the outputs between adjacent layers of the neural network are related by a coefficient ratio, which is the connection coefficient.
Then each sample $v_i$ of the second subset obtained by the split is traversed (forward propagation can be used as the traversal method), to obtain each sample's output at the last hidden layer (i.e., the second layer) of the three-layer neural network.
To obtain the high-dimensional sparse-feature representation of the target feature subset, an activation function can be added to the feature of each dimension among the multiple dimensions that characterize the data features of the target feature subset. The activation function decides whether the feature of each dimension is kept; it can be any of the sigmoid function, the softmax function, the ReLU function, and so on. A sketch of this variant follows.
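Here is a minimal NumPy sketch of the neural-network variant, under the assumptions that the activation is a sigmoid and that features below a threshold are zeroed out; the weights `W1`, `b1` are assumed to have already been produced by backpropagation training on subset 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_sparse_features(samples, W1, b1, threshold=0.5):
    """Forward-propagate subset-2 samples to the last hidden layer (layer 2)
    of the trained three-layer network and keep its thresholded activations
    as the high-dimensional sparse features."""
    h = sigmoid(samples @ W1 + b1)        # layer-2 (last hidden layer) output
    # The activation decides whether each dimension's feature is kept.
    return np.where(h > threshold, h, 0.0)
```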
Another possible implementation takes random forests as an example. In machine learning, a random forest is a classifier containing multiple decision trees. This application uses a random-forest model (or a similar model, such as gradient boosting decision trees (GBDT)) to sparsify the target feature subset into high dimensions and give the sparsified data a regular distribution. The main steps are as follows:
First, a random-forest model (containing multiple trees) is obtained by training on the first subset obtained by the split.
Then each sample $v_i$ of the second subset obtained by the split is traversed; each sample $v_i$ falls on some leaf node of each tree $t_j$.
For each sample $v_i$, a sparse-matrix feature vector is constructed per tree $t_j$: the feature vector of the leaf node where the sample lands is set to (1, 0), and the rest are set to (0, 0).
The above feature vectors are arranged, in order, into the feature vector of a sparse matrix of length $t_j$. For example, with vector 1 = (0, 1), vector 2 = (1, 1), and vector 3 = (1, 0), the arranged sparse-matrix feature vector is (0, 1, 1, 1, 1, 0). A scikit-learn sketch of this leaf-encoding idea follows.
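A minimal scikit-learn sketch of the random-forest variant: fit a forest on subset 1, then one-hot encode the leaf index that each subset-2 sample reaches in every tree, yielding a regular sparse feature matrix. The library calls are standard scikit-learn; the per-leaf one-hot encoding detail follows the description above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

def forest_sparse_features(X_train, y_train, X_apply, n_trees=50):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)            # trained on subset 1
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoder.fit(forest.apply(X_train))      # learn the leaf vocabulary per tree
    leaves = forest.apply(X_apply)          # (n_samples, n_trees) leaf indices
    return encoder.transform(leaves)        # sparse one-hot feature matrix
```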
304. Determine the target data complexity corresponding to the high-dimensional sparse features of the target feature subset.
It should be noted that the differences between data sets are diverse; to find the association between a data set and algorithms and parameters, the abstract characteristics of the data must be portrayed in a uniform way. Data complexity can therefore be used to measure the high-dimensional sparse features of the feature subset. The data complexity includes multiple dimensions that characterize data features; specifically, it may include at least two of twelve dimensions: linear discriminant ratio, class value-range overlap ratio, maximum single-feature efficiency, linear-classifier error rate, minimized sum of linear-classification errors, fraction of samples on the linear classification face, intra-class sample aggregation density, inter-class sample aggregation density, sample-data nonlinearity, heterogeneous-sample variation, minimum enclosing hyper-sphere count per class, and per-dimension value sparsity rate. At least two of the twelve dimensions can be chosen as the target data complexity characterizing the high-dimensional sparse features of the target feature subset.
It should be noted that the linear discriminant ratio is based on Fisher's linear decision rule and computes the discriminating power of a dimension for the analysis target: $f = \dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$, where $\mu_1$, $\mu_2$ are the means of the two classes and $\sigma_1^2$, $\sigma_2^2$ are their variances.
The class value-range overlap ratio describes the overlap of the value ranges to which different classification targets belong; the per-feature class overlap ratios are multiplied over all features: $\prod_i \dfrac{\mathrm{MINMAX}_i - \mathrm{MAXMIN}_i}{\mathrm{MAXMAX}_i - \mathrm{MINMIN}_i}$, where MAX and MIN denote the maximum and minimum of a feature dimension over a given class (e.g., $\mathrm{MINMAX}_i$ is the smaller of the two class maxima of feature $i$).
Maximum single-feature efficiency considers, for each feature, the fraction of samples that fall outside the feature's class-overlap region, i.e., are separable by a hyperplane perpendicular to that feature dimension, and takes the largest such value.
The linear-classifier error rate measures the degree of linear separability of the data sample.
The minimized sum of linear-classification errors is the minimal sum of the errors of all sample points measured by a linear classification face against the separating hyperplane, obtained from the linear program: minimize $a^{T}t$ subject to $Z^{T}w + t \ge b$, $t \ge 0$, where $a$, $b$, and $w$ are the separating-hyperplane parameters of an arbitrary linear classifier and $t$ is the error vector over the sample matrix $Z$.
The fraction of samples on the linear classification face refers to the proportion of samples lying exactly on the class boundary.
Intra-class sample aggregation density measures how tightly samples of the same class cluster: $\dfrac{\sum_i \mathrm{intraDist}(x_i)}{\sum_i \mathrm{interDist}(x_i)}$, where $\mathrm{intraDist}(x_i)$ is the minimum distance from sample $x_i$ to other samples of the same class and $\mathrm{interDist}(x_i)$ is the minimum distance from $x_i$ to samples of other classes.
Inter-class sample aggregation density describes how tightly samples of different classes cluster relative to one another.
Sample-data nonlinearity describes the nonlinearity of the data sample. The basic method is to repeatedly divide same-class samples with random linear partition faces and then compute the misclassification rate.
Heterogeneous-sample variation describes the nonlinearity between a sample and the other classes; the basic method is to compute the distance relation between a sample and its nearest other-class convex hull.
The minimum enclosing hyper-sphere count per class describes the number of minimal spheres (each containing only same-class samples and not overlapping other classes) needed to cover each class; the sphere counts of all classes are normalized to give a ratio per class.
The per-dimension value sparsity rate is the average proportion of samples with a value in each dimension: $\frac{m}{n}$ per dimension, averaged over dimensions, where $m$ is the number of samples with a value in that dimension and $n$ is the total number of samples. A combined sketch of several of these measures follows.
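To make the measures concrete, here is a minimal NumPy sketch of three of the twelve dimensions for a two-class problem (labels 0/1 assumed): the Fisher linear discriminant ratio, the intra-/inter-class aggregation density ratio, and the per-dimension value sparsity, following the formulas above; the remaining measures follow the same pattern.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fisher_ratio(x, y):
    """f = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) for one feature dimension."""
    a, b = x[y == 0], x[y == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

def aggregation_density(X, y):
    """Sum of nearest same-class distances over nearest other-class distances."""
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbor
    same = y[:, None] == y[None, :]
    intra = np.where(same, d, np.inf).min(axis=1)
    inter = np.where(~same, d, np.inf).min(axis=1)
    return intra.sum() / inter.sum()

def value_sparsity(X):
    """Average fraction of samples with a (non-zero) value per dimension."""
    return (X != 0).mean()
```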
305. Determine the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, and determine the target parameters corresponding to the target data complexity according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm.
It should be noted that the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms, can be established as in the following implementation.
Optionally, before determining the target classification algorithm corresponding to the target data complexity according to the established mapping between data complexity and classification algorithms, the method further includes:
training the mapping between data complexity and classification algorithms, and the mapping between data complexity and the hyperparameter sets of classification algorithms.
Training these mappings includes:
obtaining multiple input classification algorithms and multiple groups of training data;
determining the classification algorithm corresponding to each group of training data in the multiple groups and the hyperparameter set corresponding to each of the multiple classification algorithms;
obtaining multiple data complexities, each being the data complexity of one group of training data in the multiple groups;
establishing the mapping between the multiple data complexities and the multiple classification algorithms;
and establishing the mapping between the multiple data complexities and the hyperparameter set of each classification algorithm.
It should be noted, in connection with Fig. 5, that these steps can obtain, through prior learning and training, the mapping between data complexity and classification algorithms and the mapping between data complexity and the hyperparameter sets of classification algorithms.
To establish the mapping between data complexity and classification algorithms, multiple classification algorithms can first be trained on multiple different groups of training data, yielding statistical information about each group of training data and its suitability for each classification algorithm; the statistical information includes the category of each classification algorithm and the parameter value range of each classification algorithm, i.e., its hyperparameter set.
A data complexity characterizes each group of training data from multiple dimensions, yielding the data complexity of each group of training data in the multiple groups.
With the data features of each group of training data characterized by one data complexity, training on each input group yields the data metrics between that group's data complexity and the multiple classification algorithms. According to the precision of the obtained metrics, at least one classification algorithm with higher metric precision is chosen, and a mapping between this data complexity and the at least one classification algorithm is established. For the multiple data complexities of the multiple groups of training data, the mapping between the data complexities and the multiple classification algorithms is established in this manner.
Since the hyperparameter set corresponding to each classification algorithm is huge, the range of the hyperparameter set must be reduced. Using the same precision measure, training yields the data metrics between each group's data complexity and the hyperparameter set of the chosen (higher-precision) classification algorithm. According to the precision of the obtained metrics, a group of parameters with higher metric precision is chosen from the hyperparameter set as the target parameters, and a mapping between this data complexity and the target parameters in the hyperparameter set of the classification algorithm is established. For the multiple data complexities of the multiple groups of training data, the mappings between the data complexities and the hyperparameter set of each classification algorithm are established in this manner.
For example, a set of multiple classification algorithms is initialized, say three algorithms: random forest (RF), logistic regression (LR), and support vector machine (SVM), numbered 0, 1, and 2 respectively.
Multiple different groups of training data are chosen and trained on each of the three classification algorithms; with the data precision (i.e., accuracy) of the training results as the measure, a division by precision yields the classification algorithm suited to each group of training data.
According to the data-complexity measures, the metrics of each group of training data under the twelve dimensions of data complexity are obtained. Table 1 below shows, for one group of training data, the data sets obtained by training on the three classification algorithms, represented under the twelve dimensions of data complexity.
Table 1
The data sets under the twelve complexity dimensions of each group of training data in Table 1 are divided according to precision, the classification algorithm suited to (i.e., with higher precision on) the data complexity of that group of training data is selected, and a mapping between this data complexity and that classification algorithm is established. In this manner, the mappings between the multiple data complexities and the multiple classification algorithms are established.
By the same method, the mapping between the hyperparameter set of each classification algorithm and each data complexity is established. A sketch of this pre-learning stage follows.
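The pre-learning stage above can be sketched as follows: train each candidate algorithm, over a small grid of parameter groups, on each training-data group; score by accuracy; and record the best algorithm and parameter group against that group's complexity vector. The candidate set matches the RF/LR/SVM example; `measure_complexity`, `param_grids`, and the record format are assumed details, not from the patent.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Candidate algorithms numbered 0, 1, 2 as in the example above.
CANDIDATES = {0: RandomForestClassifier, 1: LogisticRegression, 2: SVC}

def build_mappings(training_groups, measure_complexity, param_grids):
    """training_groups: iterable of (X, y); param_grids: algo id -> list of kwargs.
    Returns (complexity_vector, best_algo_id, best_params) records."""
    records = []
    for X, y in training_groups:
        complexity = measure_complexity(X, y)   # the 12-dimension vector
        best = None
        for algo_id, Algo in CANDIDATES.items():
            for params in param_grids[algo_id]:
                acc = cross_val_score(Algo(**params), X, y, cv=3).mean()
                if best is None or acc > best[0]:
                    best = (acc, algo_id, params)
        records.append((complexity, best[1], best[2]))
    return records
```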
According to the trained mappings between the multiple data complexities and the multiple classification algorithms, for an input piece of sample data, the target data complexity characterizing that sample data is obtained, and the target classification algorithm corresponding to the target data complexity can then be determined; likewise, according to the trained mappings between the multiple data complexities and the hyperparameter set of each classification algorithm, the target parameters corresponding to the target data complexity can be determined, for example as sketched below.
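At analysis time, the mappings can be consulted by finding the stored complexity closest to the new data's complexity. The nearest-neighbor matching rule here is an assumption; the patent states only that the mappings determine the target algorithm and parameters.

```python
import numpy as np

def lookup_algorithm_and_params(records, target_complexity):
    """Pick the record whose complexity vector is nearest the target's
    (Euclidean distance; the matching rule is an assumed detail)."""
    diffs = [np.linalg.norm(np.asarray(c) - np.asarray(target_complexity))
             for c, _, _ in records]
    _, algo_id, params = records[int(np.argmin(diffs))]
    return algo_id, params
```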
306. Train the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.
In connection with Fig. 6, step 305 applies the mapping between data complexity and classification algorithms and the mapping between data complexity and the hyperparameter sets of classification algorithms, both obtained by prior learning and training. For the data complexity of a given piece of sample data, step 305 determines the classification algorithm and target parameters selected for that sample data. According to the determined target parameters and the high-dimensional sparse features of the target feature subset, the target classification algorithm is trained by an ensemble learning method (e.g., the bagging (bootstrap aggregating) algorithm or a boosting algorithm such as AdaBoost), and the classification model is finally output. Using this classification model improves data analysis efficiency. A sketch of this final step follows.
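A minimal sketch of the final training step, reusing the `CANDIDATES` table and `lookup_algorithm_and_params` helper from the sketches above and wrapping the selected algorithm in scikit-learn's `BaggingClassifier`; the choice of bagging over boosting follows the alternatives named above and is otherwise arbitrary.

```python
from sklearn.ensemble import BaggingClassifier

def train_final_model(records, target_complexity, sparse_features, labels):
    algo_id, params = lookup_algorithm_and_params(records, target_complexity)
    base = CANDIDATES[algo_id](**params)      # target algorithm + target params
    # Ensemble learning: bagging, i.e., bootstrap aggregating.
    model = BaggingClassifier(base, n_estimators=10)
    model.fit(sparse_features, labels)
    return model                              # the output classification model
```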
In this application, a target feature subset is determined from the sample data to reduce the amount of data computation; high-dimensional sparse features of the target feature subset are determined using a high-dimensional sparsification transformation method to improve the precision of data analysis; the target data complexity corresponding to the high-dimensional sparse features of the target feature subset is then determined; the target classification algorithm corresponding to the target data complexity is determined according to the established mapping between data complexity and classification algorithms, and the target parameters corresponding to the target data complexity are determined according to the established mapping between data complexity and the hyperparameter set of the target classification algorithm, thereby optimizing the algorithm choice and reducing the parameter space; and the target classification algorithm is trained according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain a classification model. Using this classification model improves data analysis efficiency.
It should be noted that, unless otherwise specified, this application limits neither the order of the steps nor the dependency relations among them.
This application also provides a training device 700 for a classification model. The device can be realized by the computing device 200 shown in Fig. 2, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD). The PLD can be a complex programmable logic device (CPLD), a generic array logic (GAL), or any combination thereof. The training device 700 is used to realize the training method for a classification model shown in Fig. 3. When the training method shown in Fig. 3 is implemented in software, the device 700 can also be a software module.
The schematic structural diagram of the training device 700 is shown in Fig. 7; it includes a transceiver unit 702 and a processing unit 704. When the transceiver unit 702 works, it executes step 301 in the training method of Fig. 3 and the optional schemes of step 301. When the processing unit 704 works, it executes steps 302 to 306 in the training method of Fig. 3 and the optional schemes of steps 302 to 306. It should be noted that, in this application, the processing unit 704 can also be realized by the processor 202 shown in Fig. 2, and the transceiver unit 702 can also be realized by the transceiver 206 shown in Fig. 2.
The relevant description of the above device can be understood with reference to the relevant description and effects in the method embodiment part, and is not repeated here.
In the above embodiments, the solutions can be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they can be realized wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any usable medium that a computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (15)

1. A training method for a classification model, wherein the classification model is used to classify data, the method comprising:
receiving sample data for training the classification model, the sample data comprising a plurality of sample features;
determining a target feature subset from the sample data, the target feature subset being a set of features in the sample data whose relevance and redundancy both meet a target condition;
determining high-dimensional sparse features of the target feature subset using a high-dimensional sparsification conversion method, the high-dimensional sparse features being linear features;
determining a target data complexity corresponding to the high-dimensional sparse features of the target feature subset, the data complexity comprising a plurality of dimensions for characterizing data features;
determining a target classification algorithm corresponding to the target data complexity according to an established mapping relation between data complexities and classification algorithms, and determining target parameters corresponding to the target data complexity according to an established mapping relation between data complexities and the hyperparameter set of the target classification algorithm;
training the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.
2. The method according to claim 1, wherein determining the target feature subset from the sample data comprises:
determining, from the sample data, the feature subset having maximum relevance and minimum redundancy; the feature subset having maximum relevance and minimum redundancy is the target feature subset.
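For illustration only, claim 2's maximum-relevance / minimum-redundancy criterion is commonly realized with a greedy mRMR procedure. The sketch below, which assumes scikit-learn's mutual-information estimators, is one plausible reading, not the claimed implementation:

```python
# Greedy mRMR: pick features that maximise mutual information with the label
# while minimising average mutual information with features already chosen.
# A sketch assuming scikit-learn estimators, not the claimed implementation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    remaining = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and remaining:
        def score(j):
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            return relevance[j] - redundancy  # relevance minus redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.discard(best)
    return selected  # column indices of the target feature subset

# usage: columns = mrmr(X, y, k=10); X_target = X[:, columns]
```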
3. The method according to claim 1, wherein determining the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification conversion method comprises:
performing equalization processing on the target feature subset, and then adding random noise;
splitting the target feature subset after the equalization processing and addition of random noise into a first subset and a second subset;
training a feature sparse coding algorithm with the first subset, to obtain a generalized feature sparse coding model;
inputting the second subset, and determining the high-dimensional sparse features corresponding to the second subset according to the generalized feature sparse coding model.
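A hedged sketch of one way claim 3's four sub-steps could be realized, with scikit-learn's dictionary learning standing in for the unspecified "feature sparse coding algorithm", StandardScaler as the equalization step, and an arbitrary Gaussian noise scale:

```python
# Sketch of the four sub-steps: equalise, add noise, split, train a sparse
# coder on the first subset and sparse-code the second. StandardScaler, the
# 0.01 noise scale, and dictionary learning are all assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))        # stands in for the target feature subset

X_eq = StandardScaler().fit_transform(X)  # equalization processing
X_noisy = X_eq + 0.01 * rng.standard_normal(X_eq.shape)  # add random noise

first, second = train_test_split(X_noisy, test_size=0.5, random_state=0)

coder = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
coder.fit(first)                          # train the feature sparse coding model
sparse_features = coder.transform(second) # high-dimensional sparse features
print(sparse_features.shape)              # (150, 128): more dimensions, mostly zeros
```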
4. The method according to any one of claims 1 to 3, wherein before determining the target classification algorithm corresponding to the target data complexity according to the established mapping relation between data complexities and classification algorithms, the method further comprises:
training the mapping relation between data complexities and classification algorithms, and the mapping relation between data complexities and the hyperparameter sets of the classification algorithms.
5. The method according to claim 4, wherein training the mapping relation between data complexities and classification algorithms, and the mapping relation between data complexities and the hyperparameter sets of the classification algorithms, comprises:
obtaining a plurality of classification algorithms and a plurality of groups of training data as input;
determining the classification algorithm corresponding to each group of training data among the plurality of classification algorithms, and the hyperparameter set corresponding to each of the plurality of classification algorithms;
obtaining a plurality of data complexities, each being the data complexity of one group of training data;
establishing the mapping relation between the plurality of data complexities and the plurality of classification algorithms;
establishing the mapping relation between the plurality of data complexities and the hyperparameter set corresponding to each classification algorithm.
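Claim 5 in effect builds two lookup tables keyed by data complexity. A minimal illustrative sketch, in which every concrete value (group contents, winning algorithms, hyperparameters) is invented:

```python
# Sketch of claim 5: given per-group winning algorithms and hyperparameters,
# key both mappings by each group's data complexity. All values are invented.

def complexity_of(group):
    # Stand-in for the multi-dimensional data complexity of claim 6.
    return (len(group), len(group[0]))

groups = [[[0.1, 0.2]] * 50, [[0.3, 0.4, 0.5]] * 200]      # groups of training data
algorithms = ["logistic_regression", "gradient_boosting"]  # assumed per-group winners
hyperparams = [{"C": 1.0}, {"n_estimators": 100}]          # assumed per-algorithm sets

complexity_to_algorithm = {}
complexity_to_hyperparams = {}
for group, algo, params in zip(groups, algorithms, hyperparams):
    c = complexity_of(group)
    complexity_to_algorithm[c] = algo      # mapping: complexity -> algorithm
    complexity_to_hyperparams[c] = params  # mapping: complexity -> hyperparameter set

print(complexity_to_algorithm[(50, 2)], complexity_to_hyperparams[(50, 2)])
```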
6. The method according to any one of claims 1 to 3, wherein the data complexity comprises at least two of twelve dimensions for characterizing data features, the twelve dimensions being: linear discriminant rate, target-type range overlap rate, single-feature maximum efficiency, linear classification error rate, linear classification minimum error, linear classification boundary sample proportion, intra-class sample aggregation density, inter-class sample aggregation density, non-linearity of the sample data, heterogeneous sample variation, minimum hyper-dimensional closure of the samples of each class, and sparsity rate of the values in each dimension.
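Two of claim 6's twelve dimensions have straightforward concrete forms. The sketch below computes a linear classification error rate and a per-dimension value sparsity rate; these formulas are plausible readings, not the application's exact definitions:

```python
# Plausible concrete forms for two of the twelve dimensions; the application's
# exact definitions are not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_classification_error_rate(X, y):
    # Training error of a linear classifier: higher values indicate classes
    # that are harder to separate with a linear boundary.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return 1.0 - clf.score(X, y)

def value_sparsity_rate(X):
    # Fraction of zero entries in each dimension.
    return np.mean(X == 0, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[X < 0.5] = 0.0                  # make the data sparse
y = rng.integers(0, 2, 100)
print(linear_classification_error_rate(X, y), value_sparsity_rate(X))
```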
7. A training device for a classification model, wherein the classification model is used to classify data, the device comprising:
a transceiver unit, configured to receive sample data for training the classification model, the sample data comprising a plurality of sample features;
a processing unit, configured to: determine a target feature subset from the sample data, the target feature subset being a set of features in the sample data whose relevance and redundancy both meet a target condition;
determine high-dimensional sparse features of the target feature subset using a high-dimensional sparsification conversion method, the high-dimensional sparse features being linear features;
determine a target data complexity corresponding to the high-dimensional sparse features of the target feature subset, the data complexity comprising a plurality of dimensions for characterizing data features;
determine a target classification algorithm corresponding to the target data complexity according to an established mapping relation between data complexities and classification algorithms, and determine target parameters corresponding to the target data complexity according to an established mapping relation between data complexities and the hyperparameter set of the target classification algorithm; and
train the target classification algorithm according to the determined target parameters and the high-dimensional sparse features of the target feature subset, to obtain the classification model.
8. The device according to claim 7, wherein, to determine the target feature subset from the sample data, the processing unit is configured to:
determine, from the sample data, the feature subset having maximum relevance and minimum redundancy; the feature subset having maximum relevance and minimum redundancy is the target feature subset.
9. The device according to claim 7, wherein, to determine the high-dimensional sparse features of the target feature subset using the high-dimensional sparsification conversion method, the processing unit is configured to:
perform equalization processing on the target feature subset and then add random noise;
split the target feature subset after the equalization processing and addition of random noise into a first subset and a second subset;
train a feature sparse coding algorithm with the first subset, to obtain a generalized feature sparse coding model;
input the second subset, and determine the high-dimensional sparse features corresponding to the second subset according to the generalized feature sparse coding model.
10. The device according to any one of claims 7 to 9, wherein the processing unit is further configured to:
train the mapping relation between data complexities and classification algorithms, and the mapping relation between data complexities and the hyperparameter sets of the classification algorithms.
11. The device according to claim 10, wherein, to train the mapping relation between data complexities and classification algorithms and the mapping relation between data complexities and the hyperparameter sets of the classification algorithms, the processing unit is configured to:
obtain a plurality of classification algorithms and a plurality of groups of training data as input;
determine the classification algorithm corresponding to each group of training data among the plurality of classification algorithms, and the hyperparameter set corresponding to each of the plurality of classification algorithms;
obtain a plurality of data complexities, each being the data complexity of one group of training data;
establish the mapping relation between the plurality of data complexities and the plurality of classification algorithms;
establish the mapping relation between the plurality of data complexities and the hyperparameter set corresponding to each classification algorithm.
12. The device according to any one of claims 7 to 10, wherein the data complexity comprises at least two of twelve dimensions for characterizing data features, the twelve dimensions being: linear discriminant rate, target-type range overlap rate, single-feature maximum efficiency, linear classification error rate, linear classification minimum error, linear classification boundary sample proportion, intra-class sample aggregation density, inter-class sample aggregation density, non-linearity of the sample data, heterogeneous sample variation, minimum hyper-dimensional closure of the samples of each class, and sparsity rate of the values in each dimension.
13. A computing device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store programs and instructions;
the transceiver is configured to receive or send information under the control of the processor;
the processor is configured to execute the programs in the memory;
the bus system is configured to connect the memory, the transceiver, and the processor, so that the memory, the transceiver, and the processor communicate with each other; and
the processor is configured to call the program instructions in the memory to perform the method according to any one of claims 1 to 6.
14. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 6.
15. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 6.
CN201710361782.5A 2017-05-19 2017-05-19 The training method and device of disaggregated model Pending CN108960264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710361782.5A CN108960264A (en) 2017-05-19 2017-05-19 The training method and device of disaggregated model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710361782.5A CN108960264A (en) 2017-05-19 2017-05-19 The training method and device of disaggregated model

Publications (1)

Publication Number Publication Date
CN108960264A true CN108960264A (en) 2018-12-07

Family

ID=64462257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710361782.5A Pending CN108960264A (en) 2017-05-19 2017-05-19 The training method and device of disaggregated model

Country Status (1)

Country Link
CN (1) CN108960264A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020118743A1 (en) * 2018-12-14 2020-06-18 深圳先进技术研究院 Data feature extraction method, apparatus and electronic device
CN111325227B (en) * 2018-12-14 2023-04-07 深圳先进技术研究院 Data feature extraction method and device and electronic equipment
CN111325227A (en) * 2018-12-14 2020-06-23 深圳先进技术研究院 Data feature extraction method and device and electronic equipment
CN111382210B (en) * 2018-12-27 2023-11-10 中国移动通信集团山西有限公司 Classification method, device and equipment
CN111382210A (en) * 2018-12-27 2020-07-07 中国移动通信集团山西有限公司 Classification method, device and equipment
CN109871809A (en) * 2019-02-22 2019-06-11 福州大学 A kind of machine learning process intelligence assemble method based on semantic net
CN110163259B (en) * 2019-04-26 2023-12-15 创新先进技术有限公司 Method, system and equipment for generating sample data
CN110163259A (en) * 2019-04-26 2019-08-23 阿里巴巴集团控股有限公司 A kind of method, system and equipment generating sample data
CN110210018A (en) * 2019-05-14 2019-09-06 北京百度网讯科技有限公司 It registers the matching process and device of department
CN110210018B (en) * 2019-05-14 2023-07-11 北京百度网讯科技有限公司 Matching method and device for registration department
CN110309127A (en) * 2019-07-02 2019-10-08 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN110825966A (en) * 2019-10-31 2020-02-21 广州市百果园信息技术有限公司 Information recommendation method and device, recommendation server and storage medium
CN110825966B (en) * 2019-10-31 2022-03-04 广州市百果园信息技术有限公司 Information recommendation method and device, recommendation server and storage medium
CN111143203A (en) * 2019-12-13 2020-05-12 支付宝(杭州)信息技术有限公司 Machine learning method, privacy code determination method, device and electronic equipment
CN111143203B (en) * 2019-12-13 2022-04-22 支付宝(杭州)信息技术有限公司 Machine learning method, privacy code determination method, device and electronic equipment
CN113159085A (en) * 2020-12-30 2021-07-23 北京爱笔科技有限公司 Training of classification model, image-based classification method and related device
CN116028829A (en) * 2021-01-20 2023-04-28 国义招标股份有限公司 Correction clustering processing method, device and storage medium based on transmission step length adjustment
CN116028829B (en) * 2021-01-20 2023-10-24 国义招标股份有限公司 Correction clustering processing method, device and storage medium based on transmission step length adjustment
CN112966182A (en) * 2021-03-09 2021-06-15 中国民航信息网络股份有限公司 Project recommendation method and related equipment
CN112966182B (en) * 2021-03-09 2024-02-09 中国民航信息网络股份有限公司 Project recommendation method and related equipment

Similar Documents

Publication Publication Date Title
CN108960264A (en) The training method and device of disaggregated model
US11868856B2 (en) Systems and methods for topological data analysis using nearest neighbors
Yu et al. Hierarchical deep click feature prediction for fine-grained image recognition
US10713597B2 (en) Systems and methods for preparing data for use by machine learning algorithms
Das et al. Automatic clustering using an improved differential evolution algorithm
CN106537422A (en) Systems and methods for capture of relationships within information
WO2019011093A1 (en) Machine learning model training method and apparatus, and facial expression image classification method and apparatus
CN109446517A (en) Reference resolution method, electronic device and computer readable storage medium
EP3268870A1 (en) Systems and methods for predicting outcomes using a prediction learning model
CN108804641A (en) A kind of computational methods of text similarity, device, equipment and storage medium
CN108351985A (en) Method and apparatus for large-scale machines study
CN106951825A (en) A kind of quality of human face image assessment system and implementation method
CN102324038B (en) Plant species identification method based on digital image
CN108509982A (en) A method of the uneven medical data of two classification of processing
CN109934615B (en) Product marketing method based on deep sparse network
CN108985929A (en) Training method, business datum classification processing method and device, electronic equipment
Torrente et al. Initializing k-means clustering by bootstrap and data depth
CN107947921A (en) Based on recurrent neural network and the password of probability context-free grammar generation system
NO319838B1 (en) Improving knowledge discovery from multiple datasets by using multiple support vector machines
CN111368926B (en) Image screening method, device and computer readable storage medium
CN106022359A (en) Fuzzy entropy space clustering analysis method based on orderly information entropy
US10733499B2 (en) Systems and methods for enhancing computer assisted high throughput screening processes
CN110110628A (en) A kind of detection method and detection device of frequency synthesizer deterioration
CN113516019A (en) Hyperspectral image unmixing method and device and electronic equipment
Jaffel et al. A symbiotic organisms search algorithm for feature selection in satellite image classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181207