CN106934410A - Data classification method and system - Google Patents
Data classification method and system
- Publication number
- Publication number: CN106934410A (application CN201511020318.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- sorting algorithm
- classification
- classifier
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
Abstract
In the data classification method and system provided by this application: a primary training set of data with marked classification results is selected; a first classification algorithm is selected from a classification algorithm set; the parameters corresponding to the first classification algorithm are optimized using the primary training set, obtaining primary parameters that meet expectations; a first-level classifier defined by the first classification algorithm and the primary parameters is constructed; a primary test set of data with marked classification results is selected; the primary test set is classified with the first-level classifier, generating a secondary training set composed of the test classification results and the marked classification results; a second classification algorithm is selected from the classification algorithm set; the parameters corresponding to the second classification algorithm are optimized using the secondary training set, obtaining secondary parameters that meet expectations; a second-level classifier defined by the second classification algorithm and the secondary parameters is constructed; and the first-level classifier and the second-level classifier are combined into a combined classifier for classifying data. By combining accurate classifiers, the accuracy of classification can be improved.
Description
Technical field
This application relates to big data technology, and in particular to a method and system that apply machine learning to data classification.
Background art
In the construction of a credit investigation system, introducing machine learning algorithms in combination with a scoring system can solve the problem of quantifying enterprise and personal credit.
A computer learns from marked samples according to a machine learning algorithm, and can thereby induce the distribution regularity of elements among the different classes in the samples. Using this induced distribution regularity, unmarked samples can be classified; that is, the unmarked elements are mapped onto the classes to which they belong.
In the prior art, there are various methods for classifying the credit data of a population. Common classification algorithms include: decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks.
In a classification algorithm, the induced distribution regularity of elements among the different classes is used: the marked samples serve as a training set from which parameters related to the attributes of the credit data are generated. These parameters influence the class to which an element belongs. They generally correspond to a particular classification algorithm, and the two together are referred to as a classification model, or classifier; the parameters are also called model parameters. To characterize the performance of a classifier, i.e. the accuracy with which the classification algorithm and its corresponding parameters classify credit data samples, a test set can be used. When a classifier classifies the elements of a test set, the more elements are correctly classified, the better the performance of the classifier.
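As a small illustration of this evaluation (the threshold classifier, attribute name, and test values below are invented for the example and are not taken from the application), a classifier's performance on a marked test set can be measured as the fraction of correctly classified elements:

```python
def accuracy(classifier, test_set):
    """Fraction of test elements whose predicted class matches the marked class."""
    correct = sum(1 for features, marked in test_set if classifier(features) == marked)
    return correct / len(test_set)

# Toy classifier: marks credit "good" when monthly income reaches a threshold.
clf = lambda x: "good" if x["income"] >= 5000 else "bad"
test = [({"income": 8000}, "good"),
        ({"income": 2000}, "bad"),
        ({"income": 6000}, "bad")]   # this element is misclassified
print(accuracy(clf, test))  # 2 of 3 elements correct
```

The higher this fraction, the better the classifier performs on that test set.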
In the course of realizing the prior art, the inventor found at least the following problems:
A population can be partitioned into several classes according to attributes such as age, education, and asset status. Generally, different classifiers perform differently when classifying different types of credit data. That is, for the same kind of population, the accuracy of different classifiers differs. No single classifier has an absolute advantage in accuracy over the global sample, i.e. over the whole population.
Accordingly, it is desirable to provide a technical scheme for classifying data with high accuracy over the global sample.
Summary of the invention
The embodiments of the present application provide a technical scheme for classifying data with high accuracy over the global sample.
Specifically, a data classification method includes:
selecting a primary training set of data with marked classification results;
selecting a first classification algorithm from a classification algorithm set;
optimizing, using the primary training set, the parameters corresponding to the first classification algorithm, to obtain primary parameters that meet expectations;
constructing a first-level classifier defined by the first classification algorithm and the primary parameters;
selecting a primary test set of data with marked classification results;
classifying the primary test set with the first-level classifier, generating a secondary training set composed of the test classification results and the marked classification results;
selecting a second classification algorithm from the classification algorithm set;
optimizing, using the secondary training set, the parameters corresponding to the second classification algorithm, to obtain secondary parameters that meet expectations;
constructing a second-level classifier defined by the second classification algorithm and the secondary parameters;
combining the first-level classifier and the second-level classifier into a combined classifier for classifying data;
classifying data with the combined classifier;
wherein the data are feature vectors of multi-dimensional attributes.
The embodiments of the present application also provide a data classification system, including:
a storage module, for storing the primary training set of data with marked classification results, the primary test set, and the classification algorithm set;
a modelling module, for:
selecting a primary training set of data with marked classification results;
selecting a first classification algorithm from the classification algorithm set;
optimizing, using the primary training set, the parameters corresponding to the first classification algorithm, to obtain primary parameters that meet expectations;
constructing a first-level classifier defined by the first classification algorithm and the primary parameters;
selecting a primary test set of credit data with marked classification results;
classifying the primary test set with the first-level classifier, generating a secondary training set composed of the test classification results and the marked classification results;
selecting a second classification algorithm from the classification algorithm set;
optimizing, using the secondary training set, the parameters corresponding to the second classification algorithm, to obtain secondary parameters that meet expectations;
constructing a second-level classifier defined by the second classification algorithm and the secondary parameters;
combining the first-level classifier and the second-level classifier into a combined classifier;
and a classification module, for classifying data with the combined classifier.
The data classification method and system provided by the embodiments of the present application have at least the following beneficial effect: by combining accurate classifiers, the accuracy of classification can be improved.
Brief description of the drawings
The accompanying drawings described herein provide a further understanding of the present application and constitute a part of it; the schematic embodiments and their description explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the data classification process provided by an embodiment of the present application.
Fig. 2 is a diagram of the relation between the primary training set and the secondary training set provided by an embodiment of the present application.
Fig. 3 is a flowchart of the data classification method provided by an embodiment of the present application.
Fig. 4 is a schematic structural diagram of the data classification system provided by an embodiment of the present application.
Detailed description
To make the purpose, technical scheme, and advantages of the present application clearer, the technical scheme is described clearly and completely below in conjunction with specific embodiments of the application and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative work fall within the scope of protection of the present application.
In the construction of a credit investigation system, big data technology is necessarily used, and machine learning and data mining algorithms are an important part of it. Through these algorithms and models, enterprise and personal credit can be quantitatively evaluated and predicted, so as to guide how resources such as assets and cash flow are invested in production at lower risk, improving production efficiency.
The data in the credit investigation system are feature vectors of multi-dimensional attributes. Specifically, for example, the data include, but are not limited to, attributes of dimensions such as name, sex, age, occupation, real estate, vehicles, securities, monthly income, monthly consumption, credit line, number of overdue payments, and maximum number of overdue days. After the attribute values or feature values of these dimensions are quantized, they can be used to quantitatively represent the credit level of an enterprise user or a personal user.
The distribution of the data follows a clustering rule. The population is divided into several classes, and the average credit value of the sample of each class can be used to assess the credit value of each element in that sample. An element here can refer to one person. Therefore, the accuracy of the credit value of each element in the sample depends on the accuracy of the classification of the elements in the sample.
Common classification algorithms include: decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks.
The decision tree is one of the common methods for classification and prediction. A decision tree algorithm induces classification rules from a training set of data with marked classification results; that is, it builds an attribute-class relation tree for the attributes of the samples. According to certain rules, the tree selects different attributes as its nodes, building the relation between attributes and classes. This attribute-class relation tree can be built by top-down recursion. The leaf nodes of the tree are the classes, the non-leaf nodes are attributes, and the edges between nodes are the different value ranges of the node's attribute. After the decision tree is built, an element that needs a class mark is compared attribute value by attribute value, from the root node downward, until it reaches some leaf node; the class corresponding to that leaf node is the class of the element. Common decision tree algorithms include ID3, C4.5/C5.0, and CART. These algorithms differ mainly in the strategy of attribute selection, the structure of the decision tree, whether and how pruning is used, whether large data sets can be processed, and so on. When the value ranges of the attributes are chosen reasonably, the classification accuracy is high. The parameters such as the attribute value ranges corresponding to the decision tree can be optimized through the training set. One realizable way is to select the parameters so that the decision tree algorithm has the highest classification accuracy for the data elements in the training set. Generally, a classification algorithm together with its corresponding parameters is referred to as a classification model, or classifier. Selecting more reasonable parameters is thus the optimization of the classification model, or of the classifier.
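As a rough sketch of this parameter optimization (a one-node "stump" rather than a full ID3/C4.5/CART tree, with an invented attribute and invented data), the value-range boundary of one attribute can be chosen by iterating over candidate boundaries and keeping the one with the highest training-set accuracy:

```python
def fit_stump(training_set, attr):
    """Iterate over candidate span boundaries for one attribute and keep the
    boundary that classifies the training set most accurately (class 1 when
    the attribute value is at or above the boundary)."""
    best_t, best_acc = None, -1.0
    for t in sorted({x[attr] for x, _ in training_set}):
        acc = sum(1 for x, y in training_set
                  if (x[attr] >= t) == (y == 1)) / len(training_set)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

train = [({"age": 25}, 0), ({"age": 30}, 0), ({"age": 40}, 1), ({"age": 55}, 1)]
t, acc = fit_stump(train, "age")
print(t, acc)  # the boundary 40 separates the two classes perfectly
```

A full decision tree algorithm applies this kind of selection recursively at every non-leaf node, together with pruning strategies.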
The Bayesian classification algorithm classifies elements based on the Bayesian formula from probability theory. Using the Bayesian formula, the algorithm calculates the conditional probability that an element belongs to each class, and selects the class with the maximum conditional probability as its class. Common Bayesian classification algorithms include naive Bayes and Bayesian networks. The difference between them is whether the attributes are assumed to be conditionally independent: naive Bayes assumes conditional independence between the attributes, while a Bayesian network assumes that some attributes are correlated. As with decision tree algorithms, the correlation between attributes can also be regarded as a parameter corresponding to the classification algorithm.
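A minimal naive-Bayes sketch under the conditional-independence assumption described above (the toy attributes and samples are invented, and no probability smoothing is applied):

```python
from collections import Counter, defaultdict

def fit_naive_bayes(training_set):
    """Estimate P(class) and P(attribute=value | class) from marked samples,
    then classify by the class with the maximum conditional probability."""
    class_counts = Counter(y for _, y in training_set)
    value_counts = defaultdict(Counter)          # (class, attr) -> value tally
    for x, y in training_set:
        for attr, value in x.items():
            value_counts[(y, attr)][value] += 1

    def predict(x):
        def score(c):                            # proportional to P(c | x)
            p = class_counts[c] / len(training_set)
            for attr, value in x.items():
                p *= value_counts[(c, attr)][value] / class_counts[c]
            return p
        return max(class_counts, key=score)
    return predict

train = [({"job": "eng", "house": "yes"}, 1),
         ({"job": "eng", "house": "no"}, 1),
         ({"job": "none", "house": "no"}, 0)]
nb = fit_naive_bayes(train)
print(nb({"job": "eng", "house": "yes"}))   # -> 1
```

A Bayesian network would replace the per-attribute independent factors with factors conditioned on correlated attributes.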
The k-nearest-neighbour algorithm is an element-based classification algorithm. The algorithm first defines a neighbourhood, i.e. sets the number of neighbours. Then the class of an element is decided by voting, with the minority yielding to the majority: the class of an element is the class held by the majority of its neighbour elements. Euclidean distance is typically used, i.e. the K marked samples with the smallest Euclidean distance are taken as an element's neighbours. Voting can be done either with equal neighbours or with neighbour weights. Voting with neighbour weights means the opinions of different neighbours carry different weight; generally, the closer the neighbour, the larger its weight. Likewise, the number of neighbours here can also be regarded as a parameter corresponding to the classification algorithm.
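A small sketch of this neighbour vote (Euclidean distance, with the neighbour count k as the tunable parameter and optional distance weighting; the points below are invented):

```python
import math

def knn_predict(training_set, x, k, weighted=False):
    """Classify x by the (optionally distance-weighted) vote of its k nearest
    marked neighbours under Euclidean distance."""
    dist = lambda a: math.sqrt(sum((ai - xi) ** 2 for ai, xi in zip(a, x)))
    neighbours = sorted(training_set, key=lambda s: dist(s[0]))[:k]
    votes = {}
    for xi, yi in neighbours:
        w = 1.0 / (dist(xi) + 1e-9) if weighted else 1.0  # closer -> larger weight
        votes[yi] = votes.get(yi, 0.0) + w
    return max(votes, key=votes.get)

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (0.5, 0.5), k=3))                 # -> A
print(knn_predict(train, (5.0, 5.5), k=3, weighted=True))  # -> B
```

Changing k or switching the weighting changes the classification behaviour, which is why these are treated as parameters to be optimized on a training set.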
For classification algorithms such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, the training sample error, the classification error, the weight values of the attributes, and so on can be regarded as the parameters corresponding to the classification algorithm. By optimizing the parameters corresponding to a classification algorithm through a training set, the accuracy of data classification can be improved.
Referring to Fig. 1, the data classification method provided by an embodiment of the present application specifically includes the following steps:
S01: selecting a primary training set of data with marked classification results.
Table 1
Table 1 is a schematic list of a data set with marked classification results. In the list, the data set of all users serves as one sample, and correspondingly, one user serves as one sample element. Each element can have attributes of multiple dimensions, such as age and position. The classification result of each element in the sample can be marked, for example with C1, C2, C3; specifically, C1, C2, C3 can each take the value 0 or 1.
The primary training set selected here can be a randomly chosen part of the data set.
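As a hedged sketch of this representation (the attribute names and values below are invented placeholders for the schematic Table 1), each user is one sample element: a feature vector of multi-dimensional attributes plus the class marks C1, C2, C3, each taking the value 0 or 1; the primary training set is a randomly drawn part of the set:

```python
import random

# Each element: a feature vector over several attribute dimensions, plus the
# marked classification result encoded as C1/C2/C3 values of 0 or 1.
data_set = [
    {"age": 34, "position": "engineer", "income": 9000, "C1": 1, "C2": 0, "C3": 0},
    {"age": 52, "position": "manager",  "income": 6000, "C1": 0, "C2": 1, "C3": 0},
    {"age": 23, "position": "clerk",    "income": 1500, "C1": 0, "C2": 0, "C3": 1},
    {"age": 45, "position": "teacher",  "income": 4000, "C1": 0, "C2": 1, "C3": 0},
]
# Exactly one class mark is set for each element.
assert all(e["C1"] + e["C2"] + e["C3"] == 1 for e in data_set)

random.seed(0)
primary_training_set = random.sample(data_set, k=3)  # randomly chosen part
print(len(primary_training_set))  # 3
```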
S02: selecting a first classification algorithm from a classification algorithm set.
The classification algorithm set is a set of algorithms suitable for classification. It can include many algorithms, such as decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks. The decision tree, Bayesian classifier, and k-nearest-neighbour algorithms have been briefly explained above; the remaining algorithms, such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, are well described in the machine learning literature and are not repeated here. In this step, the embodiment of the present application selects one algorithm from the set; of course, the step can also be repeated, so as to select multiple classification algorithms.
S03: optimizing, using the primary training set, the parameters corresponding to the first classification algorithm, to obtain primary parameters that meet expectations.
For a decision tree algorithm, the value ranges of the attribute values or feature values influence the classification results of the data elements. Further, different attributes can influence the classification results to different degrees. In the process of inducing classification rules with a decision tree algorithm, a series of value ranges of the attribute values can be assumed, and a series of weight values of the different attributes can also be assumed. The optimal value ranges of the attribute values, or the optimal weight values of the different attributes, are calculated iteratively, so that the classification accuracy of the decision tree algorithm for the data elements of the primary training set meets an expected value, or is the highest.
For a Bayesian classifier algorithm, the correlation between different attributes influences the classification results of the data elements. Further, different attributes can influence the classification results to different degrees. In the process of inducing classification rules with a Bayesian classifier algorithm, a series of degrees of correlation between some attributes can be assumed, and a series of weight values of the different attributes can also be assumed. The optimal correlation coefficients between attributes, or the optimal weight values of the different attributes, are calculated iteratively, so that the classification accuracy of the Bayesian classifier algorithm for the data elements of the primary training set meets an expected value, or is the highest.
For a k-nearest-neighbour algorithm, the number of neighbours of a data element influences its classification result. Further, different attributes can influence the classification results to different degrees. In the process of inducing classification rules with a k-nearest-neighbour algorithm, a series of values of the number of neighbours can be assumed, and a series of weight values of the different attributes can also be assumed. The optimal number of neighbours, or the optimal weight values of the different attributes, are calculated iteratively, so that the classification accuracy of the k-nearest-neighbour algorithm for the data elements of the primary training set meets an expected value, or is the highest.
Of course, for other classification algorithms, such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, the parameters corresponding to the algorithm may be the same as or different from those described above. Finally, by applying the primary training set to the classification algorithm, primary parameters corresponding to the first classification algorithm that meet expectations can be obtained. For a support vector machine algorithm, the primary parameters can include the training sample error, the classification error, and so on; for association-rule-based classifiers, ensemble learning, artificial neural networks, and the like, the primary parameters can include the weight values of the attributes.
S04: constructing a first-level classifier defined by the first classification algorithm and the primary parameters.
Just as the primary parameters of the classification algorithms listed above differ in some special categories, different classification algorithms can also share common primary parameters. A classification algorithm and its corresponding primary parameters constitute a classifier, or classification model, that classifies the data samples. These primary parameters can also be regarded as the model parameters of the classification model.
S05: selecting a primary test set of data with marked classification results.
In the embodiment of the present application, a data set with marked classification results can be selected to test the classification accuracy of the classifier.
Further, another embodiment of the application also provides a method of selecting the primary test set. Specifically:
randomly partition the data set into N sub-sets of equal sample size;
take one of the sub-sets as the primary test set;
let the remaining N-1 sub-sets be the primary training set corresponding to that primary test set.
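A minimal sketch of this selection (N and the data below are invented for illustration): the data set is randomly partitioned into N equal sub-sets, one is held out as the primary test set, and the rest are pooled as the corresponding primary training set:

```python
import random

def random_partition(data_set, n):
    """Randomly split the data set into n sub-sets of equal sample size."""
    shuffled = data_set[:]
    random.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]

random.seed(1)
data = list(range(12))
folds = random_partition(data, 4)
primary_test_set = folds[0]                                # one held-out sub-set
primary_training_set = [e for fold in folds[1:] for e in fold]  # the other N-1
print(len(primary_test_set), len(primary_training_set))    # 3 9
```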
S06: classifying the primary test set with the first-level classifier, generating a secondary training set composed of the test classification results and the marked classification results.
In one realizable way provided by the embodiment of the present application, the data set S is randomly partitioned into J sub-sets of roughly equal size. One sub-set S_j is selected as the primary test set, and the remaining J-1 sub-sets serve as the primary training set corresponding to that primary test set. From the classification algorithm set {z_1, z_2, ..., z_K}, the k-th algorithm, k ∈ [1, K], is selected in turn and trained with the primary training set, obtaining a classifier, or classification model, M_{k,-j}, where -j indicates that the j-th sub-set S_j is the primary test set and the J-1 sub-sets other than S_j are the training set. Then the classifier M_{k,-j} is tested with the primary test set, yielding a classification result Z_{k,j}: the result of classifying the j-th sub-set S_j with the first-level classifier corresponding to the k-th algorithm. The test classification results and the marked classification results constitute the secondary training set, i.e. {Z_{1,j}, Z_{2,j}, ..., Z_{K,j}, Y_j}, where Y_j denotes the marked classification results of the j-th sub-set S_j.
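The loop above can be sketched as follows (the two threshold "algorithms" and the tiny folds are invented stand-ins for the K algorithms of the set): each first-level classifier is trained on the sub-sets other than j, tested on the held-out sub-set, and its test classification results Z_{k,j} are paired with the marked labels Y_j:

```python
def build_secondary_training_set(folds, algorithms):
    """For each sub-set j held out as primary test set, train every algorithm on
    the remaining sub-sets, classify the held-out elements, and pair the test
    classification results (Z_1j ... Z_Kj) with the marked label Y_j."""
    secondary = []
    for j, test_fold in enumerate(folds):
        train = [s for i, fold in enumerate(folds) if i != j for s in fold]
        classifiers = [fit(train) for fit in algorithms]   # M_{k,-j}
        for x, y in test_fold:
            z = tuple(clf(x) for clf in classifiers)       # (Z_1j, ..., Z_Kj)
            secondary.append((z, y))                       # paired with Y_j
    return secondary

def make_threshold_learner(idx):
    """Toy 'algorithm': thresholds one feature at its training-set mean."""
    def fit(train):
        t = sum(x[idx] for x, _ in train) / len(train)
        return lambda x: 1 if x[idx] >= t else 0
    return fit

folds = [[((0, 9), 0), ((8, 1), 1)],
         [((1, 8), 0), ((9, 0), 1)]]
sec = build_secondary_training_set(folds, [make_threshold_learner(0),
                                           make_threshold_learner(1)])
print(sec[0])  # ((0, 1), 0): both first-level predictions, plus the marked label
```

The resulting pairs are exactly the material on which the second-level classifier of steps S07-S09 is trained.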
S07: selecting a second classification algorithm from the classification algorithm set.
Similar to step S02, another classification algorithm can be selected here. Of course, the classification algorithm here can be one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks.
S08: optimizing, using the secondary training set, the parameters corresponding to the second classification algorithm, to obtain secondary parameters that meet expectations.
Similar to step S03, parameters that meet expectations can be obtained here. The secondary parameters can include at least one of: the value ranges of the test classification results, the correlation between the test classification results, the number of neighbours of the test classification results, the training sample error of the test classification results, the classification error of the test classification results, and the weight values of the test classification results.
S09: constructing a second-level classifier defined by the second classification algorithm and the secondary parameters.
Similar to step S04, a second-level classifier defined by the second classification algorithm and the secondary parameters can be obtained here.
S10: combining the first-level classifier and the second-level classifier into a combined classifier for classifying data.
Further, in another embodiment provided by the application, combining the first-level classifier and the second-level classifier into a combined classifier for classifying data specifically includes:
repeatedly extracting two classification algorithms from the classification algorithm set and building different candidate combined classifiers;
selecting a secondary test set of data with marked classification results;
counting the accuracy with which the different candidate combined classifiers classify the secondary test set;
selecting the candidate combined classifier with the highest accuracy;
classifying data with the selected combined classifier.
The primary parameters are parameters that constrain the attributes of the data. The secondary parameters are parameters that constrain the test classification results and the marked classification results, which in the end are still parameters that constrain the attributes of the data. Therefore, the parameters corresponding to the combined classifier are still parameters that constrain the attributes of the data, and the combined classifier can classify data.
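The candidate-selection loop can be sketched as follows (the candidate combined classifiers and the secondary test set below are invented stand-ins): each candidate is scored on the marked secondary test set and the most accurate one is kept:

```python
def select_best_combination(candidates, secondary_test_set):
    """Count each candidate combined classifier's accuracy on the secondary
    test set and keep the candidate with the highest accuracy."""
    def accuracy(clf):
        correct = sum(1 for x, y in secondary_test_set if clf(x) == y)
        return correct / len(secondary_test_set)
    return max(candidates, key=accuracy)

# Two hypothetical candidate combined classifiers over a 2-feature input.
candidates = [lambda x: 1 if x[0] > 0.5 else 0,
              lambda x: 1 if x[1] > 0.5 else 0]
secondary_test = [((0.9, 0.1), 1), ((0.1, 0.9), 0), ((0.8, 0.2), 1)]
best = select_best_combination(candidates, secondary_test)
print(best((0.7, 0.3)))  # the first candidate wins; it predicts 1 here
```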
S11: classifying data with the combined classifier.
In the embodiment provided by the application, the parameters corresponding to the first classification algorithm are optimized using the primary training set to obtain primary parameters that meet expectations, and a first-level classifier defined by the first classification algorithm and the primary parameters is constructed; by this step, an optimized first-level classifier corresponding to each classification algorithm can be obtained. Further, the primary test set is classified with the first-level classifiers to generate test classification results, from which the first-level classifier with the highest accuracy can be selected; that is, by this step, the optimal first-level classifier among the various classification algorithms can be obtained. Further, the parameters corresponding to the second classification algorithm are optimized using the secondary training set to obtain secondary parameters that meet expectations, and a second-level classifier defined by the second classification algorithm and the secondary parameters is constructed; by this step, the second-level classifier most strongly complementary to the first-level classifier can be obtained. That is, by combining these steps, the combined classifier with the highest classification accuracy is finally obtained.
The above is the data classification method provided by the embodiments of the present application. Based on the same idea, and referring to Fig. 4, the application also provides a data classification system 1, including:
a storage module 11, for storing the primary training set of data with marked classification results, the primary test set, and the classification algorithm set;
a modelling module 12, for:
selecting a primary training set of data with marked classification results;
selecting a first classification algorithm from the classification algorithm set;
optimizing, using the primary training set, the parameters corresponding to the first classification algorithm, to obtain primary parameters that meet expectations;
constructing a first-level classifier defined by the first classification algorithm and the primary parameters;
selecting a primary test set of data with marked classification results;
classifying the primary test set with the first-level classifier, generating a secondary training set composed of the test classification results and the marked classification results;
selecting a second classification algorithm from the classification algorithm set;
optimizing, using the secondary training set, the parameters corresponding to the second classification algorithm, to obtain secondary parameters that meet expectations;
constructing a second-level classifier defined by the second classification algorithm and the secondary parameters;
combining the first-level classifier and the second-level classifier into a combined classifier;
and a classification module 13, for classifying data with the combined classifier.
Further, in another embodiment provided by the application, the storage module 11 stores a data set with marked classification results, and the modelling module 12 is used to:
randomly partition the data set into J sub-sets of equal sample size;
take one of the sub-sets as the primary test set;
let the remaining J-1 sub-sets be the primary training set corresponding to that primary test set.
Further, in another embodiment provided by the application, the first classification algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks; the primary parameters include at least one of the value ranges of the attributes, the correlation between the attributes, the number of neighbours, the training sample error, the classification error, and the weight values of the attributes.
Further, in another embodiment provided by the application, the second classification algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks; the secondary parameters include at least one of the value ranges of the test classification results, the correlation between the test classification results, the number of neighbours of the test classification results, the training sample error of the test classification results, the classification error of the test classification results, and the weight values of the test classification results.
Further, in another embodiment provided by the application, the modeling module 12, when combining the first-level classifier and the secondary classifier to form an assembled classifier for classifying data, is specifically used to:
repeatedly extract two sorting algorithms from the sorting algorithm collection, and build a different candidate assembled classifier from each pair;
select a secondary test set in which the classification results of the data have been marked;
count the accuracy with which each candidate assembled classifier classifies the secondary test set;
select the candidate assembled classifier with the highest accuracy;
classify data using the selected assembled classifier.
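The candidate-selection loop described above can be sketched as follows. This is illustrative Python, not the patent's implementation: `build_combined` stands in for the two-level construction that the patent describes separately, and accuracy is a simple fraction of correct labels on the secondary test set:

```python
from itertools import permutations

def select_best_combination(algorithms, build_combined, secondary_test_set):
    """For every ordered pair of distinct sorting algorithms, build a candidate
    assembled classifier, count its accuracy on the marked secondary test set,
    and return the most accurate candidate together with its accuracy."""
    best, best_acc = None, -1.0
    for first, second in permutations(algorithms, 2):
        candidate = build_combined(first, second)
        correct = sum(1 for x, label in secondary_test_set
                      if candidate(x) == label)
        acc = correct / len(secondary_test_set)
        if acc > best_acc:
            best, best_acc = candidate, acc
    return best, best_acc
```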
In the embodiments provided by the application, the primary training set is used to optimise the parameter corresponding to the first sorting algorithm until a primary parameter that meets the requirement is obtained, and the first-level classifier defined by the first sorting algorithm and the primary parameter is built; through this step an optimised first-level classifier can be obtained for each sorting algorithm. Further, the primary test set is classified with the first-level classifiers to generate testing classification results, so that the first-level classifier with the highest accuracy on the testing classification results can be selected; that is, this step yields the best first-level classifier among the various sorting algorithms. Further, the secondary training set is used to optimise the parameter corresponding to the second sorting algorithm until a secondary parameter that meets the requirement is obtained, and the secondary classifier defined by the second sorting algorithm and the secondary parameter is built; through this step the secondary classifier most strongly complementary to the first-level classifier can be obtained. The combination of these steps finally yields the assembled classifier with the highest classification accuracy.
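Taken together, the two-level scheme summarised above can be sketched as follows. This is a hedged illustration under stated assumptions, not the patent's implementation: `fit` stands in for per-algorithm parameter optimisation, and the secondary classifier is reduced to a trivial lookup corrector trained on (first-level prediction, marked label) pairs:

```python
def fit_stacked_classifier(primary_train, primary_test, fit, algorithms):
    """Sketch of the two-level scheme (names are illustrative): keep the
    first-level classifier most accurate on the primary test set, then fit a
    trivial second-level corrector on its predictions."""
    def accuracy(classifier, data):
        return sum(1 for x, y in data if classifier(x) == y) / len(data)

    # Level 1: one optimised classifier per sorting algorithm; keep the best.
    level1 = max((fit(algo, primary_train) for algo in algorithms),
                 key=lambda clf: accuracy(clf, primary_test))

    # Level 2: the secondary training set pairs the level-1 predictions on the
    # primary test set with the already-marked labels; here the secondary
    # classifier is reduced to a lookup table for illustration.
    correction = {level1(x): y for x, y in primary_test}

    def assembled(x):
        prediction = level1(x)
        return correction.get(prediction, prediction)

    return assembled
```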
It should be understood by those skilled in the art that embodiments of the invention may be provided as a method, a system, or a computer program product. Therefore, the invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The invention is described with reference to flow charts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks therein, can be realised by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realising the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can guide a computer or another programmable data processing device to work in a specific way, so that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, the instruction device realising the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for realising the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can realise information storage by any method or technology. The information can be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other kinds of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassette tapes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device including a series of elements not only includes those elements but also includes other elements not expressly listed, or further includes elements inherent to such a process, method, commodity, or device. In the absence of further restrictions, an element limited by the sentence "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
It will be understood by those skilled in the art that embodiments of the application may be provided as a method, a system, or a computer program product. Therefore, the application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The foregoing describes only embodiments of the application and does not limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the application shall be included within the scope of the claims of the application.
Claims (10)
1. A sorting technique of data, characterised in that it includes:
selecting a primary training set in which the classification results of the data have been marked;
selecting a first sorting algorithm from a sorting algorithm collection;
using the primary training set, optimising the parameter corresponding to the first sorting algorithm until a primary parameter that meets the requirement is obtained;
building the first-level classifier defined by the first sorting algorithm and the primary parameter;
selecting a primary test set in which the classification results of the data have been marked;
classifying the primary test set with the first-level classifier, and generating a secondary training set composed of the testing classification results and the marked classification results;
selecting a second sorting algorithm from the sorting algorithm collection;
using the secondary training set, optimising the parameter corresponding to the second sorting algorithm until a secondary parameter that meets the requirement is obtained;
building the secondary classifier defined by the second sorting algorithm and the secondary parameter;
combining the first-level classifier and the secondary classifier to form an assembled classifier for classifying data;
classifying data using the assembled classifier;
wherein the data are feature vectors of multidimensional attributes.
2. The method of claim 1, characterised in that the primary training set and the primary test set satisfy the following relation:
the data set is randomly divided into J sub data sets of equal sample size;
one of the sub data sets is taken as the primary test set;
the remaining J-1 sub data sets are taken as the primary training set corresponding to the primary test set.
3. The method of claim 1, characterised in that the first sorting algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, classifiers based on association rules, ensemble learning, and artificial neural networks;
the primary parameter includes at least one of the value range of an attribute, the relevance between attributes, the number of neighbours, the training sample error, the classification error, and the weight value of an attribute.
4. The method of claim 1, characterised in that the second sorting algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, classifiers based on association rules, ensemble learning, and artificial neural networks;
the secondary parameter includes at least one of the value range of the testing classification results, the relevance between testing classification results, the number of neighbours of the testing classification results, the training sample error of the testing classification results, the classification error of the testing classification results, and the weight value of the testing classification results.
5. The method of claim 1, characterised in that combining the first-level classifier and the secondary classifier to form an assembled classifier for classifying data specifically includes:
repeatedly extracting two sorting algorithms from the sorting algorithm collection, and building a different candidate assembled classifier from each pair;
selecting a secondary test set in which the classification results of the data have been marked;
counting the accuracy with which each candidate assembled classifier classifies the secondary test set;
selecting the candidate assembled classifier with the highest accuracy;
classifying the credit data using the selected assembled classifier.
6. A categorising system of data, characterised in that it includes:
a storage module, for storing a primary training set in which the classification results of the data have been marked, a primary test set, and a sorting algorithm collection;
a modeling module, used to:
select a primary training set in which the classification results of the data have been marked;
select a first sorting algorithm from the sorting algorithm collection;
use the primary training set to optimise the parameter corresponding to the first sorting algorithm until a primary parameter that meets the requirement is obtained;
build the first-level classifier defined by the first sorting algorithm and the primary parameter;
select a primary test set in which the classification results of the credit data have been marked;
classify the primary test set with the first-level classifier, and generate a secondary training set composed of the testing classification results and the marked classification results;
select a second sorting algorithm from the sorting algorithm collection;
use the secondary training set to optimise the parameter corresponding to the second sorting algorithm until a secondary parameter that meets the requirement is obtained;
build the secondary classifier defined by the second sorting algorithm and the secondary parameter;
combine the first-level classifier and the secondary classifier to form an assembled classifier;
a classification module, for classifying data using the assembled classifier.
7. The categorising system of claim 6, characterised in that the storage module stores a data set in which the classification results of the data have been marked;
the modeling module is used to:
randomly divide the data set into J sub data sets of equal sample size;
take one of the sub data sets as the primary test set;
take the remaining J-1 sub data sets as the primary training set corresponding to the primary test set.
8. The categorising system of claim 6, characterised in that the first sorting algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, classifiers based on association rules, ensemble learning, and artificial neural networks;
the primary parameter includes at least one of the value range of an attribute, the relevance between attributes, the number of neighbours, the training sample error, the classification error, and the weight value of an attribute.
9. The categorising system of claim 6, characterised in that the second sorting algorithm includes at least one of decision trees, Bayesian classifiers, k-nearest neighbours, support vector machines, classifiers based on association rules, ensemble learning, and artificial neural networks;
the secondary parameter includes at least one of the value range of the testing classification results, the relevance between testing classification results, the number of neighbours of the testing classification results, the training sample error of the testing classification results, the classification error of the testing classification results, and the weight value of the testing classification results.
10. The categorising system of claim 6, characterised in that the modeling module, when combining the first-level classifier and the secondary classifier to form an assembled classifier for classifying data, is specifically used to:
repeatedly extract two sorting algorithms from the sorting algorithm collection, and build a different candidate assembled classifier from each pair;
select a secondary test set in which the classification results of the data have been marked;
count the accuracy with which each candidate assembled classifier classifies the secondary test set;
select the candidate assembled classifier with the highest accuracy;
classify data using the selected assembled classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511020318.7A CN106934410A (en) | 2015-12-30 | 2015-12-30 | The sorting technique and system of data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511020318.7A CN106934410A (en) | 2015-12-30 | 2015-12-30 | The sorting technique and system of data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934410A true CN106934410A (en) | 2017-07-07 |
Family
ID=59441495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511020318.7A Pending CN106934410A (en) | 2015-12-30 | 2015-12-30 | The sorting technique and system of data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934410A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107560845A (en) * | 2017-09-18 | 2018-01-09 | 华北电力大学 | A kind of Fault Diagnosis of Gear Case method for building up and device |
CN109087145A (en) * | 2018-08-13 | 2018-12-25 | 阿里巴巴集团控股有限公司 | Target group's method for digging, device, server and readable storage medium storing program for executing |
CN109324604A (en) * | 2018-11-29 | 2019-02-12 | 中南大学 | A kind of intelligent train resultant fault analysis method based on source signal |
CN111861055A (en) * | 2019-04-28 | 2020-10-30 | 阿里巴巴集团控股有限公司 | Resource scheduling method, device and platform |
CN110134646A (en) * | 2019-05-24 | 2019-08-16 | 安徽芃睿科技有限公司 | The storage of knowledge platform service data and integrated approach and system |
CN110134646B (en) * | 2019-05-24 | 2021-09-07 | 安徽芃睿科技有限公司 | Knowledge platform service data storage and integration method and system |
CN112396114A (en) * | 2020-11-20 | 2021-02-23 | 中国科学院深圳先进技术研究院 | Evaluation system, evaluation method and related product |
CN112507170A (en) * | 2020-12-01 | 2021-03-16 | 平安医疗健康管理股份有限公司 | Data asset directory construction method based on intelligent decision and related equipment thereof |
CN112801233A (en) * | 2021-04-07 | 2021-05-14 | 杭州海康威视数字技术股份有限公司 | Internet of things equipment honeypot system attack classification method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934410A (en) | The sorting technique and system of data | |
US11797838B2 (en) | Efficient convolutional network for recommender systems | |
Bifet et al. | Extremely fast decision tree mining for evolving data streams | |
CN102567464B (en) | Based on the knowledge resource method for organizing of expansion thematic map | |
Utari et al. | Implementation of data mining for drop-out prediction using random forest method | |
US8615478B2 (en) | Using affinity measures with supervised classifiers | |
CN106528874A (en) | Spark memory computing big data platform-based CLR multi-label data classification method | |
CN101807254A (en) | Implementation method for data characteristic-oriented synthetic kernel support vector machine | |
Yu et al. | Decision tree modeling for ranking data | |
AlMana et al. | An overview of inductive learning algorithms | |
CN110689368A (en) | Method for designing advertisement click rate prediction system in mobile application | |
CN105808582A (en) | Parallel generation method and device of decision tree on the basis of layered strategy | |
Jha et al. | Criminal behaviour analysis and segmentation using k-means clustering | |
CN107066328A (en) | The construction method of large-scale data processing platform | |
CN107193940A (en) | Big data method for optimization analysis | |
Zhang et al. | Research on borrower's credit classification of P2P network loan based on LightGBM algorithm | |
Farooq | Genetic algorithm technique in hybrid intelligent systems for pattern recognition | |
Bakhtyar et al. | Freight transport prediction using electronic waybills and machine learning | |
Zeng et al. | Decision tree classification model for popularity forecast of Chinese colleges | |
Jesus et al. | Dynamic feature selection based on pareto front optimization | |
CN109784354A (en) | Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness | |
Woma et al. | Comparisons of community detection algorithms in the YouTube network | |
Gupta et al. | Feature selection: an overview | |
Ma | The Research of Stock Predictive Model based on the Combination of CART and DBSCAN | |
CN107103095A (en) | Method for computing data based on high performance network framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170707 |
|
RJ01 | Rejection of invention patent application after publication |