CN107203774A - Method and device for predicting the category to which data belongs - Google Patents

Method and device for predicting the category to which data belongs Download PDF

Info

Publication number
CN107203774A
CN107203774A CN201610153303.6A
Authority
CN
China
Prior art keywords
data
data group
category
judgment
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610153303.6A
Other languages
Chinese (zh)
Inventor
杜玮
王晓光
施兴
景艺亮
漆远
褚崴
张柯
张舒
余舟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201610153303.6A priority Critical patent/CN107203774A/en
Publication of CN107203774A publication Critical patent/CN107203774A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a method for predicting the category to which data belongs, including: obtaining a number of data records whose categories have already been determined; packing the records that characterize the same user into one data group; for each data group, determining the category of the data group according to the categories of the records it contains; training a decision tree for classifying data groups, with each data group treated as one training sample; and, when a request to predict the category of data to be predicted is received, predicting that category using the decision tree. Using a whole data group as one training sample, rather than each record in a data group as a separate sample, allows the model parameters of the decision tree used as the operational model to be set more reasonably, thereby improving the accuracy with which the decision tree predicts the category of data.

Description

Method and device for predicting the category to which data belongs
Technical field
The present application relates to the field of machine learning, and in particular to a method and device that use a machine learning algorithm in a big-data environment to predict the category to which data belongs.
Background technology
In the mining and use of user data, a user's overdue and non-overdue records bear on the evaluation or assessment of that user's credit and therefore play an important role. With the rapid development of internet finance, credit data now reaches a scale of hundreds of millions of records. Evaluating user credit reasonably on data of this scale, and thereby controlling credit risk, depends on an operational model. A reasonable operational model requires that the model parameters set in it be reasonable. A machine learning algorithm can learn from historical credit data so that the model parameters set in the operational model tend toward reasonable values. These historical credit data are usually divided, according to their function for the operational model, into a training set and a test set. The training set is used to propose hypotheses about the model parameters, and the test set is used to verify the hypothesized parameters. When the model parameters set in the operational model allow it to classify the historical credit data in the test set with a qualified accuracy rate, the operational model can be considered reasonable.
Statistically, the training set can be understood as a sample, and correspondingly each credit record in the training set can be understood as an individual in the sample. A single credit record, in other words a credit-data individual, can contain feature attributes in multiple dimensions such as the user's overdue duration, overdue amount, assets, age, and occupation; the specific numerical values are the corresponding feature values. The model parameters of the operational model can include the setting of the specific value ranges of the feature values, the degree of correlation between feature attributes, and so on.
In the course of implementing the prior art, the inventors found that the prior art has at least the following problems:
In current operational models, the model parameters are trained with a single credit-data individual as the training unit. Suppose a user repays late in the current month but repaid normally, that is, without any overdue repayment, in the previous month. When these two credit-data individuals are used to train the model parameters of the operational model by means of a machine learning algorithm, their feature values are almost identical yet their final classification results differ, so such credit-data individuals cannot indicate a direction in which to improve the model parameters.
As a result, the accuracy with which the resulting operational model classifies a user's credit data is easily reduced.
Accordingly, it is desirable to provide a method and device for predicting the category to which data belongs, so that a machine learning algorithm can set the model parameters of the operational model more reasonably, thereby improving the accuracy with which the operational model predicts the category of data.
Summary of the invention
The embodiments of the present application provide a method for predicting the category to which data belongs with high prediction accuracy.
Specifically, a method for predicting the category to which data belongs includes:
obtaining a number of data records whose categories have already been determined;
packing the records that characterize the same user into one data group;
for each data group, determining the category of the data group according to the categories of the records it contains;
training a decision tree for classifying data groups, with each data group treated as one training sample;
when a request to predict the category of data to be predicted is received, predicting the category of the data to be predicted using the decision tree.
The embodiments of the present application also provide a device for predicting the category to which data belongs, including:
a receiving module, configured to obtain a number of data records whose categories have already been determined;
a packing module, configured to pack the records that characterize the same user into one data group;
a preprocessing module, configured to determine, for each data group, the category of the data group according to the categories of the records it contains;
a modeling module, configured to train a decision tree for classifying data groups, with each data group treated as one training sample;
a prediction module, configured to predict, when a request to predict the category of data to be predicted is received, the category of the data to be predicted using the decision tree.
The method and device for predicting the category to which data belongs provided by the embodiments of the present application have at least the following beneficial effect:
Using a whole data group as one training sample, rather than each record in a data group as a separate training sample, allows the model parameters of the decision tree used as the operational model to be set more reasonably, thereby improving the accuracy with which the decision tree predicts the category of data.
Brief description of the drawings
The accompanying drawings described here are provided to give a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of the method for predicting the category to which data belongs provided by an embodiment of the present application.
Fig. 2 is a schematic diagram of a decision tree provided by an embodiment of the present application.
Fig. 3 is a schematic structural diagram of the device for predicting the category to which data belongs provided by an embodiment of the present application.
Embodiment
To make the purpose, technical solution, and advantages of the present application clearer, the technical solution of the present application is described clearly and completely below in conjunction with specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative work fall within the scope of protection of the present application.
Referring to Fig. 1, the prediction of the category to which data to be classified belongs can be carried out by a computer according to the following steps:
S100: Obtain a number of data records whose categories have already been determined.
The server first obtains a number of data records whose categories have already been determined. These records can be used to adjust the parameters of the operational model that predicts the category of data, that is, the feature attributes and feature values of the nodes of the subsequent decision tree. The way these records are used to adjust the feature attributes and feature values of the decision tree nodes is described in detail later and is not repeated here. "A number of" is generally understood to mean that the scale of these records reaches a certain sample size.
Specifically, for example, the server of a financial institution that grants loans can obtain the credit records of the institution's registered users over the past three years, such as a number of credit records of the form "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 0 months; overdue repayment count: 0; ...".
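For concreteness in the illustrative sketches that follow, such a credit record could be represented as a plain dictionary; the field names are hypothetical and are not taken from the application itself:

```python
# hypothetical representation of one credit record
record = {
    "user_id": "A",
    "properties": 2,
    "marital_status": "married",
    "monthly_income": 5000,
    "overdue_months": 0,
    "overdue_count": 0,
}
```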
S200: Pack the records that characterize the same user into one data group.
The server can pack these records into a number of data groups according to the users' identification codes. The records stored in each data group carry the identification code of the same user. A user's identification code can be, for example, the user's ID card number.
Specifically, for example, a user's credit records can be divided by some period. The user's credit records over the past three years can be divided by a suitable period such as a natural season, a calendar year, a fiscal quarter, or a fiscal year, and the credit records belonging to the same user are packed into one data group. Credit records of different users are not placed in the same data group. In this way, all the records can be packed into a number of data groups, as illustrated by the sketch below.
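As an illustration only, this grouping step could be sketched as follows; the field name user_id and the period_of callback are hypothetical:

```python
from collections import defaultdict

def pack_into_data_groups(records, period_of):
    """Pack credit records into data groups keyed by (user id, period).

    records: iterable of record dicts, each carrying a 'user_id' field
    period_of: function mapping a record to its period, e.g. its fiscal quarter
    """
    groups = defaultdict(list)
    for record in records:
        # records of the same user in the same period end up in one data group
        groups[(record["user_id"], period_of(record))].append(record)
    return groups
```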
S300: For each data group, determine the category of the data group according to the categories of the records it contains.
Further, in another embodiment provided by the present application, determining the category of the data group according to the categories of the records contained in the data group specifically includes:
when any of the records contained in a data group belongs to the first category, determining that the category of the data group is the first category;
when all the records contained in a data group belong to the second category, determining that the category of the data group is the second category.
Specifically, for example, the credit records of user A are packed into one data group. Suppose that, with the fiscal quarter as the period, user A's data group for one fiscal year contains a credit record such as "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 1 month; overdue repayment count: 1; ...". Then the category of the data group is determined to be the first category. The meaning of the data group belonging to the first category is: as long as the user corresponding to the data group has any credit record of overdue repayment, the category of the data group is the first category.
Suppose instead that, with the fiscal quarter as the period, all the credit records in user A's data group for one fiscal year are "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 0 months; overdue repayment count: 0; ...". That is, none of the four fiscal-quarter credit records within one fiscal year shows an overdue repayment. Then the category of the data group is determined to be the second category. The meaning of the data group belonging to the second category is: if the user corresponding to the data group has no credit record of overdue repayment, the category of the data group is the second category.
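A minimal sketch of this labeling rule, reusing the hypothetical overdue_count field from the record example above:

```python
FIRST_CATEGORY = "overdue"    # at least one overdue record in the group
SECOND_CATEGORY = "normal"    # no overdue record in the group

def label_data_group(group):
    """Return the category of a data group from the records it contains."""
    if any(record["overdue_count"] > 0 for record in group):
        return FIRST_CATEGORY
    return SECOND_CATEGORY
```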
It should be pointed out that the format of the credit records given here is only an illustrative example; the composition, format, and storage type of a credit record should not be construed as a substantive limitation on the embodiments of the present application.
It should also be noted that the categories here should not be narrowly interpreted as only the concrete forms of overdue and non-overdue. For example, in the context of judging the legality of internet text, the distinction between categories can also be the distinction between legal and illegal. It will further be appreciated that the distinction between categories can also be "yes" and "no" in Boolean logic, "0" or "1", "true" or "false", or other specific forms distinguished by a computer. Of course, in some scenarios, categories in non-binary forms are not excluded; for example, there may be three or more categories.
S400: Train a decision tree for classifying data groups, with each data group treated as one training sample.
A decision tree is a tree-shaped decision diagram annotated with probabilistic outcomes; it is an intuitive method of analysis based on statistical probability. A decision tree is a prediction model in machine learning that can represent a mapping between feature attributes and feature values. Referring to Fig. 2, which is a schematic diagram of a decision tree, node 1 of the decision tree contains a feature attribute and a feature value, and may also contain operators such as ">", "<", and "!" (logical NOT). The feature attribute, the feature value, and the corresponding operator can serve as the judgment condition of node 1, and its branch 2 represents the data that satisfy the judgment condition. The leaf node 3 at the tip of the decision tree represents the predicted category to which the data belong.
The key problem of a decision tree is determining the feature attribute and feature value of each node. The nodes of a decision tree can be divided into the root node at the top, the leaf nodes at the tips, and the nodes at each level connected directly or indirectly to the root node and the leaf nodes. The embodiments of the present application aim to classify each data group step by step through the decision tree so that, in the end, the category assigned to each data group agrees with the category to which that data group is known to belong. For this purpose, the embodiments of the present application need to train a decision tree that satisfies the following three requirements:
Requirement one: for the root node of the decision tree, a judgment condition corresponding to the root node must be found such that the data groups can be divided into two sets according to the judgment condition, where all the data groups in one set belong to the first category (this set is called a pure sample set), while the other set contains both data groups belonging to the first category and data groups belonging to the second category (this set is called a mixed sample set);
Requirement two: for each child node of the decision tree, a judgment condition corresponding to the child node must be found such that the data groups in the mixed sample set can be further divided, according to the judgment condition, into a smaller pure sample set and a smaller mixed sample set;
Requirement three: for the leaf nodes at the tips of the decision tree, judgment conditions corresponding to those leaf nodes must be found such that the data groups in the mixed sample set can finally be divided, according to the judgment conditions, into two pure sample sets, where every data group in one pure sample set belongs to the first category and every data group in the other pure sample set belongs to the second category.
A decision tree satisfying the above three requirements is shown in Fig. 2. The decision tree includes root node 1, first branch 2, second branch 4, child node 5, and leaf node 3. Each data group is divided into two sets according to the judgment condition of root node 1. One of the sets, represented by first branch 2, is the pure sample set composed of data groups of the first category that satisfy the judgment condition; the other set, represented by second branch 4, is the mixed sample set composed of data groups of the first category and data groups of the second category that do not satisfy the judgment condition. The mixed sample set composed of data groups of the first category and data groups of the second category is further divided, according to the judgment condition of child node 5, into a pure sample set composed of data groups of the first category that satisfy the judgment condition of child node 5 and a pure sample set composed of data groups of the second category that do not satisfy it. The pure sample sets separated out by the judgment condition of root node 1 and by the judgment conditions of the child nodes at each level appear in the decision tree in the form of leaf nodes 3.
Since each data group is a data group whose category is known, it can be understood that, among all the data groups, the category assigned to some data groups by the decision tree agrees with the category to which these data groups are known to belong. The higher the ratio of the number of such data groups to the number of all data groups, the higher the prediction accuracy of the decision tree, which indicates that the judgment conditions of the nodes of the decision tree are set more reasonably. In addition, for all the data groups, the fewer the levels between the root node and the leaf nodes of the decision tree, the fewer operations the server needs to perform when predicting with the decision tree and the faster the computation, which likewise indicates that the judgment conditions of the nodes of the decision tree are set more reasonably.
It can be envisioned that the more reasonably the judgment condition of the root node is set, the more data groups belonging to the first category are contained in the pure sample set filtered out from all the data groups. It can likewise be envisioned that the more reasonably each level of child nodes under the root node is set, the more data groups belonging to the first category are filtered out from the mixed sample set composed of data groups of the first category and data groups of the second category.
To quantify the influence of the setting of a node's judgment condition on the classification of the data groups, the embodiments of the present application introduce information gain.
The information gain is defined as:
Gain(U, V) = Ent(U) − Ent(U | V).
Here Ent(U) is the information entropy of the information U, and Ent(U | V) is the information entropy of U after the information V appears; information entropy represents the average uncertainty of the information.
The mathematical definition of information entropy is:
Ent(U) = −Σ_i P(u_i) × log2 P(u_i).
Here P(U) is the probability distribution of the information U, the specific values of U are denoted u_i, P(u_i) is the probability that U takes the value u_i, and i ranges over the natural numbers.
After the information V appears, the probability distribution of U becomes P(U | V), and the average uncertainty of the information becomes:
Ent(U | V) = −Σ_i P(u_i | v_i) × log2 P(u_i | v_i).
Here the specific values of U after V appears are (u_i | v_i), the probability that U takes the value (u_i | v_i) is P(u_i | v_i), and i ranges over the natural numbers.
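Purely as an illustration of these formulas, and not as an implementation from the application, the entropy of a set of labeled data groups and the information gain of a candidate judgment condition could be computed as follows; the sketch uses the usual branch-weighted form of the conditional entropy:

```python
import math

def entropy(labels):
    """Information entropy of a list of category labels."""
    total = len(labels)
    ent = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(labels, passes_condition):
    """Gain(U, V) = Ent(U) - Ent(U | V) for a binary judgment condition.

    labels: category of each data group
    passes_condition: parallel list of booleans, True if the group
        satisfies the candidate judgment condition V
    """
    total = len(labels)
    before = entropy(labels)
    after = 0.0
    for branch in (True, False):
        subset = [l for l, p in zip(labels, passes_condition) if p == branch]
        if subset:
            after += len(subset) / total * entropy(subset)
    return before - after
```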
The information gain introduced above is applied in the embodiments of the present application as follows. For each data group u_i, the probability that the data group u_i belongs to the first category is P(u_i). Before the categories of the data groups are predicted, the information entropy of all the data groups is Ent(U) = −Σ_i P(u_i) × log2 P(u_i). After all the data groups are classified using the judgment condition V of the root node, the information entropy of all the data groups can be characterized as Ent(U | V) = −Σ_i P(u_i | v_i) × log2 P(u_i | v_i). The information gain Gain(U, V) = Ent(U) − Ent(U | V) can then be used to quantify the influence of the judgment condition V of the root node on the classification of the data groups. Based on the same reasoning, information gain can be used to quantify the influence of the judgment conditions of the child nodes at each level under the root node on the classification of the data groups. Further, in another embodiment provided by the present application, training the decision tree for classifying data groups specifically includes:
traversing each feature attribute of the nodes of the decision tree and the feature values corresponding to the feature attribute;
taking the feature attribute and the feature value as a first judgment condition and, according to whether each data group satisfies the first judgment condition, assigning all the data groups either to a pure sample set composed of data groups of the first category or to a mixed sample set composed of data groups of the first category and data groups of the second category;
calculating a first information entropy of all the data groups when they are unclassified;
calculating a second information entropy of all the data groups when they are classified according to the first judgment condition;
determining the difference between the first information entropy and the second information entropy as the information gain obtained when all the data groups are classified using the feature attribute and the feature value as the first judgment condition;
selecting the feature attribute and the feature value with the maximum information gain as the judgment condition of the root node of the decision tree.
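A condensed sketch of this root-node selection, reusing the hypothetical information_gain helper above and assuming numeric feature values compared with ">=":

```python
def choose_root_condition(data_groups, labels, candidate_attributes):
    """Pick the (attribute, threshold) pair with the maximum information gain."""
    best, best_gain = None, -1.0
    for attribute in candidate_attributes:
        # every observed value of the attribute is tried as a candidate feature value
        thresholds = {r[attribute] for g in data_groups for r in g}
        for threshold in thresholds:
            passes = [any(r[attribute] >= threshold for r in g) for g in data_groups]
            gain = information_gain(labels, passes)
            if gain > best_gain:
                best, best_gain = (attribute, threshold), gain
    return best, best_gain
```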
Further, in another embodiment provided by the present application, training the decision tree for classifying data groups further includes:
traversing each other feature attribute, apart from the feature attribute of the root node of the decision tree, and the other feature values corresponding to the other feature attributes;
taking the other feature attribute and the other feature value as a second judgment condition and, from the mixed sample set composed of data groups of the first category and data groups of the second category, assigning the data groups that satisfy the second judgment condition to the first category;
calculating a third information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is unclassified;
calculating a fourth information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is classified according to the second judgment condition;
determining the difference between the third information entropy and the fourth information entropy as the information gain obtained when the mixed sample set composed of data groups of the first category and data groups of the second category is classified using the other feature attribute and the other feature value as the second judgment condition;
selecting the other feature attribute and the other feature value with the maximum information gain as the judgment condition of the child node of the root node of the decision tree.
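The same selection can then be applied recursively to the mixed sample set produced at each level until the gain becomes negligible; a sketch of that recursion, under the same assumptions and reusing choose_root_condition from above:

```python
def build_tree(data_groups, labels, attributes, min_gain=1e-6):
    """Recursively split the mixed sample set until the information gain is negligible."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}                    # pure sample set -> leaf node
    condition, gain = choose_root_condition(data_groups, labels, attributes)
    if condition is None or gain <= min_gain:
        return {"leaf": max(set(labels), key=labels.count)}   # majority category
    attribute, threshold = condition
    passes = [any(r[attribute] >= threshold for r in g) for g in data_groups]
    left = [(g, l) for g, l, p in zip(data_groups, labels, passes) if p]
    right = [(g, l) for g, l, p in zip(data_groups, labels, passes) if not p]
    return {
        "condition": condition,
        "satisfied": build_tree([g for g, _ in left], [l for _, l in left], attributes),
        "not_satisfied": build_tree([g for g, _ in right], [l for _, l in right], attributes),
    }
```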
Consider again the credit records of user A listed above. Suppose a data group contains a credit record such as: "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 1 month; overdue repayment count: 1; ...".
This credit record contains multiple feature attributes such as user name, marital status, monthly income, overdue repayment duration, and overdue repayment count.
The server traverses the space of possible decision trees to construct the decision tree; that is, the server builds decision trees using each feature attribute and feature value. This can also be regarded as obtaining the information V, in other words, obtaining the information of a feature attribute and its feature value. The server uses the feature attribute and its feature value as the first judgment condition and, according to whether each data group satisfies the first judgment condition, assigns all the data groups either to the pure sample set composed of data groups of the first category or to the mixed sample set composed of data groups of the first category and data groups of the second category.
In the training set, the categories of all the data groups have already been determined. Before the data are classified, the data groups of the first category and the data groups of the second category are in a mixed state, with the first information entropy Ent(U).
After all the data groups are assigned, according to the first judgment condition, either to the pure sample set composed of data groups of the first category or to the mixed sample set composed of data groups of the first category and data groups of the second category, the data groups in the first category are those with overdue repayment. The mixed sample set composed of data groups of the first category and data groups of the second category still mixes data groups belonging to the first category with data groups belonging to the second category. After all the data groups are classified by the first judgment condition, the whole has the second information entropy Ent(U | V).
The server determines the difference between the first information entropy and the second information entropy as the information gain Gain(U, V) = Ent(U) − Ent(U | V) obtained when all the data groups are classified using the feature attribute and its feature value as the first judgment condition.
The server selects the feature attribute and feature value with the maximum information gain as the judgment condition of the root node of the decision tree.
Then the server traverses each other feature attribute, apart from the feature attribute of the root node of the decision tree, and the other feature values corresponding to the other feature attributes. The server uses the other feature attribute and its corresponding other feature value as the second judgment condition and, from the mixed sample set composed of data groups of the first category and data groups of the second category, assigns the data groups that satisfy the second judgment condition to the first category. Before this further classification, the mixed sample set composed of data groups of the first category and data groups of the second category has the third information entropy; after it is further classified according to the second judgment condition, it has the fourth information entropy. The server determines the difference between the third information entropy and the fourth information entropy as the information gain obtained when the mixed sample set composed of data groups of the first category and data groups of the second category is classified using the other feature attribute and its other feature value as the second judgment condition. The server selects the other feature attribute and its other feature value with the maximum information gain as the judgment condition of the child node of the root node of the decision tree.
The server continues to split in the manner of maximizing information gain, continually generating the child nodes of the root node at each level. When the information gain is zero, or when the information gain is less than a certain value, the splitting of nodes stops and the leaf nodes at the tips of the decision tree are formed. That is, the judgment conditions of the nodes of the decision tree can predict that a data group belongs to a certain leaf node. In general, the data groups that belong to the same leaf node belong to the same category; in some permissive cases, the data groups belonging to the same leaf node are allowed to include a small number of data groups of a different category.
In the prior art, a sample can specifically be a single credit record of a user, for example, "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 1 month; overdue repayment count: 1; ...".
In the embodiments of the present application, a sample can specifically be all the credit records of the same user, for example including the credit record in which user A repaid normally last month and the credit record in which user A repaid late this month. That is, the sample is a data group. The data group can include multiple credit records such as "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 0 months; overdue repayment count: 0; ..." and "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 3 months; overdue repayment count: 1; ...".
In the prior art, using user A's credit record of last month as one sample and user A's credit record of this month as another sample may have opposite effects on the judgment condition of some internal node of the decision tree.
In the embodiments of the present application, as long as the user has ever had an overdue repayment event, the category of the data group is uniquely set to the overdue repayment class, so that the influence of the data group on the judgment conditions of the nodes of the decision tree is unidirectional. This solves the problem that the direction in which a judgment condition should be optimized is uncertain because of the randomness of sample collection, that is, the problem of whether the feature value in a judgment condition should be increased or decreased, so that the judgment conditions are set more reasonably.
In the prior art, suppose user A also has a credit record such as "user name: A; properties: 2; marital status: married; monthly income: 5,000 yuan; overdue repayment duration: 1 month; overdue repayment count: 1; ...". Between this credit record and the previous one, the more significant credit record is usually selected. Specifically, for example, the previous credit record "... overdue repayment duration: 3 months; overdue repayment count: 1; ..." indicates a higher degree of default than this credit record "... overdue repayment duration: 1 month; overdue repayment count: 1; ...", so only the previous credit record is used as a sample for training the decision tree.
In the embodiments of the present application, because the user has had an overdue repayment event, the category of the data group is uniquely set to the overdue repayment class. Using the data group rather than a single credit record as the training sample means that the feature values involved in the judgment conditions of the nodes of the decision tree can be fine-tuned within a certain range, so that the model parameters of the decision tree are set more reasonably. Concretely, the feature value of a node's judgment condition is adaptively increased or decreased, so that when the decision tree is used to predict the category to which data belong, fewer levels of internal nodes need to be traversed and less time is consumed by the prediction.
S500: When a request to predict the category of data to be predicted is received, predict the category of the data to be predicted using the decision tree.
When a request to predict the category of data to be predicted is received, the server predicts the category to which the data to be classified belong using the judgment conditions of the nodes of the decision tree.
For each item of data to be classified, the server first compares it with the feature attribute and feature value in the judgment condition of the root node of the decision tree to determine the branch corresponding to the data. Then, following the feature attributes and feature values in the judgment conditions of the nodes at each level of the decision tree, the data to be classified are finally assigned to a leaf node of the decision tree.
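Under the same assumptions as the earlier sketches, prediction is then a walk from the root node to a leaf node of the tree produced by build_tree:

```python
def predict_category(tree, group):
    """Traverse the trained tree and return the predicted category of one data group."""
    node = tree
    while "leaf" not in node:
        attribute, threshold = node["condition"]
        # follow the branch whose judgment condition the data group satisfies
        branch = "satisfied" if any(r[attribute] >= threshold for r in group) else "not_satisfied"
        node = node[branch]
    return node["leaf"]
```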
In the embodiments of the present application, using a whole data group as one training sample rather than each record in a data group as a separate training sample allows the model parameters of the decision tree used as the operational model to be set more reasonably, thereby improving the accuracy with which the decision tree predicts the category of data. When the model parameters are set reasonably, the prediction accuracy is high when classifying all the data to be predicted, fewer levels of decision tree nodes are traversed, and less prediction time is consumed.
The above is the method provided by the embodiments of the present application for predicting the category to which data belongs using a computer. Based on the same reasoning, and referring to Fig. 3, the present application also provides a device 100 for predicting the category to which data belongs, including:
a receiving module 11, configured to obtain a number of data records whose categories have already been determined;
a packing module 12, configured to pack the records that characterize the same user into one data group;
a preprocessing module 13, configured to determine, for each data group, the category of the data group according to the categories of the records it contains;
a modeling module 14, configured to train a decision tree for classifying data groups, with each data group treated as one training sample;
a prediction module 15, configured to predict, when a request to predict the category of data to be predicted is received, the category of the data to be predicted using the decision tree.
Further, in another embodiment provided by the present application, the preprocessing module 13 is configured to:
when any of the records contained in a data group belongs to the first category, determine that the category of the data group is the first category;
when all the records contained in a data group belong to the second category, determine that the category of the data group is the second category.
Further, in another embodiment provided by the present application, the data are credit data of users;
the first category is the overdue repayment class;
the second category is the normal repayment class.
Further, in another embodiment provided by the present application, the data have feature attributes in multiple dimensions and feature values corresponding to the feature attributes;
the modeling module 14 is configured to:
traverse each feature attribute of the nodes of the decision tree and the feature values corresponding to the feature attribute;
take the feature attribute and the feature value as a first judgment condition and, according to whether each data group satisfies the first judgment condition, assign all the data groups either to a pure sample set composed of data groups of the first category or to a mixed sample set composed of data groups of the first category and data groups of the second category;
calculate a first information entropy of all the data groups when they are unclassified;
calculate a second information entropy of all the data groups when they are classified according to the first judgment condition;
determine the difference between the first information entropy and the second information entropy as the information gain obtained when all the data groups are classified using the feature attribute and the feature value as the first judgment condition;
select the feature attribute and the feature value with the maximum information gain as the judgment condition of the root node of the decision tree.
Further, in another embodiment provided by the present application, the modeling module 14 is further configured to:
traverse each other feature attribute, apart from the feature attribute of the root node of the decision tree, and the other feature values corresponding to the other feature attributes;
take the other feature attribute and the other feature value as a second judgment condition and, from the mixed sample set composed of data groups of the first category and data groups of the second category, assign the data groups that satisfy the second judgment condition to the first category;
calculate a third information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is unclassified;
calculate a fourth information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is classified according to the second judgment condition;
determine the difference between the third information entropy and the fourth information entropy as the information gain obtained when the mixed sample set composed of data groups of the first category and data groups of the second category is classified using the other feature attribute and the other feature value as the second judgment condition;
select the other feature attribute and the other feature value with the maximum information gain as the judgment condition of the child node of the root node of the decision tree.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory and the like, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article, or device. In the absence of further limitation, an element qualified by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above descriptions are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A method for predicting the category to which data belongs, characterized by including:
obtaining a number of data records whose categories have already been determined;
packing the records that characterize the same user into one data group;
for each data group, determining the category of the data group according to the categories of the records it contains;
training a decision tree for classifying data groups, with each data group treated as one training sample;
when a request to predict the category of data to be predicted is received, predicting the category of the data to be predicted using the decision tree.
2. The method according to claim 1, characterized in that determining the category of the data group according to the categories of the records contained in the data group specifically includes:
when any of the records contained in a data group belongs to the first category, determining that the category of the data group is the first category;
when all the records contained in a data group belong to the second category, determining that the category of the data group is the second category.
3. The method according to claim 2, characterized in that the data are credit data of users;
the first category is the overdue repayment class;
the second category is the normal repayment class.
4. The method according to claim 1, characterized in that the data have feature attributes in multiple dimensions and feature values corresponding to the feature attributes;
training the decision tree for classifying data groups specifically includes:
traversing each feature attribute of the nodes of the decision tree and the feature values corresponding to the feature attribute;
taking the feature attribute and the feature value as a first judgment condition and, according to whether each data group satisfies the first judgment condition, assigning all the data groups either to a pure sample set composed of data groups of the first category or to a mixed sample set composed of data groups of the first category and data groups of the second category;
calculating a first information entropy of all the data groups when they are unclassified;
calculating a second information entropy of all the data groups when they are classified according to the first judgment condition;
determining the difference between the first information entropy and the second information entropy as the information gain obtained when all the data groups are classified using the feature attribute and the feature value as the first judgment condition;
selecting the feature attribute and the feature value with the maximum information gain as the judgment condition of the root node of the decision tree.
5. The method according to claim 4, characterized in that training the decision tree for classifying data groups further includes:
traversing each other feature attribute, apart from the feature attribute of the root node of the decision tree, and the other feature values corresponding to the other feature attributes;
taking the other feature attribute and the other feature value as a second judgment condition and, from the mixed sample set composed of data groups of the first category and data groups of the second category, assigning the data groups that satisfy the second judgment condition to the first category;
calculating a third information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is unclassified;
calculating a fourth information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is classified according to the second judgment condition;
determining the difference between the third information entropy and the fourth information entropy as the information gain obtained when the mixed sample set composed of data groups of the first category and data groups of the second category is classified using the other feature attribute and the other feature value as the second judgment condition;
selecting the other feature attribute and the other feature value with the maximum information gain as the judgment condition of the child node of the root node of the decision tree.
6. A device for predicting the category to which data belongs, characterized by including:
a receiving module, configured to obtain a number of data records whose categories have already been determined;
a packing module, configured to pack the records that characterize the same user into one data group;
a preprocessing module, configured to determine, for each data group, the category of the data group according to the categories of the records it contains;
a modeling module, configured to train a decision tree for classifying data groups, with each data group treated as one training sample;
a prediction module, configured to predict, when a request to predict the category of data to be predicted is received, the category of the data to be predicted using the decision tree.
7. The device according to claim 6, characterized in that the preprocessing module is configured to:
when any of the records contained in a data group belongs to the first category, determine that the category of the data group is the first category;
when all the records contained in a data group belong to the second category, determine that the category of the data group is the second category.
8. The device according to claim 7, characterized in that the data are credit data of users;
the first category is the overdue repayment class;
the second category is the normal repayment class.
9. The device according to claim 7, characterized in that the data have feature attributes in multiple dimensions and feature values corresponding to the feature attributes;
the modeling module is configured to:
traverse each feature attribute of the nodes of the decision tree and the feature values corresponding to the feature attribute;
take the feature attribute and the feature value as a first judgment condition and, according to whether each data group satisfies the first judgment condition, assign all the data groups either to a pure sample set composed of data groups of the first category or to a mixed sample set composed of data groups of the first category and data groups of the second category;
calculate a first information entropy of all the data groups when they are unclassified;
calculate a second information entropy of all the data groups when they are classified according to the first judgment condition;
determine the difference between the first information entropy and the second information entropy as the information gain obtained when all the data groups are classified using the feature attribute and the feature value as the first judgment condition;
select the feature attribute and the feature value with the maximum information gain as the judgment condition of the root node of the decision tree.
10. The device according to claim 9, characterized in that the modeling module is further configured to:
traverse each other feature attribute, apart from the feature attribute of the root node of the decision tree, and the other feature values corresponding to the other feature attributes;
take the other feature attribute and the other feature value as a second judgment condition and, from the mixed sample set composed of data groups of the first category and data groups of the second category, assign the data groups that satisfy the second judgment condition to the first category;
calculate a third information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is unclassified;
calculate a fourth information entropy of the mixed sample set composed of data groups of the first category and data groups of the second category when it is classified according to the second judgment condition;
determine the difference between the third information entropy and the fourth information entropy as the information gain obtained when the mixed sample set composed of data groups of the first category and data groups of the second category is classified using the other feature attribute and the other feature value as the second judgment condition;
select the other feature attribute and the other feature value with the maximum information gain as the judgment condition of the child node of the root node of the decision tree.
CN201610153303.6A 2016-03-17 2016-03-17 Method and device for predicting the category to which data belongs Pending CN107203774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610153303.6A CN107203774A (en) Method and device for predicting the category to which data belongs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610153303.6A CN107203774A (en) Method and device for predicting the category to which data belongs

Publications (1)

Publication Number Publication Date
CN107203774A true CN107203774A (en) 2017-09-26

Family

ID=59903937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610153303.6A Pending CN107203774A (en) Method and device for predicting the category to which data belongs

Country Status (1)

Country Link
CN (1) CN107203774A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679777A (en) * 2013-12-02 2015-06-03 中国银联股份有限公司 Method and system for detecting fraudulent trading
CN104798043A (en) * 2014-06-27 2015-07-22 华为技术有限公司 Data processing method and computer system
CN104794195A (en) * 2015-04-17 2015-07-22 南京大学 Data mining method for finding potential telecommunication users changing cell phones

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678531A * 2017-09-30 2018-02-09 广东欧珀移动通信有限公司 Application cleaning method and device, storage medium and electronic equipment
CN107704070A * 2017-09-30 2018-02-16 广东欧珀移动通信有限公司 Application cleaning method and device, storage medium and electronic equipment
US11422831B2 (en) 2017-09-30 2022-08-23 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Application cleaning method, storage medium and electronic device
CN107704070B (en) * 2017-09-30 2020-01-14 Oppo广东移动通信有限公司 Application cleaning method and device, storage medium and electronic equipment
CN109784351A * 2017-11-10 2019-05-21 财付通支付科技有限公司 Data classification method, classification model training method and device
CN107967572A * 2017-12-15 2018-04-27 华中师范大学 Intelligent server based on education big data
CN108109089A * 2017-12-15 2018-06-01 华中师范大学 Education computability method
CN108198268A * 2017-12-19 2018-06-22 江苏极熵物联科技有限公司 Production equipment data calibration method
CN108076224A * 2017-12-21 2018-05-25 广东欧珀移动通信有限公司 Application control method and device, storage medium and mobile terminal
CN108076224B (en) * 2017-12-21 2021-06-29 Oppo广东移动通信有限公司 Application program control method and device, storage medium and mobile terminal
CN108536787A * 2018-03-29 2018-09-14 优酷网络技术(北京)有限公司 Content identification method and device
WO2020042579A1 (en) * 2018-08-27 2020-03-05 平安科技(深圳)有限公司 Group classification method and device, electronic device, and storage medium
WO2020057301A1 (en) * 2018-09-21 2020-03-26 阿里巴巴集团控股有限公司 Method and apparatus for generating decision tree
CN109409419B (en) * 2018-09-30 2021-05-07 北京字节跳动网络技术有限公司 Method and apparatus for processing data
CN109409419A (en) * 2018-09-30 2019-03-01 北京字节跳动网络技术有限公司 Method and apparatus for handling data
WO2020140662A1 (en) * 2019-01-02 2020-07-09 深圳壹账通智能科技有限公司 Data table filling method, apparatus, computer device, and storage medium
CN110033276A * 2019-03-08 2019-07-19 阿里巴巴集团控股有限公司 Security policy generation method, device and equipment for transfers
CN111105266B (en) * 2019-11-11 2023-10-27 建信金融科技有限责任公司 Client grouping method and device based on improved decision tree
CN111105266A (en) * 2019-11-11 2020-05-05 中国建设银行股份有限公司 Client grouping method and device based on improved decision tree
CN111191692B (en) * 2019-12-18 2022-10-14 深圳平安医疗健康科技服务有限公司 Data calculation method and device based on decision tree and computer equipment
CN111191692A (en) * 2019-12-18 2020-05-22 平安医疗健康管理股份有限公司 Data calculation method and device based on decision tree and computer equipment
CN112184292A (en) * 2020-09-16 2021-01-05 中国农业银行股份有限公司河北省分行 Marketing method and device based on artificial intelligence decision tree
CN113822309A (en) * 2020-09-25 2021-12-21 京东科技控股股份有限公司 User classification method, device and non-volatile computer-readable storage medium
CN113822309B (en) * 2020-09-25 2024-04-16 京东科技控股股份有限公司 User classification method, apparatus and non-volatile computer readable storage medium
CN112883962A (en) * 2021-01-29 2021-06-01 北京百度网讯科技有限公司 Fundus image recognition method, device, apparatus, storage medium, and program product
CN112883962B * 2021-01-29 2023-07-18 北京百度网讯科技有限公司 Fundus image recognition method, fundus image recognition apparatus, fundus image recognition device, and fundus image recognition program

Similar Documents

Publication Publication Date Title
CN107203774A (en) The method and device that the belonging kinds of data are predicted
CN107230108A The processing method and processing device of business data
CN105718490A (en) Method and device for updating classifying model
Park et al. Explainability of machine learning models for bankruptcy prediction
Li et al. Research and application of random forest model in mining automobile insurance fraud
CN107633455A (en) Credit estimation method and device based on data model
CN107633030A (en) Credit estimation method and device based on data model
EP3613003B1 (en) System and method for managing detection of fraud in a financial transaction system
CN112232833A (en) Lost member customer group data prediction method, model training method and model training device
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
Dbouk et al. Towards a machine learning approach for earnings manipulation detection
CN112365339A (en) Method for judging commercial value credit loan amount of small and medium-sized enterprises
Prorokowski Validation of the backtesting process under the targeted review of internal models: practical recommendations for probability of default models
CN107424026A (en) Businessman's reputation evaluation method and device
CN112950350B (en) Loan product recommendation method and system based on machine learning
KR102499182B1 Loan regular auditing system using artificial intelligence
Qiang et al. Relationship model between human resource management activities and performance based on LMBP algorithm
Pang et al. Wt model & applications in loan platform customer default prediction based on decision tree algorithms
CN111461932A (en) Administrative punishment discretion rationality assessment method and device based on big data
CN113641725A (en) Information display method, device, equipment and storage medium
Rodriguez et al. MobilityMirror: Bias-adjusted transportation datasets
CN113011748A (en) Recommendation effect evaluation method and device, electronic equipment and readable storage medium
Cope Modeling operational loss severity distributions from consortium data
CN112150276A (en) Training method, using method, device and equipment of machine learning model
CN112232945A (en) Method and device for determining personal customer credit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170926