CN109146569A

CN109146569A - A decision-tree-based method for predicting communication user account cancellation

Info

Publication number: CN109146569A
Application number: CN201810998919.2A
Authority: CN
Inventors: 龙华; 王瑞; 邵玉斌; 杜庆治
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2019-01-04

Abstract

The present invention relates to a decision-tree-based method for predicting communication user account cancellation, and belongs to the field of artificial intelligence. The method first calculates the information entropy of the class label attribute, the entropy of the subsets produced by partitioning on each attribute, and the resulting information gain, then sorts the attributes by information gain to obtain the attribute with the maximum information gain. Next, the Bayesian formula is used to judge the weight of each attribute value in the training data set. Finally, a node is created for the attribute with the maximum information gain and labeled with that attribute, a branch is created for each value of the attribute, and the attribute value with the largest weight is connected to the next attribute. The decision tree built in this way is used to establish a customer churn early-warning model.

Description

A decision-tree-based method for predicting communication user account cancellation

Technical field

The present invention relates to a decision-tree-based method for predicting communication user account cancellation, and belongs to the field of artificial intelligence.

Background technique

At present, as the communications industry continues to develop, competition among the major operators is becoming increasingly fierce, and the loss of mobile customers has long received close attention from carriers. In the new mobile Internet environment, operators face not only competition within the carrier industry but also external competition from the Internet: the mobile instant messaging tools that have emerged in this new era are gradually weakening users' dependence on the services that operators provide.

Because the communication plans currently on the market are numerous in type, evaluation criteria are limited, and user history data are incomplete, customer churn early warning places high demands on the algorithm. Conventional methods mainly comprise fundamental analysis and technical analysis, which analyze customer churn through market factors such as supply and demand and through statistical analysis, respectively. With these methods prediction is difficult and the accuracy of the results is low.

The rapid development of data mining and big data provides substantial technical support for predicting operator customer churn. Working from user feature data, an operator can establish a targeted customer churn early-warning mechanism, analyze the factors behind customer churn, formulate precise marketing strategies, make corresponding adjustments to weak links, and improve its market competitiveness.

Summary of the invention

The technical problem to be solved by the present invention is to provide a decision-tree-based method for predicting communication user account cancellation that addresses the problems described above.

The technical solution of the present invention is a decision-tree-based method for predicting communication user account cancellation, with the following specific steps:

Step1, data acquisition: the basic information and consumption behavior of sample communication users are placed into the training data set S;

The basic information of a communication user includes: user number, attribute A (user age), attribute B (gender), attribute C (account opening time), attribute D (customer grade), and attribute E (monthly consumption charge);

The consumption behavior of a user includes: attribute F (call duration), attribute G (data usage), attribute H (SMS usage), and attribute J (value-added service usage);

Step2, data processing: the data of each attribute in S are divided into categories;

Step3: the class label feature values are divided into n classes, where the class label feature has t values in total and t_u is the number of samples contained in each class. For a given class label feature value, the information entropy is defined as shown in formula (1):

Wherein

Any one of the attributes A, B, C, D, E, F, G, H, J is taken from S, and the subset it forms is denoted S_k (k = A, B, C, D, E, F, G, H, J). Within subset S_k, the samples are divided by feature category into classes S_kj (j = 1, ..., v), where each class has S_kij (i = 1, ..., m) values;

The information entropy of each category is then obtained from the category values:
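For reference, the two entropies named in Step3 can be written out under the standard Shannon-entropy definition used in ID3-style decision trees. This is a sketch under that assumption rather than the patent's own formulas: the probabilities p_u and p_ij are symbols introduced here for illustration, and t is read as the total number of samples.

```latex
% Formula (1), assumed form: entropy of the class label over n classes
I(T_1, T_2, \ldots, T_n) = -\sum_{u=1}^{n} p_u \log_2 p_u,
\qquad p_u = \frac{t_u}{t}

% Formula (2), assumed form: entropy inside the j-th category of attribute k
I(S_{k1j}, S_{k2j}, \ldots, S_{kmj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij},
\qquad p_{ij} = \frac{S_{kij}}{\sum_{i'=1}^{m} S_{ki'j}}
```

The left-hand side of the first expression matches the term I(T₁, T₂, ..., T_n) that appears in formula (4).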

Step4: the entropy of the subsets produced by partitioning on each attribute is calculated as shown in formula (3):

Step5: information gain measures the expected reduction in entropy; the information gain obtained by selecting attribute k to partition S is given by formula (4):

Gain(k) = I(T₁, T₂, ..., T_n) - Ent(k) (4)

Gain(k) represents the expected reduction in entropy once attribute k is known;
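The calculation in Step4 and Step5 can be sketched in Python as follows. This is a minimal illustration, assuming that Ent(k) is the sample-weighted average of the entropies of the subsets produced by attribute k, as in ID3; the function names and the toy data are hypothetical and do not come from the patent.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def partition_entropy(values, labels):
    """Ent(k): weighted entropy of the subsets produced by one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(values, labels):
    """Gain(k) = I(T_1, ..., T_n) - Ent(k), as in formula (4)."""
    return entropy(labels) - partition_entropy(values, labels)


# Hypothetical toy data: gender (attribute B) against a churn label.
gender = ["male", "male", "female", "female"]
perfect_gain = information_gain(gender, ["yes", "yes", "no", "no"])
useless_gain = information_gain(gender, ["yes", "no", "yes", "no"])
```

In this toy case an attribute that separates the two classes completely recovers the full entropy of the label, while an attribute unrelated to the label yields no gain.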

Step6: the Bayesian formula is used, where (k = A, B, C, D, E, F, G, H, J), to judge the weight of each attribute value in the training data set;
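One way to read the weight judgement in Step6 is sketched below. It assumes, as an interpretation rather than a statement of the patent's formula, that the weight of an attribute value is the posterior probability of churn given that value, computed with Bayes' rule. The function name, the label values, and the toy data are all hypothetical.

```python
def bayes_weight(values, labels, value, positive="yes"):
    """Posterior P(label == positive | attribute == value) via Bayes' rule."""
    n = len(labels)
    n_pos = sum(1 for y in labels if y == positive)
    n_val = sum(1 for v in values if v == value)
    n_both = sum(1 for v, y in zip(values, labels) if v == value and y == positive)
    if n_pos == 0 or n_val == 0:
        return 0.0
    prior = n_pos / n             # P(churn)
    likelihood = n_both / n_pos   # P(value | churn)
    evidence = n_val / n          # P(value)
    return likelihood * prior / evidence


# Hypothetical toy data: customer grade (attribute D) against a churn label.
grade = ["1-star", "1-star", "1-star", "5-star", "5-star"]
churned = ["yes", "yes", "no", "no", "no"]
weights = {v: bayes_weight(grade, churned, v) for v in set(grade)}
heaviest = max(weights, key=weights.get)
```

Under this reading, the value with the largest weight is the one whose users are most likely to churn, which is the branch that Step7 extends.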

Step7, building the decision tree: the attributes are sorted by information gain to obtain the attribute with the maximum information gain; a node is created and labeled with this attribute, and a branch is created for each value of the attribute; the attribute value with the largest weight is connected to the next attribute;

Step8: a customer churn early-warning model is established from the constructed decision tree.
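The construction in Step7 can be sketched as below. Several points are assumptions made for illustration and are not stated in the patent: attributes are ranked once on the full training set, the weight of a value is taken to be its share of churned samples, and branches other than the heaviest end in a majority-class leaf. All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, labels, attr):
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    ent_k = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - ent_k


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def grow(rows, labels, order, positive):
    """Create a node for order[0]; only the heaviest branch grows further."""
    if not order or len(set(labels)) == 1:
        return majority(labels)
    best, rest = order[0], order[1:]
    subsets = {}
    for row, y in zip(rows, labels):
        sub_rows, sub_labels = subsets.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(y)
    # Assumed weight: share of positive (churned) samples for each value.
    weight = {v: ys.count(positive) / len(ys) for v, (_, ys) in subsets.items()}
    heaviest = max(weight, key=weight.get)
    branches = {}
    for value, (sub_rows, sub_labels) in subsets.items():
        if value == heaviest:
            branches[value] = grow(sub_rows, sub_labels, rest, positive)
        else:
            branches[value] = majority(sub_labels)  # assumed leaf rule
    return {best: branches}


def build_tree(rows, labels, attrs, positive="yes"):
    # Rank attributes once by information gain, largest first.
    order = sorted(attrs, key=lambda a: gain(rows, labels, a), reverse=True)
    return grow(rows, labels, order, positive)


# Hypothetical toy data: two attributes and a churn label.
pairs = [("1-star", "male"), ("1-star", "male"), ("1-star", "male"),
         ("1-star", "female"), ("1-star", "female"),
         ("5-star", "male"), ("5-star", "female"), ("5-star", "female")]
rows = [{"grade": g, "gender": s} for g, s in pairs]
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
tree = build_tree(rows, labels, ["grade", "gender"])
```

The result is a nested dictionary in which each key is an attribute label and each inner key is one branch value. In the toy data, gender has the larger information gain and becomes the root, and its heaviest value, male, connects to the grade node.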

Further, in Step3, the more balanced the probability distribution of the samples, the larger the information entropy and the more mixed the sample set. Information entropy serves as a measure of the purity of the training set: the smaller the entropy, the higher the purity.
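The relationship between class balance and entropy can be checked with a small Python example. It is a sketch that assumes the standard base-2 Shannon entropy over per-class sample counts; the function name and the counts are hypothetical.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits from per-class sample counts; empty classes are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


balanced = entropy_from_counts([5, 5])   # evenly mixed set
skewed = entropy_from_counts([9, 1])     # mostly one class
pure = entropy_from_counts([10, 0])      # a single class only
```

The evenly mixed set gives 1 bit, the skewed set gives roughly 0.47 bits, and the pure set gives 0, so entropy falls as purity rises.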

Further, in Step5, Gain(k) represents the expected reduction in entropy once attribute k is known. A smaller information entropy indicates a purer node. By the definition of information gain, a larger information gain means a larger reduction in information entropy and a purer node; thus the larger Gain(k) is, the more information the test attribute k provides for classification.

The beneficial effects of the present invention are as follows: it addresses the difficulty that traditional data analysis tools have in processing data in depth. By combining the decision tree algorithm from data mining with the Bayesian formula, the method processes massive, complicated, and disordered data, extracts communication user data of potential value, and uses it to give early warning of communication user churn, improving prediction accuracy and increasing the market competitiveness of communication operators.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the present invention.

Specific embodiment

The invention is further described below with reference to the accompanying drawing and a specific embodiment.

Embodiment 1: as shown in Fig. 1, a decision-tree-based method for predicting communication user account cancellation first calculates the information entropy of the class label attribute, the entropy of the subsets produced by partitioning on each attribute, and the resulting information gain, then sorts the attributes by information gain to obtain the attribute with the maximum information gain. Next, the Bayesian formula is used to judge the weight of each attribute value in the training data set. Finally, a node is created for the attribute with the maximum information gain and labeled with that attribute, a branch is created for each value of the attribute, and the attribute value with the largest weight is connected to the next attribute. The decision tree built in this way is used to establish a customer churn early-warning model.

Specific steps are as follows:

Step1, data acquisition: the basic information and consumption behavior of sample communication users are placed into the training data set S;

The consumption behavior of a user includes: attribute F, call duration (min/month); attribute G, data usage (GB/month); attribute H, SMS usage (messages/month); attribute J, value-added service usage (yuan/month);

Step2, data processing: the data of each attribute in S are divided into categories;

Specifically, attribute A is divided by user age into the following classes (years): ≤10, ≤18, ≤40, ≤60, >60

Attribute B is divided by gender into the following two classes: male, female

Attribute C is divided by account opening time into the following six classes (years): ≤3, ≤5, ≤10, ≤15, ≤20, >20

Attribute D, customer grade, is divided into the following five classes: one-star user, two-star user, three-star user, four-star user, five-star user

Attribute E is divided by monthly consumption charge into the following five classes (yuan/month): ≤50, ≤100, ≤150, ≤200, >200

Attribute F is divided by call duration into the following five classes (minutes/month): ≤300, ≤500, ≤1000, ≤1500, >2000

Attribute G, data usage, is divided into the following seven classes (GB/month): ≤5, ≤10, ≤20, ≤30, ≤40, ≤50, >50

Attribute H, SMS usage, is divided into the following classes (messages/month): ≤100, ≤300, ≤500, ≤1000, >1000

The division can be adjusted according to the actual situation; the division rules are not limited to those above;
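The discretization in Step2 can be sketched in Python as follows. The upper bounds are copied from the classes listed for attributes A and E; the function name and the label format are hypothetical choices made for illustration.

```python
from bisect import bisect_left

# Upper bounds taken from the embodiment; values above the last bound fall
# into the final open-ended class.
AGE_BOUNDS = [10, 18, 40, 60]        # attribute A, years
FEE_BOUNDS = [50, 100, 150, 200]     # attribute E, yuan/month


def bin_label(value, bounds):
    """Return the class label ("<=b" or ">b") for a numeric value."""
    idx = bisect_left(bounds, value)  # first bound that is >= value
    if idx < len(bounds):
        return f"<={bounds[idx]}"
    return f">{bounds[-1]}"


age_class = bin_label(35, AGE_BOUNDS)
fee_class = bin_label(120, FEE_BOUNDS)
```

Keeping the bounds in plain lists makes it straightforward to swap in a different division, in line with the note that the rules can be adjusted to the actual situation.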

Wherein

Any one of the attributes A, B, C, D, E, F, G, H, J is taken from S, and the subset it forms is denoted S_k (k = A, B, C, D, E, F, G, H, J). Within subset S_k, the samples are divided by feature category into classes S_kj (j = 1, ..., v), where each class has S_kij (i = 1, ..., m) values;

Gain(k) = I(T₁, T₂, ..., T_n) - Ent(k) (4)

The embodiment of the present invention has been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiment; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the present invention.

Claims

1. A decision-tree-based method for predicting communication user account cancellation, characterized by:

Step1, data acquisition: the basic information and consumption behavior of sample communication users are placed into the training data set S;

The basic information of a communication user includes: user number, attribute A (user age), attribute B (gender), attribute C (account opening time), attribute D (customer grade), and attribute E (monthly consumption charge);

The consumption behavior of a user includes: attribute F (call duration), attribute G (data usage), attribute H (SMS usage), and attribute J (value-added service usage);

Step3: the class label feature values are divided into n classes, where the class label feature has t values in total and t_u is the number of samples contained in each class. For a given class label feature value, the information entropy is defined as shown in formula (1):

Wherein

Any one of the attributes A, B, C, D, E, F, G, H, J is taken from S, and the subset it forms is denoted S_k (k = A, B, C, D, E, F, G, H, J). Within subset S_k, the samples are divided by feature category into classes S_kj (j = 1, ..., v), where each class has S_kij (i = 1, ..., m) values;

Step5: information gain measures the expected reduction in entropy; the information gain obtained by selecting attribute k to partition S is given by formula (4):

Gain(k) = I(T₁, T₂, ..., T_n) - Ent(k) (4)

Step6: the Bayesian formula is used, where (k = A, B, C, D, E, F, G, H, J), to judge the weight of each attribute value in the training data set;

Step7, building the decision tree: the attributes are sorted by information gain to obtain the attribute with the maximum information gain; a node is created and labeled with this attribute, and a branch is created for each value of the attribute; the attribute value with the largest weight is connected to the next attribute;

2. The decision-tree-based method for predicting communication user account cancellation according to claim 1, characterized in that: in Step3, the more balanced the probability distribution of the samples, the larger the information entropy and the more mixed the sample set; information entropy serves as a measure of the purity of the training set, and the smaller the entropy, the higher the purity.

3. The decision-tree-based method for predicting communication user account cancellation according to claim 1, characterized in that: in Step5, Gain(k) represents the expected reduction in entropy once attribute k is known; a smaller information entropy indicates a purer node; by the definition of information gain, a larger information gain means a larger reduction in information entropy and a purer node, so the larger Gain(k) is, the more information the test attribute k provides for classification.