CN113177585A - User classification method and device, electronic equipment and storage medium - Google Patents

User classification method and device, electronic equipment and storage medium

Publication number
CN113177585A
Authority
CN
China
Prior art keywords
user
variable
value
model
user characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110444073.XA
Other languages
Chinese (zh)
Other versions
CN113177585B (en)
Inventor
张雯倩
刘慈文
李晓晓
常远芳
吴梦瑶
文芷晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaotu Network Technology Co ltd
Original Assignee
Shanghai Xiaotu Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaotu Network Technology Co ltd filed Critical Shanghai Xiaotu Network Technology Co ltd
Priority to CN202110444073.XA priority Critical patent/CN113177585B/en
Publication of CN113177585A publication Critical patent/CN113177585A/en
Application granted granted Critical
Publication of CN113177585B publication Critical patent/CN113177585B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a user classification method, a user classification device, electronic equipment and a storage medium, and belongs to the technical field of computers. The method comprises: obtaining a plurality of pieces of user behavior information corresponding to users to be classified and a user characteristic variable corresponding to each piece of user behavior information; converting the user characteristic variables into first model-entering variables according to a preset conversion rule; inputting each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; binning the plurality of behavior characteristic scores, and calculating a WOE value corresponding to each bin; and inputting the WOE value corresponding to each bin into a scoring card model to obtain the user categories of the users to be classified. As a result, the deviation in evaluating user behaviors in different periods is small, and the repayment willingness of the users is effectively evaluated.

Description

User classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a user classification method and apparatus, an electronic device, and a storage medium.
Background
Most existing user classification methods build a scoring card model from data such as basic user information and user borrowing behavior, and classify the repayment type of a user according to the score given by the scoring card. Before modeling, the user information variables are processed by screening and binning: screening is mostly based on the IV (information value) of each variable, and binning is mostly done by equal-frequency binning, equal-width binning, WOE (weight of evidence) binning and the like. During modeling, a single machine learning model such as logistic regression or a decision tree is mostly used. The results output by this kind of variable processing and model building are relatively coarse-grained, and the classification of user behaviors in different periods shows a large deviation.
Disclosure of Invention
Embodiments of the present invention provide a user classification method, an apparatus, an electronic device, and a storage medium, so as to solve the problem that the granularity of results output by the variable processing method and the model building method is relatively coarse, and the classification of user behaviors at different periods has a large deviation. The specific technical scheme is as follows:
in a first aspect, a user classification method is provided, including:
acquiring a plurality of user behavior information of a user to be classified, and determining a user characteristic variable based on the user behavior information;
converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
inputting each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior information, so that each sub-model scores the first model-entering variable to obtain a plurality of behavior characteristic scores;
binning the plurality of behavior characteristic scores, and calculating a WOE value corresponding to each bin;
and inputting the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
Optionally, the converting each user characteristic variable into a first model-entering variable according to a preset conversion rule includes:
calculating the IV value of each user characteristic variable, sorting the IV values, and selecting the several user characteristic variables with the largest IV values as first input variables;
ranking the importance of each user characteristic variable by using a random forest model, and selecting, from the several user characteristic variables with the highest importance, the user characteristic variables different from the first input variables as second input variables;
ranking the importance of each user characteristic variable by using a LightGBM model, and selecting, from the several user characteristic variables with the highest importance, the user characteristic variables different from the second input variables as third input variables;
and de-duplicating the first input variables, the second input variables and the third input variables to obtain the first model-entering variables.
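The three selection rounds and the final de-duplication can be sketched as follows. This is only an illustrative sketch: the function names, the parameter `k` and the sample scores are hypothetical, and the random forest and LightGBM importances are assumed to have been computed elsewhere and passed in as plain dictionaries.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k variable names with the largest scores (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def select_model_entering_variables(
    iv: Dict[str, float],
    rf_importance: Dict[str, float],
    lgbm_importance: Dict[str, float],
    k: int = 2,
) -> List[str]:
    # Round 1: variables with the largest IV values.
    first = top_k(iv, k)
    # Round 2: top random forest variables that differ from the first input variables.
    second = [v for v in top_k(rf_importance, k) if v not in first]
    # Round 3: top LightGBM variables that differ from the second input variables.
    third = [v for v in top_k(lgbm_importance, k) if v not in second]
    # Merge and de-duplicate while keeping the selection order.
    merged: List[str] = []
    for name in first + second + third:
        if name not in merged:
            merged.append(name)
    return merged


# Hypothetical scores, for illustration only.
iv = {"repayment_amount": 0.40, "repayment_count": 0.30,
      "borrowing_amount": 0.10, "borrowing_count": 0.05}
rf = {"repayment_amount": 0.5, "borrowing_amount": 0.3,
      "repayment_count": 0.1, "borrowing_count": 0.1}
lgbm = {"borrowing_count": 0.6, "repayment_amount": 0.2,
        "borrowing_amount": 0.1, "repayment_count": 0.1}
selected = select_model_entering_variables(iv, rf, lgbm, k=2)
print(selected)
```

Because the third round only excludes the second input variables, a variable already chosen in the first round can reappear there, which is why the final de-duplication step is still needed.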
Optionally, the calculating an IV value of each of the user characteristic variables includes:
binning each user characteristic variable to obtain a plurality of bins, wherein each bin comprises one or more user characteristic variables;
calculating the WOE value of the user characteristic variable in each bin;
and calculating the IV value of each user characteristic variable from the WOE value of the user characteristic variable in each bin.
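A minimal sketch of these three steps is given below. It assumes equal-frequency binning, labels of 1 for good users and 0 for bad users, the common definition WOE = ln(py / pn), and the common convention of summing the per-bin IV contributions into one IV value per variable. The function names and sample data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[List[int]]:
    """Split sample indices into n_bins groups of near-equal size, ordered by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size, extra = divmod(len(order), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(order[start:end])
        start = end
    return bins


def woe_and_iv(values: Sequence[float], labels: Sequence[int],
               n_bins: int = 2) -> Tuple[List[float], float]:
    """Return the WOE value of each bin and the summed IV value of the variable."""
    y_total = sum(labels)              # all good users in the sample
    n_total = len(labels) - y_total    # all bad users in the sample
    woes: List[float] = []
    iv = 0.0
    for idx in equal_frequency_bins(values, n_bins):
        y_i = sum(labels[i] for i in idx)
        n_i = len(idx) - y_i
        if y_i == 0 or n_i == 0:
            # Assumption: bins without both classes are rejected rather than smoothed.
            raise ValueError("each bin needs at least one good and one bad user")
        py, pn = y_i / y_total, n_i / n_total
        woe = math.log(py / pn)
        woes.append(woe)
        iv += (py - pn) * woe
    return woes, iv


# Hypothetical sample: one variable observed for eight users.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
woes, iv_value = woe_and_iv(values, labels, n_bins=2)
print(woes, iv_value)
```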
Optionally, the calculating an IV value of each of the user characteristic variables further includes:
for each user characteristic variable, acquiring a first related characteristic variable associated with the user characteristic variable;
and combining the user characteristic variables and the first related characteristic variables to obtain new user characteristic variables, and binning the new user characteristic variables to obtain a plurality of bins.
Optionally, before the inputting the first model-entering variable into the behavior classification model, the method further includes:
inputting the first model-entering variable into an anti-fraud model to obtain an anti-fraud intention value;
comparing the anti-fraud intention value with a preset fraud intention value;
and if the anti-fraud intention value is smaller than or equal to the preset fraud intention value, executing the step of inputting the first model-entering variable into the behavior classification model.
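The gating logic of this optional step can be sketched as below. The two model arguments are stand-in callables rather than real models, and the function name, the default threshold of 0.80 (taken from the 80% example value mentioned later in the description) and the sample values are assumptions for illustration.

```python
from typing import Callable, Dict, Optional

Variables = Dict[str, float]


def classify_if_not_fraud(
    variables: Variables,
    anti_fraud_model: Callable[[Variables], float],
    behavior_model: Callable[[Variables], Dict[str, float]],
    preset_fraud_value: float = 0.80,
) -> Optional[Dict[str, float]]:
    """Run the behavior classification model only if the user passes the anti-fraud check."""
    fraud_value = anti_fraud_model(variables)
    if fraud_value <= preset_fraud_value:
        return behavior_model(variables)
    # The user is filtered out and never reaches the behavior classification model.
    return None


# Stand-in models with fixed outputs, for illustration only.
variables = {"repayment_amount": 50.0, "repayment_count": 5.0}
passed = classify_if_not_fraud(variables, lambda v: 0.30, lambda v: {"common_debt": 0.15})
blocked = classify_if_not_fraud(variables, lambda v: 0.95, lambda v: {"common_debt": 0.15})
print(passed, blocked)
```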
Optionally, the method further comprises:
judging whether the user category of the user to be classified is in a preset user category set or not;
if the user category of the user to be classified is located in a preset user category set, outputting the user category of the user to be classified;
if the user category of the user to be classified is outside the preset user category set, inputting the user characteristic variables into an auxiliary model so that the auxiliary model performs binning on the user characteristic variables;
judging whether any user characteristic variable in the bins lacks a corresponding variable value;
and if no user characteristic variable lacks a corresponding variable value, executing the step of converting each user characteristic variable into a first model-entering variable according to a preset conversion rule.
Optionally, the method further comprises:
if some user characteristic variables lack corresponding variable values, acquiring the missing variable values in each bin, establishing a bin containing the missing variable values, and calculating the WOE value of the bin containing the missing variable values;
if the WOE value of the bin containing the missing variable values is smaller than or equal to a preset target bin WOE value, assigning a preset value to the missing variable values, and executing the step of converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
if the WOE value of the bin containing the missing variable values is larger than the preset target bin WOE value, acquiring second related characteristic variables, wherein the second related characteristic variables are associated with each user characteristic variable that lacks a corresponding variable value, binning the second related characteristic variables to obtain a plurality of bins, each bin comprising one or more second related characteristic variables of the user characteristic variables, and executing the step of judging whether any user characteristic variable in the bins lacks a corresponding variable value.
In a second aspect, an apparatus for classifying a user is provided, including:
an acquisition module, configured to acquire a plurality of pieces of user behavior information corresponding to users to be classified and a user characteristic variable corresponding to each piece of user behavior information;
a conversion module, configured to convert the user characteristic variables into first model-entering variables according to a preset conversion rule;
a first model module, configured to input each first model-entering variable into a corresponding sub-model in the behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores;
a computing module, configured to bin the plurality of behavior characteristic scores and calculate a WOE value corresponding to each bin;
and a second model module, configured to input the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
Optionally, a conversion module comprising:
a first selection unit, configured to calculate the IV value of each user characteristic variable, sort the IV values, and select the several user characteristic variables with the largest IV values as first input variables;
a second selection unit, configured to rank the importance of each user characteristic variable by using a random forest model, and select, from the several user characteristic variables with the highest importance, the user characteristic variables different from the first input variables as second input variables;
a third selection unit, configured to rank the importance of each user characteristic variable by using a LightGBM model, and select, from the several user characteristic variables with the highest importance, the user characteristic variables different from the second input variables as third input variables;
and a merging unit, configured to de-duplicate the first input variables, the second input variables and the third input variables to obtain the first model-entering variables.
Optionally, the first selecting unit includes:
a first execution unit, configured to bin the user characteristic variables to obtain a plurality of bins, each bin comprising one or more user characteristic variables;
a second execution unit, configured to calculate the WOE value of the user characteristic variable in each bin;
and a third execution unit, configured to calculate the IV value of each user characteristic variable from the WOE value of each user characteristic variable in each bin.
Optionally, the first selecting unit further includes:
a fourth execution unit, configured to, for each user characteristic variable, obtain a first relevant characteristic variable associated with the user characteristic variable;
and the fifth execution unit is used for combining the user characteristic variable and the first related characteristic variable to obtain a new user characteristic variable and executing the first execution unit.
Optionally, the apparatus further comprises:
an input unit of a third model module, configured to input the first model-entering variable into an anti-fraud model to obtain an anti-fraud intention value;
a comparison unit of the third model module, configured to compare the anti-fraud intention value with a preset fraud intention value;
and a control unit of the third model module, configured to execute the first model module if the anti-fraud intention value is less than or equal to the preset fraud intention value.
Optionally, the apparatus further comprises:
a fourth model module, configured to judge whether the user category of the user to be classified is in a preset user category set;
the fourth model module being further configured to output the user category of the user to be classified if the user category of the user to be classified is in the preset user category set;
a sixth execution unit of the fourth model module, configured to input the user characteristic variables into an auxiliary model if the user category of the user to be classified is outside the preset user category set, so that the auxiliary model performs binning on the user characteristic variables;
a second judgment unit of the fourth model module, configured to judge whether any user characteristic variable in the bins lacks a corresponding variable value;
and the fourth model module being further configured to execute the conversion module if no user characteristic variable lacks a corresponding variable value.
Optionally, the apparatus further comprises:
a seventh execution unit of the fourth model module, configured to, if some user characteristic variables lack corresponding variable values, acquire the missing variable values in each bin, establish a bin containing the missing variable values, and calculate the WOE value of the bin containing the missing variable values;
an eighth execution unit of the fourth model module, configured to assign a preset value to the missing variable values if the WOE value of the bin containing the missing variable values is less than or equal to a preset target bin WOE value, and execute the conversion module;
a ninth execution unit of the fourth model module, configured to, if the WOE value of the bin containing the missing variable values is greater than the preset target bin WOE value, acquire second related characteristic variables, where the second related characteristic variables are associated with each user characteristic variable that lacks a corresponding variable value, bin the second related characteristic variables to obtain a plurality of bins, where each bin includes one or more second related characteristic variables of the user characteristic variables, and execute the second judgment unit of the fourth model module.
In a third aspect, an electronic device is provided, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the first aspect when executing a program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, carries out the method steps of any of the first aspects.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a user classification method, a user classification device, electronic equipment and a storage medium, wherein a plurality of pieces of user behavior information corresponding to a user to be classified and a user characteristic variable corresponding to each piece of user behavior information are obtained; the user characteristic variables are converted into first model-entering variables according to a preset conversion rule; each first model-entering variable is input into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; the plurality of behavior characteristic scores are binned, and a WOE value corresponding to each bin is calculated; and the WOE value corresponding to each bin is input into a scoring card model to obtain the user category of the user to be classified.
Because the behavior classification model scores the users through different sub-models, the output behavior characteristic scores are relatively fine-grained, and the deviation of the scoring card model in evaluating user behaviors in different periods is small.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a user classification method according to an embodiment of the present invention;
Fig. 2 is a flowchart of step S102 in Fig. 1;
Fig. 3 is a schematic structural diagram of a user classifying device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing user classification methods mostly build a scoring card model from data such as basic user information and user borrowing behavior, and classify users according to the score given by the scoring card. Before modeling, the user information variables are processed by screening and binning: screening is mostly based on the IV value of each variable, and binning is mostly done by equal-frequency binning, equal-width binning, WOE binning and the like. During modeling, a single machine learning model such as logistic regression or a decision tree is mostly used. The results output by this kind of variable processing and model building are relatively coarse-grained, and the classification of user behaviors in different periods shows a large deviation.
Therefore, the invention provides a user classification method which can be applied to financial risk-control products and can classify the repayment willingness of users. Embodiments of the invention involve multiple models: an anti-fraud model, a behavior classification model, a scoring card model, an auxiliary model, a channel level model, an auxiliary algorithm model and the like, wherein the behavior classification model comprises a plurality of sub-models.
For an input user characteristic variable, the anti-fraud model outputs a fraud intention value. If the fraud intention value is higher than a preset fraud intention value, the user corresponding to the user characteristic variable is determined to be a user with strong fraud intention, so users with strong fraud intention can be filtered out by the anti-fraud model. The preset fraud intention value is, for example, 80%. The user characteristic variable is determined based on user behavior information. User behavior information refers to behavior information recorded, with the user's authorization, during the user's historical transaction activities, and may be application information, borrowing information, repayment information and the like of a transaction. Each user characteristic variable corresponds to one piece of user behavior information. For example, if the user behavior information is "borrowed 2 times in month 1", "borrowed 3 times in month 2" and "borrowed 5 times in month 3", the user characteristic variable is "number of borrowings per month".
The behavior classification model scores users through different sub-models. According to the user behavior corresponding to the user behavior information, each first model-entering variable is input into the corresponding sub-model in the behavior classification model, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores. The first model-entering variables are obtained by converting the user characteristic variables according to a preset conversion rule. The preset conversion rule is as follows: first, the IV values of the user characteristic variables are calculated and sorted, and the several user characteristic variables with the largest IV values are selected; next, the user characteristic variables are input into a random forest model and a LightGBM model respectively to obtain the corresponding importance rankings, and the several most important, non-repeated user characteristic variables are selected; finally, the user characteristic variables selected in the three rounds are merged and de-duplicated.
The IV value of the user characteristic variable refers to the information quantity or the information value of the user characteristic variable, and is used for screening the user characteristic variable when a model is constructed, and the calculation formula of the IV value is as follows:
$IV_i = (py_i - pn_i) \cdot WOE_i$
where $i$ denotes the $i$-th group and $WOE_i$ denotes the WOE value of the $i$-th group.
The WOE value is calculated as follows:
$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$
where $i$ denotes the $i$-th group; $py_i$ is the proportion of good users in the group (the individuals whose variable value in the model is "yes" or 1) to all good users in the sample; $pn_i$ is the proportion of bad users in the group to all bad users in the sample; $y_i$ is the number of good users in the group; $n_i$ is the number of bad users in the group; $y_T$ is the number of all good users in the sample; and $n_T$ is the number of all bad users in the sample. In the embodiment of the invention, the sample is the user characteristic variables corresponding to the users to be classified.
For example, if the user characteristic variable is "payment status", a user who has repaid is a good user and the characteristic variable "payment status" of that user is 1, while a user who has not repaid is a bad user and the characteristic variable "payment status" of that user is 0.
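As a worked illustration of the two formulas, consider one group with hypothetical counts (the numbers below are not from the source) and assume the common definition of the WOE value as the natural logarithm of the ratio of the good-user proportion to the bad-user proportion:

```latex
% Hypothetical counts: y_i = 30, n_i = 10, y_T = 100, n_T = 100
py_i = \frac{y_i}{y_T} = \frac{30}{100} = 0.30, \qquad
pn_i = \frac{n_i}{n_T} = \frac{10}{100} = 0.10

WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln(3) \approx 1.0986

IV_i = (py_i - pn_i) \cdot WOE_i = (0.30 - 0.10) \times 1.0986 \approx 0.2197
```

A positive WOE value here means the group holds a larger share of the good users than of the bad users. Under the usual convention, the IV value of a variable is the sum of the per-group contributions over all of its groups.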
The random forest model and the LightGBM model are existing models known in the prior art.
The scoring card model is used for obtaining the user category of the user to be classified according to the input WOE value corresponding to each bin, wherein the WOE value corresponding to each bin is obtained by binning the behavior characteristic scores output by each sub-model in the behavior classification model and applying the calculation formula of the WOE value.
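The source does not spell out the internals of the scoring card model, so the following is only a sketch under stated assumptions: each behavior characteristic score is mapped to the WOE value of the bin it falls in, the WOE values are combined linearly into log-odds, and the log-odds are scaled to points with the widely used "points to double the odds" (PDO) convention. The bin edges, WOE table, weights, base points, cut-off and category labels are all hypothetical.

```python
import bisect
import math
from typing import List, Sequence


def woe_for_score(score: float, bin_edges: Sequence[float], bin_woe: Sequence[float]) -> float:
    """Look up the WOE value of the bin a score falls in.

    bin_edges are ascending cut points; a score equal to an edge belongs to the higher bin.
    """
    return bin_woe[bisect.bisect_right(bin_edges, score)]


def scorecard_points(woe_values: Sequence[float], weights: Sequence[float],
                     intercept: float = 0.0, base_points: float = 600.0,
                     pdo: float = 20.0) -> float:
    """Combine WOE values linearly into log-odds and scale them to scorecard points."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, woe_values))
    factor = pdo / math.log(2)
    return base_points + factor * log_odds


def user_category(points: float, cutoff: float = 600.0) -> str:
    # Hypothetical two-class labelling; the source does not list the categories.
    return "good" if points >= cutoff else "bad"


# Hypothetical bins for behavior characteristic scores and their WOE values.
bin_edges = [0.15, 0.20]
bin_woe = [0.8, 0.0, -0.9]
behavior_scores = [0.12, 0.22]          # e.g. application score, common-debt score
woe_values: List[float] = [woe_for_score(s, bin_edges, bin_woe) for s in behavior_scores]
points = scorecard_points(woe_values, weights=[1.0, 1.0])
print(woe_values, round(points, 2), user_category(points))
```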
The auxiliary model performs binning on the user characteristic variables and first judges whether any user characteristic variable in the bins lacks a corresponding variable value, that is, whether a value is missing. If no value is missing, the step of converting the user characteristic variables into first model-entering variables is executed. If values are missing, the missing variable values in each bin are acquired, a bin containing the missing variable values is established, and the WOE value of that bin is calculated and compared with the WOE value of a preset target bin. If the WOE value of the bin is smaller than or equal to the WOE value of the preset target bin, the missing variable values are assigned a preset value. If the WOE value of the bin is larger than the WOE value of the preset target bin, second related characteristic variables are acquired, where the second related characteristic variables are associated with each user characteristic variable whose value is missing; the second related characteristic variables are binned, the WOE value of the resulting bin is calculated, and that WOE value is again compared with the preset target WOE value.
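The decision flow of the auxiliary model can be sketched as follows. This is a simplified illustration: missing values are represented by `None`, labels are 1 for good users and 0 for bad users, the WOE value is assumed to be ln(py / pn), and the function names, return strings and sample data are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def missing_bin_woe(values: Sequence[Optional[float]], labels: Sequence[int]) -> float:
    """WOE value of the bin that collects all samples whose variable value is missing."""
    missing = [i for i, v in enumerate(values) if v is None]
    y_i = sum(labels[i] for i in missing)
    n_i = len(missing) - y_i
    y_total = sum(labels)
    n_total = len(labels) - y_total
    if 0 in (y_i, n_i, y_total, n_total):
        raise ValueError("WOE is undefined when a count is zero")
    return math.log((y_i / y_total) / (n_i / n_total))


def handle_missing(values: Sequence[Optional[float]], labels: Sequence[int],
                   target_woe: float, preset_value: float
                   ) -> Tuple[List[Optional[float]], str]:
    """Return the (possibly filled) values and the next action to take."""
    if all(v is not None for v in values):
        return list(values), "convert"            # nothing missing: go to conversion
    if missing_bin_woe(values, labels) <= target_woe:
        filled = [preset_value if v is None else v for v in values]
        return filled, "convert"                  # fill with the preset value, then convert
    return list(values), "bin_related_variable"   # fall back to the second related variable


# Hypothetical sample with two missing values.
values = [1.0, None, 2.0, None, 3.0, 4.0]
labels = [1, 1, 0, 0, 1, 0]
filled, action = handle_missing(values, labels, target_woe=0.1, preset_value=0.0)
print(filled, action)
```

In the full flow the `"bin_related_variable"` branch would bin the second related characteristic variable and repeat the missing-value check, which is not shown here.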
The channel level model obtains a score of a channel to which the user characteristic variable belongs, for an input user characteristic variable, where the channel to which the user characteristic variable belongs is a source of user behavior information corresponding to the user characteristic variable, for example, the source is "a bank" or "a financial institution".
For the input user characteristic variables, the auxiliary algorithm model processes the user characteristic variables and determines the relevance among them, so that the output result of the model is more accurate.
A user classification method provided in an embodiment of the present invention is described in detail below with reference to specific embodiments. As shown in Fig. 1, the specific steps are as follows:
Step S101, acquiring a plurality of pieces of user behavior information of a user to be classified, and determining user characteristic variables based on the user behavior information;
In practical applications, if a user needs a loan, information about the user's past behavior in transaction activities may be obtained, such as application information for applying for a loan, borrowing information for borrowing from a person or an organization, or repayment information for repayment.
In this step, there may be a plurality of users to be classified. When one of these users needs a loan, after the user's authorization information is received, a plurality of pieces of user behavior information recorded for the user in historical transaction activities are acquired, and the corresponding user characteristic variables are determined from the acquired user behavior information. In this way, the user characteristic variables corresponding to each of the users to be classified are obtained.
Illustratively, the users to be classified include user A and user B, and the obtained user characteristic variables include "user A's month-1 borrowing amount is 50", "user A's month-1 borrowing count is 1", "user B's month-1 borrowing amount is 150" and "user B's month-1 borrowing count is 2".
Step S102, converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
Illustratively, the user characteristic variables acquired in step S101 are, for example, a "repayment amount" of 50, a "repayment count" of 5, a "borrowing amount" of 100 and a "borrowing count" of 2. After the user characteristic variables are converted according to the preset conversion rule, several top-ranked user characteristic variables can be selected, and the values of the selected user characteristic variables for the users to be classified are taken as the first model-entering variables. For example, if the top-ranked user characteristic variables selected are "repayment amount", "repayment count" and "borrowing amount", the first model-entering variables are, for user A, a "repayment amount" of 50, a "repayment count" of 5 and a "borrowing amount" of 100, and, for user B, a "repayment amount" of 150, a "repayment count" of 2 and a "borrowing amount" of 100, and the like.
Step S103, inputting each first mode-entering variable into a corresponding sub-model in a behavior classification model according to user behavior information, so that each sub-model scores the first mode-entering variable to obtain a plurality of behavior characteristic scores;
In the embodiment of the present invention, the behavior classification model includes a plurality of sub-models, each corresponding to one type of user behavior information. For example, the first sub-model corresponds to the application information in the user behavior information, and the second sub-model corresponds to the common-debt information in the user behavior information.
In this step, according to the user behavior information, the different first model-entry variables are input into the corresponding sub-models of the behavior classification model. Each sub-model processes its input and then outputs a score for the corresponding user behavior information, so that a plurality of behavior characteristic scores are obtained.
Illustratively, for the common-debt information in the user behavior information, the first model-entry variables obtained in steps S101 and S102 (for user A, a "repayment amount" of 50, a "repayment count" of 5, and a "borrowing amount" of 100; for user B, a "repayment amount" of 150, a "repayment count" of 2, and a "borrowing amount" of 100) are input into the second sub-model, which corresponds to the common-debt information, to obtain each user's common-debt behavior characteristic score; for example, the common-debt behavior characteristic score of user A is 15%.
Illustratively, the behavior classification model may use a logistic regression algorithm, and the behavior characteristic score may range from 10% to 25%.
In step S104, the plurality of behavior characteristic scores are binned, and the WOE value corresponding to each bin is calculated;
In this step, the behavior characteristic scores obtained in step S103 are first binned; for example, equidistant (equal-width) binning may be used to divide the range between the maximum and the minimum score evenly into a plurality of bins. The WOE value corresponding to each bin is then calculated according to the WOE calculation formula given above.
Illustratively, among the behavior characteristic scores obtained in step S103, the common-debt behavior characteristic score of user A is 15%, the application behavior characteristic score of user B is 18%, and so on. These scores are divided into two equal-width bins, "10%-15%" and "15%-20%"; user A falls into the "10%-15%" bin and user B into the "15%-20%" bin. The WOE value corresponding to each bin is then calculated according to the WOE calculation formula given above.
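The equidistant binning in the example above can be sketched as follows. This is only an illustration: the function name `equal_width_bin` is hypothetical, scores are expressed in whole percentage points to avoid floating-point edge effects, and the right-closed bin convention is an assumption chosen so that a score of 15% lands in the "10%-15%" bin as in the example.

```python
import math


def equal_width_bin(score, low, high, n_bins):
    """Return the 0-based index of the equal-width bin containing `score`.

    Assumed convention: bins are right-closed, so a score lying exactly on
    an inner edge belongs to the lower bin; the first bin also contains `low`.
    """
    if not low <= score <= high:
        raise ValueError("score lies outside [low, high]")
    width = (high - low) / n_bins
    # ceil(...) - 1 sends a score on an edge to the lower bin.
    index = math.ceil((score - low) / width) - 1
    # Clamp so that score == low falls into the first bin.
    return min(max(index, 0), n_bins - 1)


# Scores from the example, in percentage points, over the range 10%-20%.
bin_of_user_a = equal_width_bin(15, 10, 20, 2)  # "10%-15%" bin
bin_of_user_b = equal_width_bin(18, 10, 20, 2)  # "15%-20%" bin
```

With two bins over 10%-20%, user A is placed in bin 0 and user B in bin 1, matching the assignment described in the example.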
In step S105, the WOE values corresponding to the bins are input into a scorecard model to obtain the user category of the user to be classified.
In this step, the WOE values corresponding to the bins obtained in step S104 are first input into the scorecard model. The scorecard model processes the input WOE values and outputs a score for each user, and each user is assigned to one of several score intervals, which gives the user category of the user to be classified.
Illustratively, the common-debt behavior characteristic score of user A is 15% and the application behavior characteristic score of user B is 18%; after input into the scorecard model, user A receives a score of 0.15 and user B a score of 0.58. For example, if the score intervals used for classification are 0-0.5, 0.5-0.8, and so on, user A belongs to the 0-0.5 category and user B to the 0.5-0.8 category. In practical applications, a user's willingness to repay can be evaluated from the score interval in which the user falls; for example, user B's score of 0.58 may indicate a relatively strong willingness to repay.
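Mapping a scorecard score to its score interval, as in the example above, might look like the following sketch. The function name `score_to_category` and the handling of scores that lie exactly on a boundary (assigned to the lower interval) are assumptions; the text does not specify either.

```python
from bisect import bisect_left


def score_to_category(score, edges):
    """Return the (lower, upper) score interval that contains `score`.

    `edges` is a sorted list of interval boundaries, e.g. [0.0, 0.5, 0.8].
    Assumption: a score on an inner boundary belongs to the lower interval.
    """
    if not edges[0] <= score <= edges[-1]:
        raise ValueError("score lies outside the configured intervals")
    # bisect_left gives the position of the first edge >= score.
    upper = max(bisect_left(edges, score), 1)
    return edges[upper - 1], edges[upper]


edges = [0.0, 0.5, 0.8]
category_a = score_to_category(0.15, edges)  # user A
category_b = score_to_category(0.58, edges)  # user B
```

Here user A is assigned to the 0-0.5 interval and user B to the 0.5-0.8 interval, as stated in the example.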
According to the embodiment of the invention, each first model-entry variable is input into the corresponding sub-model of the behavior classification model according to the user behavior to which the user behavior information corresponds, so that each sub-model scores its input and a plurality of behavior characteristic scores are obtained; the behavior characteristic scores are binned and the WOE value corresponding to each bin is calculated; and the WOE values corresponding to the bins are input into the scorecard model to obtain the user category of the user to be classified. Because the behavior classification model scores users through different sub-models, the output behavior characteristic scores are relatively fine-grained, the scorecard model's evaluation of user behavior in different periods deviates little, and the user's willingness to repay is evaluated effectively.
In order to make the screening of user characteristic variables more efficient, and at the same time to find variables that have a small IV value but high importance so as to improve model performance, as shown in Fig. 2, in another embodiment of the present invention, converting each user characteristic variable into the first model-entry variable according to the preset conversion rule includes:
In step S201, the IV value of each user characteristic variable is calculated, the IV values are sorted, and the user characteristic variables with the largest IV values are selected as first input variables;
In this step, the IV value of each user characteristic variable is calculated according to the IV calculation formula, the resulting IV values are sorted from largest to smallest, and the user characteristic variables at the front of the ranking are selected as first input variables.
Illustratively, the IV of the user characteristic variable "repayment amount" of the user to be classified is 0.4, and the IV of the user characteristic variable "repayment count" is 0.2. After sorting, "repayment amount" (IV 0.4) ranks ahead of "repayment count" (IV 0.2), and "repayment amount", which lies in the top 50% of the ranking, is selected as the first input variable.
In step S202, the importance of each user characteristic variable is ranked using a random forest model, and, from the user characteristic variables of highest importance, a user characteristic variable different from the first input variable is selected as a second input variable;
In this step, each user characteristic variable is first input into a random forest model, which ranks the input variables by importance and outputs the most important variable first. Then, from the front of this importance ranking, the user characteristic variables that differ from the first input variable are selected as second input variables.
Illustratively, after the user characteristic variables "repayment amount" and "borrowing count" of the user to be classified are input into the random forest model, the output importance ranking is "borrowing count" with a value of 0.5 followed by "repayment amount" with a value of 0.3. "Borrowing count", which lies in the top 50% of the ranking and differs from the first input variable "repayment amount", is selected as the second input variable.
In step S203, the importance of each user characteristic variable is ranked using a lightGBM model, and, from the user characteristic variables of highest importance, a user characteristic variable different from the second input variable is selected as a third input variable;
In this step, each user characteristic variable is first input into the lightGBM model, which ranks the input variables by importance and outputs the most important variable first. Then, from the front of this importance ranking, the user characteristic variables that differ from both the first input variable and the second input variable are selected as third input variables.
Illustratively, after the user characteristic variables "repayment amount", "borrowing count", and "borrowing amount" of the user to be classified are input into the lightGBM model, the output importance ranking is "borrowing amount" with a value of 0.6, "borrowing count" with a value of 0.5, and "repayment amount" with a value of 0.2. "Borrowing amount", which lies in the top 50% of the ranking and differs from the first input variable "repayment amount" and the second input variable "borrowing count", is selected as the third input variable.
In step S204, the first input variable, the second input variable, and the third input variable are de-duplicated to obtain the first model-entry variable.
Illustratively, the first input variables are "repayment amount" with an IV of 0.4 and "repayment count" with an IV of 0.2, the second input variable is "repayment count" with a value of 0.2, and the third input variable is "borrowing amount" with a value of 0.3. After de-duplication, the remaining variables are "repayment amount" (0.4), "repayment count" (0.2), and "borrowing amount" (0.3), and the values of these selected variables for the users to be classified are taken as the first model-entry variables.
According to the embodiment of the invention, each user characteristic variable is converted into the first model-entry variable according to the preset conversion rule, so that the user characteristic variables are screened more efficiently and, at the same time, variables with a small IV value but high importance can be found, which improves model performance.
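Steps S201 to S204 can be sketched roughly as below. The sketch takes the IV values and the two importance rankings as given dictionaries (in practice they would come from the IV calculation, a random forest, and a LightGBM model); the function names, the "top 50%" cut-off as a parameter, and the rule of keeping at least one variable are assumptions for illustration. The scores are the ones used in the examples for S201 to S203.

```python
def top_fraction(scores, fraction=0.5):
    """Names of the highest-scoring variables, keeping the top `fraction`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # assumed: keep at least one
    return ranked[:keep]


def select_model_entry_variables(iv, rf_importance, gbm_importance, fraction=0.5):
    """Union of the variables picked by IV, random forest, and LightGBM rankings."""
    first = top_fraction(iv, fraction)                                   # S201
    second = [v for v in top_fraction(rf_importance, fraction)
              if v not in first]                                         # S202
    third = [v for v in top_fraction(gbm_importance, fraction)
             if v not in first and v not in second]                      # S203
    # S204: merge and de-duplicate while preserving order.
    return list(dict.fromkeys(first + second + third))


iv = {"repayment amount": 0.4, "repayment count": 0.2}
rf_importance = {"borrowing count": 0.5, "repayment amount": 0.3}
gbm_importance = {"borrowing amount": 0.6, "borrowing count": 0.5,
                  "repayment amount": 0.2}

selected = select_model_entry_variables(iv, rf_importance, gbm_importance)
```

With these inputs the selection is "repayment amount" (by IV), "borrowing count" (by random forest importance), and "borrowing amount" (by LightGBM importance), which shows how a variable outside the IV shortlist can still be admitted through an importance ranking.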
The IV value of a user characteristic variable is generally calculated from the WOE values of its bins. To this end, in a further embodiment of the present invention, calculating the IV value of each user characteristic variable comprises:
binning each user characteristic variable to obtain a plurality of bins, wherein each bin contains one or more user characteristic variables;
In this step, the user characteristic variables obtained in step S101 are binned; for example, equal-frequency binning may be used, so that the number of user characteristic variables in each bin is substantially equal.
Illustratively, the user characteristic variable is "repayment amount". If, according to the obtained user behavior information, the repayment amount of user A is 50 yuan and that of user B is 200 yuan, the variable "repayment amount" can be divided by equal-frequency binning into a first bin of 0-100 yuan and a second bin of 100-200 yuan, with user A in the first bin and user B in the second bin.
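Equal-frequency binning, as mentioned above, can be illustrated with a short sketch. The function name `equal_frequency_bins` is hypothetical, and the rule that any remainder is spread over the first bins is an assumption; the text only requires that the bins hold roughly the same number of values.

```python
def equal_frequency_bins(values, n_bins):
    """Split `values` into `n_bins` groups of (nearly) equal size, in sorted order."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Repayment amounts of user A (50 yuan) and user B (200 yuan), two bins.
repayment_bins = equal_frequency_bins([200, 50], 2)
```

For the two users in the example, each bin receives exactly one value: user A's 50 in the first bin and user B's 200 in the second.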
calculating the WOE value of the user characteristic variable in each bin;
In this step, based on the binning result of the previous step, the WOE value of the user characteristic variable in each bin can be calculated according to the WOE calculation formula given above.
For example, as shown in Table 1 below (the data in the table are used only to explain the embodiment of the present invention and are not real data), there are 100000 users to be classified, and the repayment amount information is selected from the behavior information of these 100000 users. For example, if the repayment amount of user A is 0 yuan and that of user B is 150 yuan, the user characteristic variable "repayment amount" is 0 for user A and 150 for user B. Binning yields 4 bins, including "repayment amount ≤ 100" and "repayment amount ≤ 200"; user A falls into the "repayment amount ≤ 100" bin, user B into the "repayment amount ≤ 200" bin, and so on for the remaining users. After binning, the total number of repaying users is 10000 and the total number of non-repaying users is 90000; in the "repayment amount ≤ 100" bin, the number of repaying users is 2500 and the number of non-repaying users is 47500, and the corresponding value calculated according to the WOE calculation formula is WOE_1 = -0.74.
Table 1. Repayment amount (user characteristic variable) of the users to be classified
[Table image: Figure BDA0003036207490000161]
calculating the IV value of each user characteristic variable from the WOE values of the user characteristic variable in each bin.
In this step, the IV value of each user characteristic variable is calculated from the WOE value of the user characteristic variable in each bin, according to the IV calculation formula.
For example, referring to Table 1 above, in the "repayment amount ≤ 100" bin the number of repaying users is 2500 and the number of non-repaying users is 47500; the corresponding value calculated according to the WOE calculation formula is WOE_1 = -0.74, and the corresponding IV_1 = 0.20.
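The Table 1 figures can be checked with a small sketch. The WOE and IV formulas themselves are given earlier in the document and are not reproduced in this section, so the definitions below are an assumption: the commonly used form WOE_i = ln((good_i / good_total) / (bad_i / bad_total)) with repaying users counted as "good", and a per-bin IV term of (good share - bad share) × WOE_i. The function names are hypothetical.

```python
import math


def woe(good_in_bin, good_total, bad_in_bin, bad_total):
    """Weight of evidence of one bin under the assumed definition."""
    good_share = good_in_bin / good_total  # share of all repaying users in this bin
    bad_share = bad_in_bin / bad_total     # share of all non-repaying users in this bin
    return math.log(good_share / bad_share)


def iv_term(good_in_bin, good_total, bad_in_bin, bad_total):
    """Contribution of one bin to the IV of the variable (assumed definition)."""
    good_share = good_in_bin / good_total
    bad_share = bad_in_bin / bad_total
    return (good_share - bad_share) * woe(good_in_bin, good_total,
                                          bad_in_bin, bad_total)


# "repayment amount <= 100" bin from Table 1.
woe_1 = woe(2500, 10000, 47500, 90000)
iv_1 = iv_term(2500, 10000, 47500, 90000)
print(round(woe_1, 3), round(iv_1, 3))
```

Under these assumed definitions the bin evaluates to roughly -0.747 and 0.208, which is in line with the WOE_1 = -0.74 and IV_1 = 0.20 quoted in the example. The IV of the whole variable would be the sum of such terms over all of its bins.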
The embodiment of the invention calculates the IV value of each user characteristic variable from the WOE values of the bins of that variable.
In order to fully mine the relationships between the user characteristic variables and make the final output more fine-grained, in another embodiment of the present invention, calculating the IV value of each user characteristic variable further includes:
for each user characteristic variable, acquiring a first related characteristic variable associated with the user characteristic variable;
In the embodiment of the present invention, the first related characteristic variable is a user characteristic variable associated with the given user characteristic variable; the two represent the same characteristic over different time periods. For example, "30-day borrowing count" and "60-day borrowing count" are associated user characteristic variables.
combining the user characteristic variable with the first related characteristic variable to obtain a new user characteristic variable, and performing the step of binning each user characteristic variable to obtain a plurality of bins.
In this step, for example, the user characteristic variable "number of successful borrowings within 60 days" and the first related characteristic variable "number of failed borrowings within 60 days" can be combined into a new user characteristic variable, "number of successful borrowings within 60 days and number of failed borrowings within 60 days". The new variable is then treated as a user characteristic variable, and the step of binning each user characteristic variable to obtain a plurality of bins is performed.
For example, in another embodiment of the present invention, calculating the IV value of each user characteristic variable may further include: for each user characteristic variable, obtaining a maximum value, a minimum value, and special values, where the minimum and maximum values are the values lying in the 0%-5% and 95%-100% ranges of the variable's distribution interval, respectively, and the special values include values such as 0 and -1; combining the user characteristic variable with the maximum value, the minimum value, and the special values, respectively, to obtain new user characteristic variables; and, for each new user characteristic variable, performing the step of binning each user characteristic variable to obtain a plurality of bins. For example, if the user characteristic variable "number of successful borrowings ≤ 10" of the user to be classified lies in the 95%-100% range of the distribution and the user characteristic variable "number of failed borrowings ≤ 10" lies in the 50%-60% range, then "number of successful borrowings ≤ 10" is selected as the maximum value and combined with "number of successful borrowings within 60 days" to form a new user characteristic variable of the user to be classified, namely the proportion for "number of successful borrowings within 60 days ≤ 10".
For example, in another embodiment of the present invention, calculating the IV value of each user characteristic variable may further include: after binning each user characteristic variable to obtain a plurality of bins, obtaining the bins of several different categories, cross-combining these bins into new user characteristic variables, and, for each new user characteristic variable, performing the step of binning each user characteristic variable to obtain a plurality of bins. For example, if the first category, "3-month borrowing count", has 6 bins and the second category, "3-month overdue count", has 7 bins, cross-combination yields 6 × 7 = 42 bins.
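The cross-combination of bins in the last example can be sketched as a Cartesian product. The bin labels below are placeholders invented for illustration; only the counts (6 and 7) come from the text.

```python
from itertools import product


def cross_combine(bins_a, bins_b):
    """Pair every bin of one category with every bin of the other."""
    return list(product(bins_a, bins_b))


# Placeholder labels: 6 bins of "3-month borrowing count",
# 7 bins of "3-month overdue count".
borrowing_bins = [f"borrow_bin_{i}" for i in range(6)]
overdue_bins = [f"overdue_bin_{j}" for j in range(7)]

crossed = cross_combine(borrowing_bins, overdue_bins)
print(len(crossed))  # 42
```

Each crossed pair acts as one bin of the new combined variable, giving 6 × 7 = 42 bins in total.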
The embodiment of the invention can fully mine the relation among the characteristic variables of the user, so that the granularity of the result output by the model is finer.
In order to filter out users with fraudulent intent and ensure the validity of the first model-entry variable, in a further embodiment of the present invention, before the first model-entry variable is input into the behavior classification model, the method further comprises:
inputting the first model-entry variable into an anti-fraud model to obtain an anti-fraud intention value;
For example, the anti-fraud model may use a logistic regression algorithm.
In this step, the first model-entry variable obtained in step S102 is input into the anti-fraud model, which processes it and then outputs an anti-fraud intention value. If a plurality of first model-entry variables are input, a plurality of corresponding anti-fraud intention values are obtained.
Illustratively, the resulting anti-fraud intention value is 60%.
comparing the anti-fraud intention value with a preset fraud intention value;
In this step, each obtained anti-fraud intention value is compared with the preset fraud intention value.
if the anti-fraud intention value is less than or equal to the preset fraud intention value, performing the step of inputting the first model-entry variable into the behavior classification model.
In this step, for example, if the anti-fraud intention value output by the anti-fraud model is 60%, which is less than or equal to the preset fraud intention value of 80%, the user's fraud intention is low, and the step of inputting the first model-entry variable into the behavior classification model is performed.
According to the embodiment of the invention, users with fraudulent intent are filtered out by the anti-fraud model, which ensures the validity of the first model-entry variable.
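The threshold comparison described above amounts to a simple gate in front of the behavior classification model. A minimal sketch follows; the function name is hypothetical, and user B's value of 0.90 is invented purely to show a user being filtered out (only user A's 60% and the 80% threshold appear in the text).

```python
def users_passing_anti_fraud(anti_fraud_values, preset_value):
    """Return the users whose anti-fraud intention value does not exceed the preset value."""
    return [user for user, value in anti_fraud_values.items()
            if value <= preset_value]


anti_fraud_values = {"A": 0.60, "B": 0.90}  # B's value is hypothetical
passed = users_passing_anti_fraud(anti_fraud_values, preset_value=0.80)
```

Only the first model-entry variables of the users in `passed` (here, user A) would go on to the behavior classification model.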
In order to fully mine the connection between the user characteristic variables and make the granularity of the final output result finer, in another embodiment of the present invention, the method further comprises:
judging whether the user category of the user to be classified is in a preset user category set or not;
in this step, according to the user category of the user to be classified obtained in step S105, it is determined whether the user category is located in a preset user category set, for example, if the user category of the user to be classified is "0 to 0.5" and the preset user category set is "0.5 to 0.8", it may be determined that the user category is located outside the preset user category set; if the user category of the user to be classified is 0.5-0.8, it can be determined that the user category is located in the preset user category set.
If the user category of the user to be classified is located in a preset user category set, outputting the user category of the user to be classified;
in this step, the output user category of the user to be classified is consistent with the user category of the user to be classified obtained in step S105.
If the user category of the user to be classified is outside the preset user category set, inputting the user characteristic variables into an auxiliary model so that the auxiliary model bins the user characteristic variables;
In the embodiment of the invention, the auxiliary model is used to screen the user characteristic variables again. Illustratively, the auxiliary model may be a combination of a lightGBM model and GridSearchCV. The algorithm executed by the lightGBM model is based on decision trees; it requires little manual intervention on the variables, automatically bins the user characteristic variables, and can bin or assign values to variable values that have no corresponding value. GridSearchCV tunes parameters automatically; the main parameters tuned are the maximum depth of a decision tree in the lightGBM model (max_depth), the learning rate of the model (learning_rate), the number of leaves on one tree (num_leaves), and the minimum number of data points on one leaf (min_child_samples). A value range can be given for each parameter to be tuned, and GridSearchCV generates the different parameter combinations from these ranges, which greatly saves time and labor.
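To illustrate how a grid search expands per-parameter value ranges into parameter combinations, the sketch below enumerates a small grid with the standard library only. The candidate values are hypothetical and not taken from the text; the helper `expand_grid` is likewise an illustration of the idea rather than the library's implementation. A dictionary of this shape is what scikit-learn's GridSearchCV accepts as its `param_grid` argument.

```python
from itertools import product

# Hypothetical candidate values for the four parameters named in the text.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "num_leaves": [15, 31],
    "min_child_samples": [20, 50],
}


def expand_grid(grid):
    """List every combination of the candidate values, one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


combinations = expand_grid(param_grid)
print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```

Each of the 24 dictionaries is one candidate configuration that a grid search would fit and evaluate by cross-validation, which is why giving only the value ranges is enough to automate the tuning.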
judging whether any user characteristic variable in each bin lacks a corresponding variable value;
In this step, the auxiliary model bins the user characteristic variables to obtain a plurality of bins, and it is then judged whether any user characteristic variable in each bin lacks a corresponding variable value. For example, if the user characteristic variable "borrowing count" of user A is 1 and that of user B is null, the "borrowing count" of user B is a variable value with no corresponding value.
if there is no variable value lacking a corresponding value, performing the step of converting each user characteristic variable into a first model-entry variable according to the preset conversion rule.
In this step, if a bin contains no variable value lacking a corresponding value, the contents of the bin can be assigned to the user characteristic variables, and the step of converting each user characteristic variable into a first model-entry variable according to the preset conversion rule is performed.
For example, if the bin "repayment amount ≤ 100" contains only two user characteristic variables, namely a "repayment amount" of 50 for user A and a "repayment amount" of 0 for user B, then the bin contains no variable value lacking a corresponding value, and its contents can be assigned to the user characteristic variables to be converted into the first model-entry variables.
The embodiment of the invention can fully excavate the relation between the user characteristic variables, so that the granularity of the result output by the model is finer.
In order to fully mine the connection between the user characteristic variables and make the granularity of the final output result finer, in another embodiment of the present invention, the method further comprises:
if variable values lacking a corresponding value exist, acquiring those variable values from the bins, establishing a bin containing the variable values lacking a corresponding value, and calculating the WOE value of that bin;
In this step, the auxiliary model bins the user characteristic variables to obtain a plurality of bins, and it is judged whether any user characteristic variable in each bin lacks a corresponding variable value. If so, the variable values lacking a corresponding value are acquired from the bins one by one until all of them have been obtained, a bin containing these variable values is established, and the WOE value of that bin is calculated according to the WOE calculation formula.
Illustratively, the auxiliary model may be a combination of a lightGBM model and GridSearchCV; automatic binning may be implemented through the lightGBM model, and automatic parameter tuning through GridSearchCV.
For example, if the bin "repayment amount ≤ 100" contains only three user characteristic variables, namely a "repayment amount" of 50 for user A, a "repayment amount" of 0 for user B, and a null "borrowing count" for user B, then the null "borrowing count" of user B is acquired and assigned to the bin of variable values lacking a corresponding value.
if the WOE value of the bin containing the variable values lacking a corresponding value is less than or equal to a preset target bin WOE value, assigning a preset value to each variable value lacking a corresponding value, and performing the step of converting each user characteristic variable into a first model-entry variable according to the preset conversion rule;
In the embodiment of the present invention, the preset target bin WOE value is the threshold for the WOE value of the bin containing the variable values lacking a corresponding value, and the preset value is used to replace a variable value lacking a corresponding value.
In this step, if the WOE value of the bin containing the variable values lacking a corresponding value is less than or equal to this threshold, the auxiliary model can assign the preset value to each variable value lacking a corresponding value; the result is used as a user characteristic variable, and the step of converting each user characteristic variable into a first model-entry variable according to the preset conversion rule is performed.
For example, the auxiliary model may be a combination of a lightGBM model and GridSearchCV; the variable values lacking a corresponding value may be assigned through the lightGBM model, and automatic parameter tuning may be implemented through GridSearchCV.
For example, the bin containing the variable values lacking a corresponding value includes the null "borrowing count" of user A, and so on. If the WOE value of this bin is 0.25 and the preset target bin WOE value is 0.5, the WOE value of the bin is smaller than the preset target bin WOE value; the "borrowing count" of user A is then assigned the value 5 and the "borrowing count" of user B the value 6, and these assigned values ("borrowing count" of 5 for user A and 6 for user B) are converted into the first model-entry variables.
if the WOE value of the bin containing the variable values lacking a corresponding value is greater than the preset target bin WOE value, acquiring a second related characteristic variable, the second related characteristic variable being associated with each user characteristic variable lacking a corresponding value; binning the second related characteristic variables to obtain a plurality of bins, each containing the second related characteristic variables of one or more user characteristic variables; and performing the step of judging whether any user characteristic variable in each bin lacks a corresponding variable value.
For example, suppose the bin containing the variable values lacking a corresponding value includes the null "30-day borrowing count" of user A, and so on. If the WOE value of this bin is 0.75 and the preset target bin WOE value is 0.5, the WOE value of the bin is greater than the preset target bin WOE value. The second related characteristic variable of user A's null "30-day borrowing count", namely a "60-day borrowing count" of 10, is then acquired, as is the second related characteristic variable of user B's null "30-day borrowing count", namely a "90-day borrowing count" of 8, and so on. The "60-day borrowing count" of 10 for user A and the "90-day borrowing count" of 8 for user B are binned, and the WOE value is calculated according to the WOE calculation formula and compared with the preset target bin WOE value of 0.5.
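The two branches for handling variable values that lack a corresponding value can be summarized in a short decision sketch. The function name, the returned action labels, and the dictionary layout are assumptions for illustration; the numeric values are those of the two examples above.

```python
def resolve_missing_bin(missing_bin_woe, target_woe, preset_values, related_values):
    """Choose how to treat the bin of variable values lacking a corresponding value.

    - WOE <= target: assign the preset values and go on to the conversion step.
    - WOE >  target: fall back to the second related characteristic variables,
      which are then binned and checked again.
    """
    if missing_bin_woe <= target_woe:
        return "assign_preset", dict(preset_values)
    return "use_related_variable", dict(related_values)


preset_values = {"A": 5, "B": 6}     # preset "borrowing count" values
related_values = {"A": 10, "B": 8}   # 60-day / 90-day borrowing counts

low_woe_case = resolve_missing_bin(0.25, 0.5, preset_values, related_values)
high_woe_case = resolve_missing_bin(0.75, 0.5, preset_values, related_values)
```

With a bin WOE of 0.25 the preset values are assigned, while a bin WOE of 0.75 triggers the fallback to the related variables, mirroring the two examples in the text.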
The embodiment of the invention can fully excavate the relation between the user characteristic variables, so that the granularity of the result output by the model is finer.
In order to use the user characteristic variables of the users to be classified to score the channel to which the user behavior information belongs, and thereby determine the reliability of that channel, in another embodiment of the present invention, the method further includes:
acquiring user characteristic variables corresponding to a plurality of pieces of user behavior information of a channel to be scored, wherein the user characteristic variables are associated with one another;
converting the user characteristic variables into second model-entering variables according to a preset conversion rule;
and inputting the second model-entering variables into a channel level model to obtain the level of the channel to be scored.
In the embodiment of the invention, user characteristic variables corresponding to a plurality of pieces of user behavior information of the channel to be scored are acquired, and these user characteristic variables are associated with one another; for example, they may be a combination of user characteristic variables that represents the overdue rate, such as "overdue times within 3 months" and "overdue amount within 3 months". The user characteristic variables are converted into second model-entering variables according to the preset conversion rule. The second model-entering variables are then input into the channel level model, which may, for example, be a logistic regression algorithm, and the channel level model outputs the level of the channel to be scored; for example, the score of the channel to which user A's user characteristic variables belong is 0.85.
The embodiment of the invention can utilize the user characteristic variables of the users to be classified to grade the channels to which the user behavior information belongs so as to evaluate the reliability of the channels.
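As an illustration of the channel scoring described above, the following minimal sketch applies a logistic regression function to channel-level averages of two overdue-related variables. The coefficients, the intercept, the variable names, and the choice to average the users' variables per channel are all assumptions made for the example; the patent does not specify them.

```python
import math

# Hypothetical coefficients; a real channel level model would be fitted on data.
WEIGHTS = {"overdue_times_3m": -0.8, "overdue_amount_3m": -0.002}
INTERCEPT = 2.5


def channel_features(users):
    """Average each variable over all users that belong to the channel."""
    return {name: sum(u[name] for u in users) / len(users) for name in WEIGHTS}


def channel_score(users):
    """Logistic regression score in (0, 1); higher is read here as more reliable."""
    features = channel_features(users)
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


clean_channel = [
    {"overdue_times_3m": 0, "overdue_amount_3m": 0.0},
    {"overdue_times_3m": 1, "overdue_amount_3m": 50.0},
]
risky_channel = [
    {"overdue_times_3m": 4, "overdue_amount_3m": 900.0},
    {"overdue_times_3m": 3, "overdue_amount_3m": 700.0},
]
print(channel_score(clean_channel), channel_score(risky_channel))
```

Under these assumed coefficients a channel whose users have more overdue events receives a lower score than a channel whose users have few.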
In order to determine the correlation between the user characteristic variables and make the output result of the model more accurate, in a further embodiment of the present invention, the method further comprises:
acquiring a plurality of user behavior information of a user to be classified, and determining a user characteristic variable based on the user behavior information;
converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
and inputting the first model-entering variables into an auxiliary algorithm model to obtain related variables of the first model-entering variables.
In the embodiment of the invention, acquiring a plurality of pieces of user behavior information of the user to be classified and determining the user characteristic variables based on the user behavior information is the same as step S101; converting each user characteristic variable into a first model-entering variable according to the preset conversion rule is the same as step S102; and a plurality of first model-entering variables are input into the auxiliary algorithm model, which may, for example, be a neural network algorithm, and the related variables of the first model-entering variables are obtained after processing by the auxiliary algorithm model.
For example, after the first model-entering variable "3 months overdue times" is processed by the auxiliary algorithm model, its related variable "3 months overdue amount" can be obtained.
The embodiment of the invention can determine the relevance between the user characteristic variables, so that the output result of the model is more accurate.
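The text names a neural network as one possible auxiliary algorithm model. As a much simpler stand-in, and only to make the idea of "related variables" concrete, the sketch below uses Pearson correlation instead of a neural network. The threshold of 0.8, the column names, and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)  # fails for constant columns (zero variance)


def related_variables(target, columns, threshold=0.8):
    """Names of variables whose absolute correlation with `target` meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if name != target and abs(pearson(columns[target], values)) >= threshold
    ]


columns = {
    "overdue_times_3m": [0, 1, 2, 3, 4],
    "overdue_amount_3m": [0.0, 120.0, 260.0, 330.0, 500.0],
    "login_count_30d": [7, 3, 9, 4, 6],
}
print(related_variables("overdue_times_3m", columns))
```

In this toy data only the overdue amount tracks the overdue times closely, so it is the only variable reported as related.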
In practical applications, in order to implement the foregoing user classification method, there is provided a user classification apparatus, as shown in fig. 3, including:
the acquiring module 11 is configured to acquire a plurality of pieces of user behavior information corresponding to users to be classified and a user characteristic variable corresponding to each piece of user behavior information;
the conversion module 12 is configured to convert the user characteristic variable into a first model-entering variable according to a preset conversion rule;
the first model module 13 is configured to input each of the first model-entering variables into a corresponding sub-model in the behavior classification model according to a user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variables to obtain a plurality of behavior feature scores;
a calculating module 14, configured to perform binning on the multiple behavior feature scores, and calculate a WOE value corresponding to each bin;
and the second model module 15 is configured to input the WOE value corresponding to each bin into a scoring card model, so as to obtain a user category of the user to be classified.
Optionally, the conversion module includes:
the first selection unit is used for calculating the IV value of each user characteristic variable, sequencing a plurality of IV values and selecting a plurality of user characteristic variables with the maximum IV values as first input variables;
the second selection unit is used for sorting the importance of each user characteristic variable by using a random forest model, and selecting a user characteristic variable different from the first input variable from a plurality of user characteristic variables with the highest importance as a second input variable;
a third selecting unit, configured to perform importance ranking on each user characteristic variable by using a lightGBM model, and select, as a third input variable, a user characteristic variable that is different from the second input variable from among a plurality of user characteristic variables with the highest importance;
and the merging unit is used for de-duplicating the first input variable, the second input variable and the third input variable to obtain the first model-entering variable.
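The three selection units and the merging unit can be sketched as below. The IV values and the importance scores are passed in as plain dictionaries, so no particular random forest or lightGBM API is assumed; the variable names `a` to `e`, the scores, and the cut-off `k=2` are hypothetical.

```python
def top_k(scores, k, exclude=()):
    """Take the k highest-scoring variables, then drop any listed in `exclude`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked[:k] if name not in exclude]


def select_model_entering_variables(iv, rf_importance, lgbm_importance, k=2):
    first = top_k(iv, k)                                 # largest IV values
    second = top_k(rf_importance, k, exclude=first)      # differs from first
    third = top_k(lgbm_importance, k, exclude=second)    # differs from second
    merged = []
    for name in first + second + third:                  # order-preserving de-duplication
        if name not in merged:
            merged.append(name)
    return merged


# Hypothetical IV values and importance scores for five variables.
iv = {"a": 0.40, "b": 0.30, "c": 0.05, "d": 0.02, "e": 0.01}
rf = {"a": 0.50, "c": 0.30, "b": 0.15, "d": 0.05, "e": 0.00}
lgbm = {"d": 0.45, "a": 0.35, "b": 0.10, "c": 0.10, "e": 0.00}
selected = select_model_entering_variables(iv, rf, lgbm)
print(selected)
```

Here variable `a` is picked by both the first and the third selection, and the de-duplication keeps it only once in the final set.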
Optionally, the first selecting unit includes:
the first execution unit is used for binning the user characteristic variables to obtain a plurality of bins, each bin including one or more user characteristic variables;
the second execution unit is used for calculating the WOE value of the user characteristic variables in each bin;
and the third execution unit is used for calculating the IV value of each user characteristic variable based on the WOE value of the user characteristic variables in each bin.
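The chain of binning, WOE, and IV can be illustrated with the conventional credit-scoring definitions, which are assumed here because the patent's own formulas are not part of this passage: the WOE of a bin is the logarithm of the ratio between the bin's share of good samples and its share of bad samples, and the IV is the sum over bins of the share difference multiplied by the WOE. The counts in the example are made up.

```python
import math


def woe_and_iv(bins):
    """bins: list of (good_count, bad_count) per bin.

    Returns the per-bin WOE values and the resulting IV, using the
    conventional definitions (assumed, see the note above).
    Bins with a zero count would need smoothing and are not handled here.
    """
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / good_total
        bad_share = bad / bad_total
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv


# Hypothetical counts of good and bad users in three bins of one variable.
bins = [(80, 10), (60, 20), (20, 30)]
woes, iv = woe_and_iv(bins)
print(woes, iv)
```

A variable whose bins all have the same good-to-bad ratio yields an IV of zero, which is why ranking by IV favours variables that separate the two groups.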
Optionally, the first selecting unit further includes:
a fourth execution unit, configured to, for each user characteristic variable, obtain a first relevant characteristic variable associated with the user characteristic variable;
and the fifth execution unit is used for combining the user characteristic variable and the first related characteristic variable to obtain a new user characteristic variable, and for triggering the first execution unit.
Optionally, the apparatus further comprises:
the third model module input unit is used for inputting the first model entering variable into an anti-fraud model to obtain an anti-fraud intention value;
the third model module comparison unit is used for comparing the anti-fraud intention value with a preset fraud intention value;
and the third model module control unit is used for executing the first model module if the anti-fraud intention value is less than or equal to the preset fraud intention value.
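The gating performed by these three units can be sketched as follows. The two models are represented by arbitrary callables, and returning `None` when the anti-fraud intention value exceeds the preset value is an assumption, since this passage does not state what happens in that case.

```python
def classify_if_not_fraud(variables, anti_fraud_model, behavior_model, preset_fraud_value):
    """Run the behavior classification only when the anti-fraud check passes."""
    anti_fraud_value = anti_fraud_model(variables)
    if anti_fraud_value <= preset_fraud_value:
        return behavior_model(variables)
    return None  # assumed handling; not specified in the text


# Hypothetical stand-ins for the anti-fraud model and the behavior classification model.
def toy_anti_fraud_model(variables):
    return 0.9 if variables.get("device_changes_30d", 0) > 5 else 0.2


def toy_behavior_model(variables):
    return {"borrowing": 0.7, "repayment": 0.4}


passed = classify_if_not_fraud({"device_changes_30d": 1}, toy_anti_fraud_model, toy_behavior_model, 0.5)
blocked = classify_if_not_fraud({"device_changes_30d": 9}, toy_anti_fraud_model, toy_behavior_model, 0.5)
print(passed, blocked)
```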
Optionally, the apparatus further comprises:
the fourth model module is used for judging whether the user category of the user to be classified is in a preset user category set or not;
the fourth model module is used for outputting the user category of the user to be classified if the user category of the user to be classified is in a preset user category set;
a sixth execution unit of the fourth model module, configured to input the user characteristic variables into an auxiliary model if the user category of the user to be classified is outside the preset user category set, so that the auxiliary model bins the user characteristic variables;
the second judgment unit of the fourth model module is used for judging whether the user characteristic variables in each bin lack corresponding variable values;
and the fourth model module is used for executing the conversion module if no user characteristic variable lacks a corresponding variable value.
Optionally, the apparatus further comprises:
a seventh execution unit of the fourth model module, configured to, if some user characteristic variables lack corresponding variable values, collect the missing variable values from each bin, establish a bin containing the missing variable values, and calculate the WOE value of that bin;
an eighth execution unit of the fourth model module, configured to, if the WOE value of the bin containing the missing variable values is less than or equal to a preset target bin WOE value, assign a preset value to the missing variable values and execute the conversion module;
a ninth execution unit of the fourth model module, configured to, if the WOE value of the bin containing the missing variable values is greater than the preset target bin WOE value, acquire a second related characteristic variable associated with each user characteristic variable that lacks a corresponding variable value, bin the second related characteristic variables to obtain a plurality of bins, each bin including the second related characteristic variables of one or more user characteristic variables, and execute the second judgment unit of the fourth model module.
The user classification device provided by the embodiment of the invention acquires a plurality of pieces of user behavior information corresponding to the user to be classified and a user characteristic variable corresponding to each piece of user behavior information; converts the user characteristic variables into first model-entering variables according to a preset conversion rule; inputs each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; bins the plurality of behavior characteristic scores and calculates a WOE value corresponding to each bin; and inputs the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
The embodiment of the invention can evaluate the user behavior in different periods with smaller deviation, and effectively evaluate the repayment willingness of the user.
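The last two stages of the device, binning the behavior characteristic scores and feeding the per-bin WOE values into the scoring card model, can be sketched as below. A linear scorecard over WOE values is assumed as the scoring card model; the bin edges, WOE values, coefficients, base points, cut-off, and category labels are all hypothetical.

```python
import bisect

# Hypothetical bin edges for behavior characteristic scores in [0, 1]:
# bin 0 is [0, 0.3), bin 1 is [0.3, 0.6), bin 2 is [0.6, 1].
BIN_EDGES = [0.3, 0.6]
# Hypothetical WOE value of each bin (would be computed from labelled data).
BIN_WOE = [-0.9, 0.1, 0.8]
# Hypothetical scorecard coefficients, one per behavior sub-model.
COEFFICIENTS = {"borrowing": 30.0, "repayment": 50.0}
BASE_POINTS = 600.0


def score_to_woe(score):
    """Look up the WOE value of the bin that the score falls into."""
    return BIN_WOE[bisect.bisect_right(BIN_EDGES, score)]


def scorecard_points(behavior_scores):
    """Linear scorecard: base points plus coefficient times WOE per sub-model."""
    return BASE_POINTS + sum(
        COEFFICIENTS[name] * score_to_woe(score)
        for name, score in behavior_scores.items()
    )


def user_category(points, cutoff=600.0):
    """Map scorecard points to one of two illustrative categories."""
    return "high willingness" if points >= cutoff else "low willingness"


points = scorecard_points({"borrowing": 0.7, "repayment": 0.2})
print(points, user_category(points))
```

In the example the high borrowing score adds points and the low repayment score removes more, so the user lands below the assumed cut-off.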
Based on the same technical concept, the embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 61, a communication interface 62, a memory 63, and a communication bus 64, where the processor 61, the communication interface 62, and the memory 63 complete mutual communication through the communication bus 64,
a memory 63 for storing a computer program;
the processor 61 is configured to implement the user classification method of the foregoing method embodiment when executing the program stored in the memory 63.
In the electronic device provided by the embodiment of the present invention, the processor executes the program stored in the memory and thereby: acquires a plurality of pieces of user behavior information corresponding to the user to be classified and a user characteristic variable corresponding to each piece of user behavior information; converts the user characteristic variables into first model-entering variables according to a preset conversion rule; inputs each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; bins the plurality of behavior characteristic scores and calculates a WOE value corresponding to each bin; and inputs the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
The embodiment of the invention can evaluate the user behavior in different periods with smaller deviation, and effectively evaluate the repayment willingness of the user.
The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
In a further embodiment provided by the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, which computer program is executed by a processor to implement the user classification method steps of the aforementioned method embodiments.
The computer-readable storage medium provided by the embodiment of the present invention stores a computer program which, when executed by a processor: acquires a plurality of pieces of user behavior information corresponding to the user to be classified and a user characteristic variable corresponding to each piece of user behavior information; converts the user characteristic variables into first model-entering variables according to a preset conversion rule; inputs each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; bins the plurality of behavior characteristic scores and calculates a WOE value corresponding to each bin; and inputs the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
According to the embodiment of the invention, each first model-entering variable is input into a corresponding sub-model in a behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores; the plurality of behavior characteristic scores are binned, and a WOE value corresponding to each bin is calculated; and the WOE value corresponding to each bin is input into a scoring card model to obtain the user category of the user to be classified. Because the behavior classification model scores the user through different sub-models, the output behavior characteristic scores have a relatively fine granularity, the scoring card model evaluates the user's behavior in different periods with a small deviation, and the repayment willingness of the user is effectively evaluated.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for classifying a user, the method comprising:
acquiring a plurality of user behavior information of a user to be classified, and determining a user characteristic variable based on the user behavior information;
converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
inputting each first model-entering variable into a corresponding sub-model in a behavior classification model according to the user behavior information, so that each sub-model scores the first model-entering variable to obtain a plurality of behavior characteristic scores;
binning the plurality of behavior characteristic scores, and calculating a WOE value corresponding to each bin;
and inputting the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
2. The method according to claim 1, wherein the converting each of the user characteristic variables into a first model-entering variable according to a preset conversion rule comprises:
calculating the IV value of each user characteristic variable, sorting the plurality of IV values, and selecting a plurality of user characteristic variables with the largest IV values as first input variables;
ranking the importance of each user characteristic variable by using a random forest model, and selecting a user characteristic variable different from the first input variable from a plurality of user characteristic variables with the highest importance as a second input variable;
ranking the importance of each user characteristic variable by using a lightGBM model, and selecting a user characteristic variable different from the second input variable from a plurality of user characteristic variables with the highest importance as a third input variable;
and de-duplicating the first input variable, the second input variable and the third input variable to obtain the first model-entering variable.
3. The method of claim 2, wherein said calculating the IV value for each of said user characteristic variables comprises:
binning each user characteristic variable to obtain a plurality of bins, wherein each bin comprises one or more user characteristic variables;
calculating the WOE value of the user characteristic variables in each bin;
and calculating the IV value of each user characteristic variable based on the WOE value of the user characteristic variables in each bin.
4. The method of claim 3, wherein said calculating an IV value for each of said user characteristic variables further comprises:
for each user characteristic variable, acquiring a first related characteristic variable associated with the user characteristic variable;
and combining the user characteristic variable and the first related characteristic variable to obtain a new user characteristic variable, and executing the step of binning the user characteristic variables to obtain a plurality of bins.
5. The method of claim 1, wherein prior to inputting the first model-entering variable into the behavior classification model, the method further comprises:
inputting the first model-entering variable into an anti-fraud model to obtain an anti-fraud intention value;
comparing the anti-fraud intention value with a preset fraud intention value;
and if the anti-fraud intention value is smaller than or equal to the preset fraud intention value, executing the step of inputting the first model-entering variable into the behavior classification model.
6. The method of claim 1, further comprising:
judging whether the user category of the user to be classified is in a preset user category set or not;
if the user category of the user to be classified is located in a preset user category set, outputting the user category of the user to be classified;
if the user category of the user to be classified is outside the preset user category set, inputting the user characteristic variables into an auxiliary model so that the auxiliary model bins the user characteristic variables;
judging whether the user characteristic variables in each bin lack corresponding variable values;
and if no user characteristic variable lacks a corresponding variable value, executing the step of converting each user characteristic variable into a first model-entering variable according to a preset conversion rule.
7. The method of claim 6, further comprising:
if some user characteristic variables lack corresponding variable values, collecting the missing variable values from each bin, establishing a bin containing the missing variable values, and calculating the WOE value of that bin;
if the WOE value of the bin containing the missing variable values is smaller than or equal to a preset target bin WOE value, assigning a preset value to the missing variable values, and executing the step of converting each user characteristic variable into a first model-entering variable according to a preset conversion rule;
if the WOE value of the bin containing the missing variable values is larger than the preset target bin WOE value, acquiring a second related characteristic variable associated with each user characteristic variable that lacks a corresponding variable value, binning the second related characteristic variables to obtain a plurality of bins, each bin comprising the second related characteristic variables of one or more user characteristic variables, and executing the step of judging whether the user characteristic variables in each bin lack corresponding variable values.
8. An apparatus for classifying a user, the apparatus comprising:
an acquisition module, used for acquiring a plurality of pieces of user behavior information corresponding to a user to be classified and a user characteristic variable corresponding to each piece of user behavior information;
a conversion module, used for converting the user characteristic variable into a first model-entering variable according to a preset conversion rule;
a first model module, used for inputting each first model-entering variable into a corresponding sub-model in the behavior classification model according to the user behavior corresponding to the user behavior information, so that each sub-model scores the input first model-entering variable to obtain a plurality of behavior characteristic scores;
a computing module, used for binning the plurality of behavior characteristic scores and computing a WOE value corresponding to each bin;
and a second model module, used for inputting the WOE value corresponding to each bin into a scoring card model to obtain the user category of the user to be classified.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202110444073.XA 2021-04-23 2021-04-23 User classification method, device, electronic equipment and storage medium Active CN113177585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110444073.XA CN113177585B (en) 2021-04-23 2021-04-23 User classification method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113177585A true CN113177585A (en) 2021-07-27
CN113177585B CN113177585B (en) 2024-04-05

Family

ID=76924787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110444073.XA Active CN113177585B (en) 2021-04-23 2021-04-23 User classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177585B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779275A (en) * 2021-09-18 2021-12-10 中国平安人寿保险股份有限公司 Feature extraction method, device and equipment based on medical data and storage medium
CN114240215A (en) * 2021-12-22 2022-03-25 中国建设银行股份有限公司 User loss of contact grade acquisition method and device and storage medium
CN115082079A (en) * 2022-08-22 2022-09-20 深圳市华付信息技术有限公司 Method and device for identifying associated user, computer equipment and storage medium
JP7479558B1 (en) 2023-10-31 2024-05-08 株式会社コロプラ Program and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345368A (en) * 2018-08-22 2019-02-15 中国平安人寿保险股份有限公司 Credit estimation method, device, electronic equipment and storage medium based on big data
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN110009479A (en) * 2019-03-01 2019-07-12 百融金融信息服务股份有限公司 Credit assessment method and device, storage medium, computer equipment
CN110046783A (en) * 2018-12-13 2019-07-23 阿里巴巴集团控股有限公司 Falsely use account recognition methods, device, electronic equipment and storage medium
CN110196797A (en) * 2019-06-06 2019-09-03 苏宁消费金融有限公司 Automatic optimization method and system suitable for credit scoring card system
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data
CN111311402A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 XGboost-based internet financial wind control model
CN111738331A (en) * 2020-06-19 2020-10-02 北京同邦卓益科技有限公司 User classification method and device, computer-readable storage medium and electronic device
WO2020257782A1 (en) * 2019-06-21 2020-12-24 Inspectorio Inc. Factory risk estimation using historical inspection data
CN112200659A (en) * 2020-09-28 2021-01-08 深圳索信达数据技术有限公司 Method and device for establishing wind control model and storage medium
CN112529477A (en) * 2020-12-29 2021-03-19 平安普惠企业管理有限公司 Credit evaluation variable screening method, device, computer equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KEQIN CHEN等: "Credit Fraud Detection Based on Hybrid Credit Scoring Model", 《PROCEDIA COMPUTER SCIENCE》, pages 2 - 8 *
周胜利;金苍宏;吴礼发;洪征;: "基于评分卡――随机森林的云计算用户公共安全信誉模型研究", 通信学报, no. 05, pages 413 - 152 *
黎玉华;: "信用评分卡模型的建立", 科技信息, no. 13, pages 464 - 465 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779275A (en) * 2021-09-18 2021-12-10 中国平安人寿保险股份有限公司 Feature extraction method, device and equipment based on medical data and storage medium
CN113779275B (en) * 2021-09-18 2024-02-09 中国平安人寿保险股份有限公司 Feature extraction method, device, equipment and storage medium based on medical data
CN114240215A (en) * 2021-12-22 2022-03-25 中国建设银行股份有限公司 User loss of contact grade acquisition method and device and storage medium
CN115082079A (en) * 2022-08-22 2022-09-20 深圳市华付信息技术有限公司 Method and device for identifying associated user, computer equipment and storage medium
JP7479558B1 (en) 2023-10-31 2024-05-08 株式会社コロプラ Program and system

Also Published As

Publication number Publication date
CN113177585B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN113177585B (en) User classification method, device, electronic equipment and storage medium
CN110009479B (en) Credit evaluation method and device, storage medium and computer equipment
CN109035003A (en) Anti-fraud model building method and anti-fraud monitoring method based on machine learning
CN107194803A (en) Device for assessing the credit risk of P2P online lending borrowers
CN108665366A (en) Method for determining consumer risk grade, terminal device and computer-readable storage medium
KR20180041174A (en) Risk Assessment Methods and Systems
CN105354210A (en) Mobile game payment account behavior data processing method and apparatus
CN109816509A (en) Scorecard model generation method, terminal device and medium
CN110689437A (en) Communication construction project financial risk prediction method based on random forest
CN104463673A (en) P2P network credit risk assessment model based on support vector machine
CN113327164A (en) Risk control method and device for futures trading and computer equipment
CN114139931A (en) Enterprise data evaluation method and device, computer equipment and storage medium
CN113919432A (en) Classification model construction method, data classification method and device
CN109191185A (en) Customer group classification method and system
Kun et al. Default identification of p2p lending based on stacking ensemble learning
CN115860924A (en) Supply chain financial credit risk early warning method and related equipment
CN114757397A (en) Bad material prediction method, bad material prediction device and electronic equipment
CN115170295A (en) Enterprise credit risk assessment processing method and device
CN114626940A (en) Data analysis method and device and electronic equipment
CN115860889A (en) Financial loan big data management method and system based on artificial intelligence
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN112529303A (en) Risk prediction method, device, equipment and storage medium based on fuzzy decision
CN113034264A (en) Method and device for establishing customer loss early warning model, terminal equipment and medium
CN111768290A (en) Method and device for determining risk weight coefficient of service
CN109472704A (en) Neural network-based fund product screening method, terminal device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant