CN116862078B

CN116862078B - Method, system, device and medium for predicting overdue of battery-change package user

Info

Publication number: CN116862078B
Application number: CN202311126893.XA
Authority: CN
Inventors: 李朝; 黄家明; 肖劼; 胡始昌; 杨建燮; 杨斌
Original assignee: Hangzhou Yugu Technology Co ltd
Current assignee: Hangzhou Yugu Technology Co ltd
Priority date: 2023-09-04
Filing date: 2023-09-04
Publication date: 2023-12-12
Anticipated expiration: 2043-09-04
Also published as: CN116862078A

Abstract

The application discloses a method, a system, a device and a medium for predicting overdue of a battery-change package user. The method comprises the following steps: acquiring a historical sample set of a user; clustering the historical sample set to determine an initial cluster set, and up-sampling each initial cluster in the initial cluster set to determine a new sample set; training a pre-constructed fusion model on a data set obtained by combining the new sample set and the historical sample set to obtain a prediction model; and using the prediction model to perform overdue prediction on the user data to be predicted, obtaining the overdue result of the battery-change package user. By expanding the random number range, the application increases the diversity of the sample set, avoids the repeated data caused by excessively concentrated samples, and reduces the number of noise samples synthesized at the boundary. Meanwhile, the fusion model structure lets different classification models learn different information, which improves the accuracy and stability of prediction.

Description

Method, system, device and medium for predicting overdue of battery-change package user

Technical Field

The application relates to the technical field of big data processing, in particular to a method, a system, a device and a medium for predicting overdue of a battery-change package user.

Background

At present, two methods are available for predicting the long-term overdue problem that arises when a user fails to return the battery or renew for a long time after the purchased battery-change package expires. One is a rule-based approach that relies primarily on the user's behavior and credit information, such as the number of days without renewal and the Sesame Credit score. However, this method is only suitable for simple scenarios and has a limited prediction effect in complex scenarios. The other approach, based on supervised model classification, makes predictions with models trained by machine learning. It can judge from the behavior data of a user whether the user will become overdue and give the overdue probability. However, this approach suffers from data imbalance, because the proportion of overdue users tends to be low, which skews the distribution of the data.

To address this problem, a common approach is to use the SMOTE algorithm to handle the data imbalance, but the algorithm may amplify noise in the data. In addition, a single machine learning model suffers from low prediction accuracy and stability in actual use.

Therefore, there is a need to reduce the data imbalance problem and improve the accuracy and stability of the predictive model.

Disclosure of Invention

The application aims to provide a method, a system, a device and a medium for predicting overdue of a battery-change package user, which at least solve the problems of unbalanced data and insufficient single model prediction precision in the related technology.

The first aspect of the present application provides a method for predicting overdue of a battery-change package user, the method comprising:

acquiring a historical sample set of a user, wherein the historical sample set comprises behavior data, consumption data and credit data of the user;

clustering is carried out on the basis of the historical sample set to determine an initial cluster set, and up-sampling is carried out on each initial cluster in the initial cluster set to determine a new sample set;

training a fusion model constructed in advance based on a data set obtained by combining a new sample set and a historical sample set to obtain a prediction model;

and according to the prediction model, performing user overdue prediction on the user data to be predicted to obtain overdue results of the battery-change package users.

In one embodiment, determining an initial set of clusters based on clustering of historical sample sets, upsampling each initial cluster in the initial set of clusters, determining a new sample set, includes:

clustering is carried out according to the historical sample set to obtain an initial cluster set, wherein each initial cluster in the initial cluster set comprises an initial minority sample set and an initial majority sample set;

Determining an imbalance rate of each initial cluster in the initial cluster set based on the initial minority class sample set and the initial majority class sample set;

screening the initial cluster set according to the unbalance rate and a preset threshold value interval to determine a target cluster set;

a new sample set is determined based on the center point samples and other samples of the target minority class sample set in the target cluster set.

In one embodiment, determining a new sample set based on the center point samples and other samples of the target minority class sample set in the target cluster set includes:

determining the sampling weight of each target cluster in the target cluster set based on the average distance between samples in the target minority class sample set;

determining the target number of new samples in the corresponding target cluster according to the sampling weight;

and generating new samples of the target quantity in each target cluster according to the center point samples and other samples by using a preset difference model, and obtaining a new sample set.

In one embodiment, training a pre-built fusion model based on a data set obtained by combining a new sample set and a historical sample set to obtain a prediction model comprises:

training a machine learning model in the fusion model based on the data set, and determining a target optimal model;

Determining a new data set by adopting five-fold cross validation on an optimal model based on the data set;

training the logistic regression model in the fusion model based on the new data set until a preset condition is met, and obtaining a prediction model.

In one embodiment, training a machine learning model in a fusion model based on a dataset, determining a target best model, includes:

training a machine learning model according to a training set in the data set, evaluating by adopting a verification set in the data set, and determining an optimal model according to an evaluation result;

and carrying out optimization treatment on the optimal model based on a Bayesian optimization algorithm to obtain a target optimal model.

In one embodiment, the new data set includes a new training set, a new validation set, and a new test set; training the logistic regression model in the fusion model based on the new data set until a preset condition is met, so as to obtain a prediction model, wherein the method comprises the following steps:

and training a logistic regression model according to the new training set, adopting a new test set to evaluate, and adopting a new verification set to adjust model parameters according to an evaluation result until a preset condition is met, so as to obtain a prediction model.

In one embodiment, the behavioral data sample includes at least one of a distance travelled, a number of battery swaps, and a time interval since the last battery swap;

The consumption data sample comprises at least one of the amount and the number of days of the last battery-change package purchased before the expiration of the user and whether the coupon is used by purchasing the battery-change package;

the credit data sample includes at least one of a user's eligibility for a mortgage, sesame credit, historical overdue conditions.

A second aspect of the present application provides a prediction system for overdue battery-change package users, the prediction system comprising:

a historical sample set acquisition module, configured to acquire a historical sample set of a user, wherein the historical sample set comprises behavior data, consumption data and credit data of the user;

the new sample set acquisition module is used for carrying out clustering processing on the basis of the historical sample set to determine an initial cluster set, carrying out up-sampling on each initial cluster in the initial cluster set, and determining a new sample set;

the prediction model acquisition module is used for training a fusion model constructed in advance based on a data set obtained by combining the new sample set and the historical sample set to obtain a prediction model;

and the user overdue result acquisition module is used for carrying out user overdue prediction on the user data to be predicted according to the prediction model to obtain the overdue result of the battery-changing package user.

The third aspect of the present application provides a device for predicting the overdue of a battery-change package user, comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for implementing the method for predicting the overdue of the battery-change package user when executing the executable codes.

A fourth aspect of the present application provides a computer readable storage medium having stored thereon a program which, when executed by a processor, implements the method for predicting overdue of a battery-change package user of any one of the above.

The method, the system, the device and the medium for predicting overdue of the battery-change package user have at least the following technical effects.

The application can increase the diversity of the sample set by expanding the random number range, avoid repeated data caused by excessive concentration of the samples, effectively reduce the generation of the repeated data and reduce the number of the noise samples at the synthesis boundary. Meanwhile, different information is learned by using different classification models through a fusion model structure, so that the accuracy and stability of prediction are improved. The application expands the random number range and uses the softmax function particularly in the SMOTE algorithm, can better control the distribution of the synthesized samples, reduce the generation of repeated data, reduce the number of noise samples and improve the accuracy and the stability of a machine learning model.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic diagram showing how the SMOTE algorithm may generate noise samples in a majority class sample region;

FIG. 2 is a flowchart of a method for predicting overdue of a battery-change package according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of up-sampling to determine a new sample set according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of determining a new sample set according to an embodiment of the present application;

FIG. 5 is a schematic flow chart of determining a new sample set in step S103 according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of determining an optimal model of a target according to an embodiment of the present application;

FIG. 7 is a schematic flow chart of obtaining a prediction model according to an embodiment of the present application;

FIG. 8 is a flowchart illustrating another method for predicting overdue of a battery-change package according to an embodiment of the present application;

FIG. 9 is a block diagram of a system for predicting overdue of a battery-change package user according to an embodiment of the present application;

FIG. 10 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is possible for those of ordinary skill in the art to apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as indicating that this disclosure is insufficient.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in connection with the present application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; e.g., "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first," "second," "third," and the like, as used herein, merely distinguish between similar objects and do not represent a particular ordering of objects.

The current solutions to the overdue problem can be divided into two main categories: rule-based methods and supervised model classification-based methods.

Rule-based methods typically make a judgment from the user's behavior and credit, such as the number of days for which the user has not paid over a long period and the Sesame Credit score. However, rule-based methods are not effective for overdue prediction in complex scenarios.

The method based on supervised model classification uses machine learning to train a model on training data labeled as overdue or not overdue, finally obtaining a classification model that can automatically detect whether a user will become overdue. Given the behavior data of a user as input, the model returns a label indicating whether the user will become overdue, together with the probability of long-term overdue. However, because the proportion of overdue users is extremely low in practice, the predictive performance of the model is severely affected by data imbalance. The algorithm commonly used to address data imbalance is SMOTE, but it runs the risk of amplifying data noise. In addition, a single machine learning model suffers from insufficient prediction accuracy and stability in actual use.

The SMOTE algorithm is currently the most widely used up-sampling algorithm in both academia and industry. However, a major problem of the SMOTE algorithm is that it may further amplify noise in the data. FIG. 1 is a schematic diagram showing how the SMOTE algorithm may generate noise samples in the majority class sample region; as shown in FIG. 1, the three samples generated by interpolation are regarded as noise samples because the up-sampling takes place across different decision boundaries. In addition, the SMOTE algorithm may generate a large amount of repeated data, because random interpolation draws random numbers between 0 and 1, which increases the probability of repeated data. The more artificial data is generated, the more repeated data appears, and repeated data easily causes the model to over-fit.

In general, rule-based methods are applicable to simple scenarios, while supervised model classification-based methods are applicable to complex scenarios, but are all affected by data imbalance. Furthermore, SMOTE algorithms may amplify noise in the data. Therefore, there is a need to improve the shortcomings of SMOTE algorithms to reduce the effects of data imbalance.

Based on the above situation, the embodiment of the application provides a method, a system, a device and a medium for predicting overdue of a battery-change package user.

In a first aspect, an embodiment of the present application provides a method for predicting overdue of a battery-change package user, and fig. 2 is a flowchart of the method for predicting overdue of a battery-change package user provided in the embodiment of the present application, as shown in fig. 2, the method includes the following steps:

step S101, a history sample set of a user is obtained, wherein the history sample set comprises behavior data, consumption data and credit data of the user.

Step S102, clustering processing is carried out on the basis of the historical sample set to determine an initial cluster set, up-sampling is carried out on each initial cluster in the initial cluster set, and a new sample set is determined.

The embodiment of the application provides a center SMOTE improved by clustering. By clustering the historical sample set and sampling within each cluster, it reduces the sample points generated at decision boundaries and thereby their influence on the model. In addition, new sample points are generated by random interpolation between the minority class center point and minority class samples, which further reduces the sample points on decision boundaries. This effectively alleviates the sample imbalance problem and improves the performance of the model on minority class samples. Clustering and up-sampling make the training data more balanced, and generating new sample points by interpolation increases the number of minority class samples, improving the generalization capability and robustness of the model.

With continued reference to FIG. 1, if the data are clustered and up-sampling is carried out within each cluster, the three noise samples in the figure will not be generated, and the supervised learning model will not be affected by such noise data when fitting the data. The application mainly clusters the original data and then generates samples within each cluster, so as to avoid generating new noise samples.

Fig. 3 is a schematic flow chart of up-sampling to determine a new sample set according to an embodiment of the present application, as shown in fig. 3, based on the flow chart shown in fig. 2, step S102 includes the following steps:

step S201, clustering is carried out according to the historical sample set to obtain an initial cluster set, wherein each initial cluster in the initial cluster set comprises an initial minority sample set and an initial majority sample set.

Clustering of the historical sample set is typically implemented with a clustering algorithm such as K-means or DBSCAN. The clustering process divides the historical sample set into a plurality of clusters $c_1, c_2, \dots, c_k$, so that minority class samples and majority class samples can be better distinguished. This helps to address the sample imbalance problem and improves the performance of the model on minority class samples. In addition, aggregating similar samples into the initial cluster set better captures the internal structure and characteristics of the data, reduces mixing between categories, and provides a more accurate basis for generating new sample points in the subsequent sampling and interpolation.

Step S202, determining the unbalance rate of each initial cluster in the initial cluster set based on the initial minority class sample set and the initial majority class sample set.

Calculate the imbalance rate $ir_i$ of each initial cluster $c_i$, where $i = 1, 2, \dots, k$, $\text{majority\_count}(c_i)$ represents the number of majority class samples in the initial cluster, and $\text{minority\_count}(c_i)$ represents the number of minority class samples in the initial cluster. The imbalance rate describes the proportion between the number of minority class samples and the number of majority class samples. Calculating the imbalance rate of each initial cluster in the initial cluster set makes the degree of imbalance of the sample set directly visible. The imbalance rate serves as the basis for adjusting the sampling and interpolation proportion. This further optimizes the balance of the sample set, improves the performance of the model on the minority class, and better suits the data distribution and characteristics of the overdue prediction scenario for battery-change package users.

Step S203, the initial cluster set is screened according to the unbalance rate and a preset threshold value interval, and a target cluster set is determined.

All data are first clustered and the imbalance rate of each cluster is calculated; the sampling effect is then improved by removing clusters that do not need to be sampled. In the embodiment of the present application, step S203 uses two preset thresholds, irt1 and irt2. irt1 is used to reject clusters in which the number of minority class samples is not less than the number of majority class samples, because such clusters do not require sampling; irt2 is used to remove clusters with very few minority class samples, because sampling from such clusters can distort the generated samples. Both thresholds are values greater than zero. Finally, the target cluster set that remains after screening is retained.
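Steps S202 and S203 can be illustrated with a minimal Python sketch. It assumes that the imbalance rate is majority_count / minority_count, which fits the roles of the two thresholds but is not spelled out in this text. The function names, the label encoding (1 for the overdue minority class, 0 for the majority class) and the threshold values irt1 = 1.0 and irt2 = 20.0 are hypothetical and chosen only for illustration.

```python
from collections import Counter


def imbalance_rates(cluster_ids, labels):
    """Return {cluster_id: majority_count / minority_count} (assumed formula).

    Labels use 1 for the minority (overdue) class and 0 for the majority class.
    A cluster without any minority sample gets an infinite rate.
    """
    majority = Counter(c for c, y in zip(cluster_ids, labels) if y == 0)
    minority = Counter(c for c, y in zip(cluster_ids, labels) if y == 1)
    rates = {}
    for c in sorted(set(cluster_ids)):
        rates[c] = majority[c] / minority[c] if minority[c] else float("inf")
    return rates


def select_target_clusters(rates, irt1=1.0, irt2=20.0):
    """Keep clusters whose rate lies in the assumed interval (irt1, irt2]."""
    return [c for c, ir in rates.items() if irt1 < ir <= irt2]


# Toy data: cluster id and class label of ten users.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
rates = imbalance_rates(cluster_ids, labels)
targets = select_target_clusters(rates)
```

In this toy data, cluster 1 is rejected because its minority samples outnumber its majority samples, and cluster 2 is rejected because it contains no minority samples, so only cluster 0 is kept for up-sampling.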

Step S204, a new sample set is determined based on the center point samples and other samples of the target minority sample set in the target cluster set.

The target minority sample set comprises a center point sample, and the rest samples are other samples.

In the embodiment of the application, the average intra-cluster distance of the minority class sample set in each retained cluster is calculated, and the corresponding sampling weight is determined from this distance. The algorithm then determines the number of new samples to be generated for each cluster based on the total number of samples N to be generated, and finally generates the corresponding number of samples.

This improved method handles the sampling problem of unbalanced data sets better. By eliminating unnecessary clusters and determining the sampling weight from the average intra-cluster distance, the algorithm can generate synthetic samples that better match the distribution of the actual minority class samples, thereby improving how well the model learns the minority class.

Fig. 4 is a schematic flow chart of determining a new sample set according to an embodiment of the present application, as shown in fig. 4, on the basis of the flow chart shown in fig. 3, step S204 includes the following steps:

step 301, determining a sampling weight of each target cluster in the target cluster set based on an average distance between samples in the target minority sample set.

Calculate, for each target cluster $j$, the average distance between the samples in its target minority class sample set, where $j$ is the index of a cluster retained after screening, $j = 1, 2, \dots$. The Euclidean distance, the Manhattan distance, or another distance measure may be used. Then calculate the sampling weight $w_j$ of each target cluster in the target cluster set. Because the weight is calculated from the average distance between samples in the target minority class sample set, the number of generated samples can be adjusted to the characteristics of each target cluster. The distribution of the synthesized samples can thus be controlled more accurately, so that it better reflects the characteristics of the real data and improves the generalization capability of the model on minority class samples.

Step S302, determining the target number of new samples in the corresponding target cluster according to the sampling weight.

Determine the number of new samples in each target cluster as $n_j = w_j \cdot (\text{majority\_count}(c_j) - \text{minority\_count}(c_j))$, where $\text{majority\_count}(c_j)$ represents the number of majority class samples in target cluster $c_j$ and $\text{minority\_count}(c_j)$ represents the number of minority class samples in it. The target number of new samples to be generated in a target cluster is thus calculated from the sampling weight of the cluster and its numbers of majority class and minority class samples. Multiplying by the sampling weight weights the number of samples generated for different clusters, so that the sampling requirement of the minority class can be better met.
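Steps S301 and S302 can be sketched in Python as follows. The formula for the sampling weight is not reproduced in this text, so the sketch assumes a simple normalization in which the weight of a cluster is its mean pairwise Euclidean distance divided by the sum over all target clusters. That weighting, the rounding of the target number to an integer, and all function names are illustrative assumptions; only the formula for the target number follows the description.

```python
import math
from itertools import combinations


def mean_pairwise_distance(samples):
    """Average Euclidean distance over all pairs of minority samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def sampling_weights(minority_by_cluster):
    """Assumed weighting: each cluster's mean distance divided by the total."""
    dists = {j: mean_pairwise_distance(s) for j, s in minority_by_cluster.items()}
    total = sum(dists.values())
    if total == 0:
        # Degenerate case: fall back to equal weights.
        return {j: 1.0 / len(dists) for j in dists}
    return {j: d / total for j, d in dists.items()}


def target_counts(weights, majority_count, minority_count):
    """n_j = w_j * (majority_count(c_j) - minority_count(c_j)), rounded."""
    return {j: round(w * (majority_count[j] - minority_count[j]))
            for j, w in weights.items()}


# Minority class samples of two target clusters (toy values).
minority_by_cluster = {
    0: [(0.0, 0.0), (3.0, 4.0)],   # mean distance 5.0
    1: [(0.0, 0.0), (0.0, 15.0)],  # mean distance 15.0
}
weights = sampling_weights(minority_by_cluster)
counts = target_counts(weights, {0: 42, 1: 22}, {0: 2, 1: 2})
```

With these toy values the weights are 0.25 and 0.75, so cluster 0 receives 10 new samples and cluster 1 receives 15.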

Step S303, generating new samples of target quantity in each target cluster according to the center point samples and other samples by using a preset difference model, and obtaining a new sample set.

The new sample is generated as $x_{new} = x + \mathrm{softmax}(t) \cdot (x_j - x)$, where $t$ is a random number between -10 and 10, $x_j$ is the center point of the minority class samples in target cluster $j$, and $x$ represents the selected nearest-neighbor minority class sample.

It should be noted that the selected nearest-neighbor minority class sample is obtained as follows. Calculate the center point $x_j$ of the minority class samples in the target cluster; an average or a weighted average may be used as the center point $x_j$. Select the $k$ minority class samples nearest to the center point $x_j$, where $k$ is a hyperparameter, preferably $k = 5$. Interpolate between the center point $x_j$ of the minority class samples in the target cluster and a selected nearest-neighbor minority class sample $x$ to generate a new sample. Because random interpolation is performed between the minority class center point and minority class samples, the number of noise samples synthesized at the boundary is reduced.
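The interpolation of step S303 can be illustrated by the following Python sketch. The description does not state over which values the softmax is normalized, so the sketch assumes one random number per feature, drawn uniformly from [-10, 10], with the softmax taken across the features; each feature then moves from the selected sample toward the center point by a coefficient between 0 and 1. The unweighted mean as center point, the uniform choice among the k nearest samples, and the function names are likewise assumptions made for illustration.

```python
import math
import random


def center_point(samples):
    """Unweighted mean of the minority class samples in one target cluster."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def generate_samples(minority, n_new, k=5, rng=None):
    """Interpolate between nearest-neighbor samples and the cluster center."""
    rng = rng or random.Random(0)
    center = center_point(minority)
    # The k minority class samples closest to the center point.
    neighbors = sorted(minority, key=lambda s: math.dist(s, center))[:k]
    new_samples = []
    for _ in range(n_new):
        x = rng.choice(neighbors)
        # Assumption: one random t per feature, drawn from [-10, 10].
        t = [rng.uniform(-10.0, 10.0) for _ in x]
        coef = softmax(t)
        # x_new = x + softmax(t) * (x_j - x), applied feature by feature.
        new_samples.append([xi + c * (cj - xi)
                            for xi, c, cj in zip(x, coef, center)])
    return new_samples


minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.5, 1.5), (9.0, 9.0)]
new_samples = generate_samples(minority, n_new=4)
```

Because every coefficient lies between 0 and 1, each generated coordinate falls between the selected sample and the center point. The outlying sample (9.0, 9.0) shifts the center but is not among the five nearest neighbors, so it is never used as an interpolation start.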

With continued reference to fig. 2, step S103 is performed after step S102, as follows.

And step S103, training a fusion model constructed in advance based on the data set obtained by combining the new sample set and the historical sample set to obtain a prediction model.

Fig. 5 is a schematic flow chart of determining a new sample set in step S103 according to an embodiment of the present application, as shown in fig. 5, on the basis of the flow shown in fig. 2, step S103 includes the following steps:

and step S401, training a machine learning model in the fusion model based on the data set, and determining a target optimal model.

In one embodiment, data set X includes a training set, a validation set, and a test set.

The data set is divided into three parts, a training set (60%), a validation set (20%) and a test set (20%), and common machine learning models are trained on it, including decision trees, random forests, the k-nearest neighbors algorithm (KNN), Gaussian naive Bayes, gradient boosting decision trees, logistic regression, XGBoost, LightGBM, etc.
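The 60% / 20% / 20% division can be sketched in a few lines of Python. The shuffling, the fixed seed and the function name are assumptions for illustration; the description does not say whether the split is random or stratified by class.

```python
import random


def split_dataset(rows, seed=0, train=0.6, valid=0.2):
    """Shuffle rows and split them into training, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


# 100 placeholder rows stand in for the combined data set.
train_set, valid_set, test_set = split_dataset(range(100))
```

For a data set with very few overdue users, a split that preserves the class proportion in every part may be preferable; that choice is left open here.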

Fig. 6 is a schematic flow chart of determining an optimal model of a target according to an embodiment of the present application, as shown in fig. 6, on the basis of the flow chart shown in fig. 5, step S401 includes the following steps:

step S501, training a machine learning model according to a training set in the data set, evaluating by adopting a verification set in the data set, and determining an optimal model according to an evaluation result.

The trained models are used to predict the validation set, and performance indexes of each model on the validation set, such as accuracy and F1 score, are calculated. According to these indexes, a model whose F1 score and recall are both greater than 0.6 is selected as the optimal model.
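The selection rule can be illustrated with a small Python sketch that computes recall and F1 score for the overdue class and keeps the models exceeding 0.6 on both. The model names, the toy predictions and the function names are hypothetical; in practice the predictions would come from the trained models applied to the validation set.

```python
def recall_and_f1(y_true, y_pred):
    """Recall and F1 score for the positive (overdue) class, label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1


def select_models(y_true, predictions, threshold=0.6):
    """Keep the models whose recall and F1 both exceed the threshold."""
    selected = []
    for name, y_pred in predictions.items():
        recall, f1 = recall_and_f1(y_true, y_pred)
        if recall > threshold and f1 > threshold:
            selected.append(name)
    return selected


y_valid = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 0, 0, 0, 0, 1],  # finds all three overdue users
    "model_b": [1, 0, 0, 0, 0, 0, 0, 0],  # finds one of three overdue users
}
best = select_models(y_valid, predictions)
```

Here only model_a passes: model_b has perfect precision, but its recall of about 0.33 and F1 score of 0.5 fall below the threshold.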

And step S502, optimizing the optimal model based on a Bayesian optimization algorithm to obtain a target optimal model.

The Bayesian optimization algorithm is used to search for optimal parameters. When optimizing the optimal model, the hyper-parameter range and search space to be optimized are first determined. An evaluation function is created that trains and evaluates the model under a given hyper-parameter setting; it should receive the hyper-parameters as input and return a performance metric of the model on the validation set, such as accuracy or F1 score. A Bayesian optimizer object is initialized using a library based on a Bayesian optimization algorithm (e.g., hyperopt or optuna), and the hyper-parameter space to be optimized is specified. In each iteration, the optimizer selects a new hyper-parameter combination based on the previous results and evaluates it. An appropriate number of iterations is set, as well as the number of hyper-parameter combinations evaluated per iteration. After the iterations finish, the hyper-parameter combination with the highest performance index is obtained from the optimizer and used as the hyper-parameter setting of the target optimal model. Finally, the model is retrained with the target optimal hyper-parameter combination and evaluated on the test set to obtain the final performance index. Bayesian optimization is a conventional optimization algorithm; the embodiment of the application uses it to optimize the optimal model and obtain the target optimal model. The specific optimization process is not particularly limited here, can be adjusted to the application scenario, and is not described further.

With continued reference to fig. 5, step S402 is performed after step S401, as follows.

Step S402, determining a new data set by adopting five-fold cross validation on the optimal model based on the data set.

The five-fold cross validation method comprises the following steps: the data set has been divided into a training set, a validation set and a test set, and each of them is divided into five equal parts. The optimal model is fitted to the training set in a five-fold cross-validation manner, and each fit generates a new group of training set, validation set and test set data. In each round of cross-validation, four parts are used as the training set and the remaining part is used as the validation set; the optimal model is fitted on the four training parts and predicts the held-out part. These steps are repeated five times, so that each part is used as the validation set exactly once. Finally, the training set, validation set and original test set data of each round of cross-validation are merged to obtain the complete new training set, validation set and test set data. Through this process, the data are fully used for model evaluation and selection, and a new data set is generated for training and testing the model, so that the performance of the model is evaluated more accurately.
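The fold rotation described above can be sketched as follows. This is a minimal illustration for the training set only, under stated assumptions: the folds are formed by striding over the sample indices, and the base model is a hypothetical one-feature stand-in (`fit_mean` and `predict_above`), not one of the models named in the application. Each sample is predicted exactly once by a model that did not see it during fitting, which gives one column of the new training set.

```python
from typing import Callable, List, Sequence


def five_fold_indices(n_samples: int, n_folds: int = 5) -> List[List[int]]:
    """Split indices 0..n_samples-1 into n_folds disjoint, nearly equal parts."""
    return [list(range(start, n_samples, n_folds)) for start in range(n_folds)]


def out_of_fold_predictions(X: Sequence[Sequence[float]], y: Sequence[int],
                            fit: Callable, predict: Callable,
                            n_folds: int = 5) -> List[float]:
    """Fit on four parts, predict the held-out part, and repeat for every part."""
    oof = [0.0] * len(X)
    for held_out in five_fold_indices(len(X), n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof


# Hypothetical base model: learn the mean of the single feature, predict 1 above it.
def fit_mean(X, y):
    return sum(row[0] for row in X) / len(X)


def predict_above(mean, row):
    return 1.0 if row[0] > mean else 0.0


X_train = [[float(i)] for i in range(10)]
y_train = [0] * 5 + [1] * 5
meta_feature = out_of_fold_predictions(X_train, y_train, fit_mean, predict_above)
print(meta_feature)
```

Repeating this for each optimized base model would give one such column per model, and these columns are the inputs of the logistic regression model trained afterwards.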

Step S403, training the logistic regression model in the fusion model based on the new data set until a preset condition is met, so as to obtain a prediction model.

In one embodiment, the new data set includes a new training set, a new validation set, and a new test set.

Fig. 7 is a schematic flow chart of obtaining a prediction model according to an embodiment of the present application, as shown in fig. 7, on the basis of the flow chart shown in fig. 5, step S403 includes the following steps:

Step S601, training a logistic regression model according to the new training set, evaluating it with the new test set, and adjusting model parameters with the new validation set according to the evaluation result until a preset condition is met, so as to obtain a prediction model.

The new training set is input into a logistic regression model, which is fitted to predict the class labels of overdue users. The formula used in the logistic regression model is:

$$P = \frac{1}{1 + e^{-(aX_1 + bX_2 + cX_3 + dX_4 + mX_5 + f)}}$$

where e is the base of the natural logarithm, a constant approximately equal to 2.71828182845904523536; $X_1$, $X_2$, $X_3$, $X_4$, $X_5$ are the output values of the target optimal models obtained in step S502; f is the bias value; and a, b, c, d, m are the coefficients fitted by the logistic regression model. Finally, the probability value is normalized and output through softmax. The threshold of the probability value can be set to 0.5: a user whose probability value is greater than or equal to 0.5 is marked as an overdue user (label 1), and a user whose probability value is less than 0.5 is marked as a non-overdue user (label 0).

The trained logistic regression model is used to predict the new test set, and the model is evaluated according to the prediction results and the real labels. According to the evaluation result, the new validation set is used to adjust the model parameters and the logistic regression parameters, such as the regularization coefficient, so as to obtain better performance indexes, and the logistic regression model is retrained with the optimal parameters selected on the validation set to obtain the prediction model. By fitting, predicting and tuning with a logistic regression model, a simple and efficient prediction model can be constructed quickly, and the model is continuously optimized according to the feedback of the data set, so that a more accurate overdue-user prediction result is provided.
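The scoring and thresholding step can be sketched as follows, assuming the standard logistic (sigmoid) form for the formula. The coefficient values, bias and base-model outputs below are illustrative numbers only, and the function names are hypothetical.

```python
import math
from typing import Sequence


def overdue_probability(x: Sequence[float], coefficients: Sequence[float], bias: float) -> float:
    """Sigmoid of the weighted sum a*X1 + b*X2 + c*X3 + d*X4 + m*X5 + f."""
    z = sum(w * xi for w, xi in zip(coefficients, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def overdue_label(probability: float, threshold: float = 0.5) -> int:
    """Label 1 (overdue) when the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


coefficients = [0.8, 0.5, 0.3, 0.2, 0.1]   # a, b, c, d, m (illustrative, not fitted)
bias = -1.0                                # f (illustrative)
base_outputs = [0.9, 0.8, 0.7, 0.9, 0.6]   # X1..X5 from five base models (illustrative)

p = overdue_probability(base_outputs, coefficients, bias)
print(round(p, 4), overdue_label(p))
```

In practice the coefficients and bias would come from fitting on the new training set rather than being written by hand.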

With continued reference to fig. 2, step S104 is performed after step S103, as follows.

Step S104, according to the prediction model, performing user overdue prediction on the user data to be predicted to obtain overdue results of the battery-change package users.

The user data to be predicted comprises behavior data, consumption data and credit data of the user. The behavior data comprises at least one of the riding distance, the number of battery changes and battery-change failures within a preset number of days; the consumption data comprises at least one of the amount and number of days of the last battery-change package purchased by the user before expiration, and whether a coupon was used when purchasing the battery-change package; the credit data comprises at least one of the user's deposit-exemption eligibility, Sesame Credit, and historical overdue conditions.

The user data to be predicted is input into the prediction model to obtain a label (0 or 1), which is the predicted overdue result of the battery-change package user.

Fig. 8 is a flowchart of another method for predicting overdue of a battery-change package according to an embodiment of the present application, as shown in fig. 8, including the following steps:

1. Data fetching and labeling: data extraction and label setting are carried out according to user behavior. Most users choose to renew the package or return the battery within three days after the package expires, whereas most users who have not returned the battery within thirty days after the package expires are unlikely to return the battery or renew on their own initiative. Therefore, the data of users who have neither renewed nor returned the battery three days after the package expires are extracted; users who have still not returned the battery or purchased a new package thirty days after the package expires are marked as overdue users, and the other users are marked as non-overdue users.
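The labeling rule can be sketched as a small function. The function name, the constant names and the exact boundary handling (action on day 3 counts as "within three days", action on day 30 counts as "not overdue") are assumptions made for illustration.

```python
from typing import Optional

GRACE_DAYS = 3     # users who renew or return within this window are not extracted
OVERDUE_DAYS = 30  # still unresolved after this many days: overdue


def label_user(days_until_action: Optional[int]) -> Optional[int]:
    """Label one user from the number of days between package expiry and renewal/return.

    days_until_action is None when the user has neither renewed nor returned the battery.
    Returns None if the user is not part of the extracted sample, 1 for overdue, 0 otherwise.
    """
    if days_until_action is not None and days_until_action <= GRACE_DAYS:
        return None  # acted within three days: not extracted
    if days_until_action is None or days_until_action > OVERDUE_DAYS:
        return 1     # no return or renewal within thirty days: overdue user
    return 0         # acted between day 4 and day 30: non-overdue user


print([label_user(d) for d in (2, 10, 45, None)])
```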

2. Data extraction: extracting behavior data of the user, including the following features: riding distance, number of battery changes, time interval since the last battery change, etc.

Extracting consumption data of the user, including the following features: the amount and number of days of the last package purchased by the user before expiration, and whether a coupon was used.

Extracting credit data of the user, including the following features: whether the user qualifies for deposit exemption, Sesame Credit score, historical overdue conditions, etc.
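One way to hold the three feature groups is a flat record that converts to a numeric vector, sketched below. All field names, types and the example values are hypothetical; the application lists the features but does not prescribe a data layout.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass
class UserSample:
    # behavior data
    riding_distance: float
    battery_change_count: int
    days_since_last_change: int
    # consumption data
    last_package_amount: float
    last_package_days: int
    used_coupon: bool
    # credit data
    deposit_exempt: bool
    sesame_credit_score: int
    past_overdue_count: int

    def to_vector(self) -> List[float]:
        """Flatten the record into floats (booleans become 0.0 or 1.0)."""
        return [float(value) for value in astuple(self)]


sample = UserSample(120.5, 14, 3, 99.0, 30, True, False, 680, 1)
print(sample.to_vector())
```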

3. Upsampling the data: first, the user data are clustered to find the clusters in the data. Upsampling is then performed within each cluster, which reduces the problems caused by data imbalance and ensures that each cluster has enough samples for effective modeling and analysis, without undue reliance on certain clusters or samples. In addition, to further reduce the sample points generated on the decision boundary, new sample points are generated by interpolation: random interpolation is performed between the class center point of the minority class and the minority class samples. This increases the number of minority class samples and brings their distribution closer to that of the other classes, thereby reducing the impact on the model decision boundary. The cluster-center SMOTE algorithm comprises the following steps:

a) Cluster all user data samples to obtain clusters $c_1, c_2, \ldots, c_k$.

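b) Calculate the imbalance ratio of each cluster: $ir_i = \dfrac{\mathrm{majority\_count}(c_i)}{\mathrm{minority\_count}(c_i)}$, where $i = 1, 2, \ldots, k$, $\mathrm{majority\_count}(c_i)$ represents the number of majority class samples in cluster $c_i$, and $\mathrm{minority\_count}(c_i)$ represents the number of minority class samples in cluster $c_i$.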

c) Clusters where ir is not between the two thresholds are removed.

It should be noted that the cluster-based improved center SMOTE is an algorithm for addressing the problems of SMOTE. First, all data are clustered and the imbalance ratio of each cluster is calculated, and the sampling effect is improved by removing clusters that do not need to be sampled. The improved SMOTE algorithm has two imbalance thresholds, irt1 and irt2. irt1 is used to remove clusters in which the minority class samples are not fewer than the majority class samples, because such clusters do not require sampling; irt2 is used to remove clusters with very few minority class samples, which would distort the generated samples. Both thresholds are greater than zero. The target cluster set remaining after screening is retained. The intra-cluster average distance of the minority class samples in each retained cluster is then calculated, the corresponding sampling weight is determined from this distance, the number of samples to be generated in each cluster is determined according to the total number N of user data samples to be generated, and the corresponding samples are generated by the improved center SMOTE.

d) Calculate the average distance $distance_j$ between the minority class samples in each filtered cluster, where j is the index of the filtered cluster, j = 1, 2, ....

e) For the clusters remaining after the removal in step c), the sampling weight $w_j$ of each cluster is calculated from its average distance $distance_j$.

f) Determine the number of samples to be generated in each cluster: $n_j = w_j \cdot (\mathrm{majority\_count}(c_j) - \mathrm{minority\_count}(c_j))$.

g) Generate new samples: $x_{new} = x + \mathrm{softmax}(t) \cdot (x_j - x)$, where t is a random number between -10 and 10, $x_j$ is the center point of the minority class samples in cluster j, and x is the selected nearest-neighbor minority class sample.

The selected nearest-neighbor minority class sample x is obtained as follows: calculate the center point $x_j$ of the minority class samples in the target cluster, for which an average or a weighted average may be used; then select any one of the k nearest-neighbor samples of the center point $x_j$ as x, where k is a hyper-parameter, preferably 5. New samples are generated by interpolating between the center point $x_j$ and the selected nearest-neighbor minority class sample x.
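Steps b) to g) can be sketched together as follows. This is a minimal illustration under several assumptions that the text does not fix: the imbalance ratio is taken as majority count divided by minority count, the sampling weight is taken as each cluster's share of the summed average distances, softmax(t) for a scalar t is read as the two-class softmax over (t, 0), which equals the logistic sigmoid, and the per-cluster count is rounded to the nearest integer. The thresholds `irt1=1.0` and `irt2=10.0`, the helper names and the toy clusters are illustrative, and clustering itself (step a) is assumed to have been done already.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = List[float]
Cluster = Tuple[List[Point], List[Point]]  # (majority samples, minority samples)


def centroid(points: Sequence[Point]) -> Point:
    """Plain (unweighted) mean of the points."""
    return [sum(col) / len(points) for col in zip(*points)]


def mean_pairwise_distance(points: Sequence[Point]) -> float:
    """Average Euclidean distance over all pairs; 0.0 with fewer than two points."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0


def squash(t: float) -> float:
    """Assumed reading of softmax(t) for scalar t: two-class softmax over (t, 0)."""
    return 1.0 / (1.0 + math.exp(-t))


def cluster_center_smote(clusters: Sequence[Cluster], irt1: float = 1.0, irt2: float = 10.0,
                         k: int = 5, rng: Optional[random.Random] = None) -> List[Point]:
    """Return synthetic minority samples for the clusters that pass the ratio filter."""
    rng = rng or random.Random(0)
    kept = []
    for majority, minority in clusters:
        if not minority:
            continue
        ir = len(majority) / len(minority)  # assumed direction: majority / minority
        if irt1 < ir < irt2:                # step c): keep clusters between the thresholds
            kept.append((majority, minority))
    distances = [mean_pairwise_distance(minority) for _, minority in kept]
    total = sum(distances)
    new_samples: List[Point] = []
    for (majority, minority), d in zip(kept, distances):
        w = d / total if total > 0 else 1.0 / len(kept)     # assumed weight formula
        n_new = round(w * (len(majority) - len(minority)))  # step f)
        center = centroid(minority)
        neighbours = sorted(minority, key=lambda p: math.dist(p, center))[:k]
        for _ in range(n_new):                              # step g)
            x = rng.choice(neighbours)
            s = squash(rng.uniform(-10.0, 10.0))
            new_samples.append([xi + s * (ci - xi) for xi, ci in zip(x, center)])
    return new_samples


clusters = [
    ([[float(i), 5.0] for i in range(6)], [[0.0, 0.0], [2.0, 2.0]]),  # ir = 3: kept
    ([[9.0, 9.0]], [[8.0, 8.0], [8.5, 8.5], [9.5, 9.5]]),             # ir < 1: removed
    ([[float(i), 1.0] for i in range(30)], [[3.0, 3.0]]),             # ir = 30: removed
]
synthetic = cluster_center_smote(clusters)
print(len(synthetic))
```

Because every synthetic point lies on the segment between a minority sample and the minority center of its own cluster, it stays inside the minority region rather than drifting toward the majority class.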

4. Using a fusion model: the use of fusion models can help to improve the accuracy and stability of predictions. The specific flow is as follows:

a) The dataset was divided into three parts, training set (60%), validation set (20%) and test set (20%), respectively.

b) Train common machine learning models on the training set data, including decision trees, random forests, k-nearest neighbors (KNN), Gaussian naive Bayes, gradient boosting decision trees, logistic regression, XGBoost, LightGBM and the like. According to the prediction results of the models on the validation set, select the top q base models with the best overall performance; q is not limited to a particular value and is 5 in this task.

c) Optimize the retained models using a Bayesian optimization algorithm to obtain optimized models.

d) Divide each of the training set, validation set and test set into five equal parts, and fit the optimized models to the training set by five-fold cross-validation. Each fit generates one group of training set, validation set and test set data; after the five folds of training and prediction, complete new training set, validation set and test set data are obtained.

e) In order to avoid over-fitting, the new data set is fitted with a simple model, logistic regression; the parameters are adjusted according to the validation set, and finally the trained logistic regression is used to predict overdue users. The formula used in the logistic regression model is:

$$P = \frac{1}{1 + e^{-(aX_1 + bX_2 + cX_3 + dX_4 + mX_5 + f)}}$$

where e is the base of the natural logarithm, a constant approximately equal to 2.71828182845904523536; $X_1$, $X_2$, $X_3$, $X_4$, $X_5$ are the output values of the optimized models; f is the bias value; and a, b, c, d, m are the coefficients fitted by the logistic regression model. Finally, the probability value is normalized and output through softmax. The threshold of the probability value can be set to 0.5: a user whose probability value is greater than or equal to 0.5 is marked as an overdue user (label 1), and a user whose probability value is less than 0.5 is marked as a non-overdue user (label 0).
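Step a) of the flow above, the 60% / 20% / 20% division, can be sketched as follows. Shuffling before the split and the fixed seed are assumptions for the illustration; the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10  # integer arithmetic avoids float rounding surprises
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


train_set, validation_set, test_set = split_60_20_20(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

For an imbalanced label such as overdue status, a stratified split that keeps the overdue share equal across the three parts may be preferable; the application does not state which variant is used.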

In summary, the method for predicting overdue of a battery-change package user provided by the embodiment of the application increases the diversity of the sample set by expanding the random number range, avoids the repeated data caused by excessive concentration of the samples, and reduces the number of noise samples synthesized at the boundary. Meanwhile, through the fusion model structure, different classification models learn different information, which improves the accuracy and stability of prediction. In particular, by expanding the random number range and using the softmax function in the SMOTE algorithm, the application can better control the distribution of the synthesized samples, reduce the generation of repeated data, reduce the number of noise samples, and improve the accuracy and stability of the machine learning model.

It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.

In a second aspect, an embodiment of the present application provides a system for predicting overdue of a battery-change package user, where the system is used to implement the foregoing embodiments and preferred embodiments, and details thereof are not repeated. As used below, the terms "module," "unit," "sub-unit," and the like may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 9 is a block diagram of a system for predicting overdue of a battery-change package according to an embodiment of the present application, as shown in fig. 9, the system includes:

a historical sample set obtaining module 701, configured to obtain a historical sample set of a user, where the historical sample set includes behavior data, consumption data, and credit data of the user.

The new sample set obtaining module 702 is configured to determine an initial cluster set by performing clustering based on the historical sample set, and upsample each initial cluster in the initial cluster set to determine a new sample set.

The prediction model obtaining module 703 is configured to train a fusion model that is constructed in advance based on a data set obtained by combining the new sample set and the historical sample set, so as to obtain a prediction model.

And the user overdue result obtaining module 704 is configured to predict the user overdue for the user data to be predicted according to the prediction model, so as to obtain the overdue result of the battery-change package user.

The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
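As a rough software illustration of how the four modules could be wired together, the sketch below treats modules 701 to 703 as injected callables and module 704 as the method that runs them in order. The class name, the signatures and the stub functions are hypothetical and only show the data flow, not an actual implementation of any module.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class OverduePredictionSystem:
    acquire_history: Callable[[], List[Any]]                  # module 701
    build_new_samples: Callable[[List[Any]], List[Any]]       # module 702
    train_model: Callable[[List[Any]], Callable[[Any], int]]  # module 703

    def predict(self, users: List[Any]) -> List[int]:         # module 704
        history = self.acquire_history()
        new_samples = self.build_new_samples(history)
        model = self.train_model(history + new_samples)       # merged data set
        return [model(user) for user in users]


# Stub modules: each sample is a (feature, label) pair and the "model" is a threshold.
def train_threshold_model(samples):
    threshold = sum(feature for feature, _ in samples) / len(samples)
    return lambda feature: int(feature >= threshold)


system = OverduePredictionSystem(
    acquire_history=lambda: [(0.2, 0), (0.9, 1)],
    build_new_samples=lambda history: [(0.8, 1)],
    train_model=train_threshold_model,
)
results = system.predict([0.1, 0.95])
print(results)
```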

In a third aspect, an embodiment of the present application provides a device for predicting expiration of a battery-change package user, including a memory and one or more processors, where the memory stores executable code, and the one or more processors are configured to implement the steps in any one of the method embodiments described above when executing the executable code.

Optionally, the device for predicting overdue of the battery-change package user may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.

It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.

In addition, in combination with the method for predicting overdue of the battery-change package user in the above embodiments, an embodiment of the application may provide a storage medium for implementation. The storage medium stores a computer program; when executed by a processor, the computer program implements any one of the methods for predicting overdue of a battery-change package user in the above embodiments.

In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for predicting expiration of a battery change package user. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

In one embodiment, fig. 10 is a schematic diagram of the internal structure of an electronic device according to an embodiment of the present application. As shown in fig. 10, an electronic device is provided, which may be a server. The electronic device includes a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, computer programs, and a database. The processor provides computing and control capabilities; the network interface communicates with an external terminal through a network connection; the internal memory provides an environment for running the operating system and the computer program; the computer program, when executed by the processor, implements a method for predicting overdue of a battery-change package user; and the database stores data.

It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above-described embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.

The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for predicting overdue of a battery-change package user, the method comprising:

the behavior data sample comprises at least one of riding distance, power change times and time interval of last power change;

the credit data sample comprises at least one of the user's deposit-exemption eligibility, Sesame Credit, and historical overdue conditions;

performing clustering processing on the historical sample set to determine an initial cluster set, and performing up-sampling on each initial cluster in the initial cluster set to determine a new sample set;

the step of determining an initial cluster set based on the historical sample set by clustering, and performing up-sampling on each initial cluster in the initial cluster set to determine a new sample set includes:

clustering according to the history sample set to obtain an initial cluster set $c_1, c_2, \ldots, c_k$, wherein each of the initial clusters in the initial cluster set comprises an initial minority sample set and an initial majority sample set;

determining an imbalance rate for each initial cluster in the initial cluster set based on the initial minority sample set and the initial majority sample set;

screening the initial cluster set according to the unbalance rate and a preset threshold interval to determine a target cluster set;

determining a new sample set based on a center point sample and other samples of a target minority sample set in the target cluster set;

wherein the determining a new sample set based on the center point samples and other samples of the target minority sample set in the target cluster set includes:

determining the sampling weight of each target cluster in the target cluster set based on the average distance between samples in the target minority sample set, wherein the sampling weight $w_j$ of each target cluster is calculated from $distance_j$, $distance_j$ is the average distance between samples in the target minority class sample set, and j refers to the number of the filtered cluster, j = 1, 2, ...;

determining the target number of new samples in the corresponding target cluster according to the sampling weight, wherein the target number of the new samples is $n_j = w_j \cdot (\mathrm{majority\_count}(c_j) - \mathrm{minority\_count}(c_j))$, wherein $\mathrm{majority\_count}(c_j)$ represents the number of majority class samples in each target cluster, and $\mathrm{minority\_count}(c_j)$ represents the number of minority class samples in each target cluster;

generating new samples of the target number in each target cluster according to the center point samples and other samples by using a preset difference model to obtain a new sample set, wherein the preset difference model is $x_{new} = x + \mathrm{softmax}(t) \cdot (x_j - x)$, wherein $x_{new}$ is a new sample, t is a random number between -10 and 10, $x_j$ is the center point of the minority class samples in the target cluster j, and x represents a selected nearest neighbor minority class sample, which is acquired as follows: calculating the center point $x_j$ of the minority class samples in the target cluster, with the average or weighted average taken as the center point $x_j$, and selecting any one of the K nearest neighbor samples of the center point $x_j$, where K is a hyper-parameter;

training a fusion model constructed in advance based on a data set obtained by combining the new sample set and the historical sample set to obtain a prediction model;

2. The method for predicting overdue of a battery-change package user according to claim 1, wherein training a fusion model constructed in advance based on a data set obtained by combining the new sample set and the history sample set to obtain a prediction model comprises:

determining a new dataset by five-fold cross-validation on the optimal model based on the dataset;

training a logistic regression model in the fusion model based on the new data set until a preset condition is met, and obtaining the prediction model.

3. The method of claim 2, wherein training the machine learning model in the fusion model based on the dataset to determine a target optimal model comprises:

Training the machine learning model according to the training set in the data set, evaluating by adopting the verification set in the data set, and determining an optimal model according to an evaluation result;

and carrying out optimization treatment on the optimal model based on a Bayesian optimization algorithm to obtain the target optimal model.

4. The method of claim 2, wherein the new data set comprises a new training set, a new validation set, and a new test set; training the logistic regression model in the fusion model based on the new data set until a preset condition is met, so as to obtain the prediction model, wherein the training comprises the following steps:

and training the logistic regression model according to the new training set, adopting the new testing set for evaluation, and adopting the new verification set for model parameter adjustment according to an evaluation result until a preset condition is met, so as to obtain the prediction model.

5. A system for implementing the method of predicting the expiration of a battery change package user according to any one of claims 1-4, the prediction system comprising:

The new sample set acquisition module is used for determining an initial cluster set based on clustering processing of the historical sample set, up-sampling each initial cluster in the initial cluster set and determining a new sample set;

the prediction model acquisition module is used for training a fusion model constructed in advance based on the data set obtained by combining the new sample set and the historical sample set to obtain a prediction model;

6. A device for predicting the expiration of a battery-change package user, comprising a memory and one or more processors, wherein the memory has executable code stored therein, and wherein the one or more processors, when executing the executable code, are configured to implement the method for predicting the expiration of a battery-change package user of any one of claims 1-4.

7. A computer readable storage medium having stored thereon a program which, when executed by a processor, implements the method of predicting expiration of a battery change package user according to any one of claims 1-4.