WO2022126961A1 - Target object behavior prediction method for data offset and related equipment - Google Patents
Target object behavior prediction method for data offset and related equipment
- Publication number
- WO2022126961A1 (application PCT/CN2021/090162)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- feature variable
- variable set
- value
- training
- Prior art date
Classifications
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24323—Tree-organised classifiers
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/1053—Employment or hiring
- G06Q40/08—Insurance
Definitions
- the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus, computer equipment and storage medium for predicting the behavior of a target object for data offset.
- the distribution and prediction ability of the target object's feature variables will fluctuate over time.
- Such unstable feature variables increase the uncertainty of model predictions, resulting in increased prediction risk.
- the method of feature selection is used to eliminate unstable feature variables, or to perform information smoothing processing on the feature variables.
- Existing solutions lose diversity of information in the process of reducing model risk, resulting in reduced model prediction accuracy.
- the purpose of the embodiments of the present application is to propose a target object behavior prediction method, device, computer equipment and storage medium for data offset, so as to solve the problem in the prior art that the diversity of information is lost in the process of reducing model risk, which reduces the prediction accuracy of the model.
- the embodiment of the present application provides a target object behavior prediction method for data offset, which adopts the following technical solutions:
- a target object behavior prediction method for data offset comprising the following steps:
- Feature screening is performed on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set;
- the preset LightGBM tree model is trained according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and the preset LightGBM tree model is trained according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
- the embodiment of the present application also provides a target object behavior prediction device for data offset, which adopts the following technical solutions:
- a target object behavior prediction device for data offset comprising:
- a feature acquisition module for acquiring historical data related to the behavior of the target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables;
- a feature set generation module configured to perform feature screening on the preprocessed feature variables and generate a first feature variable set and a second feature variable set, wherein the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set;
- an assignment module configured to perform secondary screening on the second feature variable set to obtain a third feature variable set, and to assign weights to the feature variables in the third feature variable set and in the first feature variable set using a different assignment method for each set;
- a training module configured to train the preset LightGBM tree model according to the first feature variable set and the corresponding weights, obtaining a first training model and outputting a first training result, and to train the preset LightGBM tree model according to the first feature variable set, the third feature variable set and the corresponding weights, obtaining a second training model and outputting a second training result;
- a prediction module configured to output the second training model when the comparison result between the second training result and the first training result satisfies a preset condition, and to predict the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
- the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
- a computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor implements the following steps when executing the computer-readable instructions:
- Feature screening is performed on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set;
- the preset LightGBM tree model is trained according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and the preset LightGBM tree model is trained according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
- the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
- a computer-readable storage medium where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the processor is caused to perform the following steps:
- Feature screening is performed on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set;
- the preset LightGBM tree model is trained according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and the preset LightGBM tree model is trained according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
- the target object behavior prediction method, device, computer equipment, and storage medium for data offset mainly have the following beneficial effects:
- the feature variables and their weight assignments are then input into the LightGBM tree model for training.
- feature variables that exhibit data offset across time are entered into the model, which addresses the instability of the feature variables. Since the feature variables with data offset are retained, the richness of the feature variable set is guaranteed, and the prediction accuracy of the model is improved while the model risk is reduced.
- FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
- FIG. 2 is a flowchart of an embodiment of a method for predicting behavior of a target object for data offset according to the present application
- FIG. 3 is a schematic structural diagram of an embodiment of a target object behavior prediction apparatus for data offset according to the present application
- FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
- the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
- the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
- the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
- Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
- the terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, etc.
- the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
- the target object behavior prediction method for data offset provided by the embodiments of the present application is generally executed by a server, and accordingly, the target object behavior prediction device for data offset is generally set in the server.
- terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
- FIG. 2 shows a flowchart of one embodiment of a method for predicting behavior of a target object for data offset according to the present application.
- the described target object behavior prediction method for data offset includes the following steps:
- S202 Perform feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set;
- S203 Perform secondary screening on the second feature variable set to obtain a third feature variable set, and assign weights to the feature variables in the third feature variable set and in the first feature variable set using a different assignment method for each set;
- the target object is an object with a prediction need, specifically an object whose behavior is to be predicted, such as an insurance agent in an insurance agent recruitment scenario. When obtaining historical data related to the behavior of the target object, the attribute information of the target object and the associated information related to the target object are extracted from that historical data.
- the behavior prediction in this scheme can be used, in the insurance agent recruitment scenario, to predict the retention behavior of insurance agents after a certain period of time (such as 3 months), that is, whether the insurance agents will resign.
- when predicting the retention of insurance agents, the insurance agent is the target object. The attribute information includes the basic information of the insurance agent, such as gender, age and basic income, and the associated information related to the target object includes the agent's pre-job performance during recruitment (such as attendance, quiz scores and activity participation). Based on this information, feature variables of multiple dimensions related to the behavior of the target object can be extracted, giving the original feature variables used to predict the retention behavior of insurance agents.
- the second training model is trained on the training sample set of the third feature variable set. Each training sample is a labeled sample: the target variable of some training samples is "retention", and that of the others is "resignation". Once the second training model has been obtained by training, the prediction sample set (containing data of multiple target objects) corresponding to the feature variables in the combined first feature variable set and third feature variable set is input into it. The second training model outputs the probability that the behavior of each target object after a certain period of time is "retention" or "resignation", and the outcome with the larger probability is taken as the likely behavior of the target object, which completes the target object behavior prediction.
- the preprocessing of the feature variables includes sequentially performing data cleaning, variable binning and numerical encoding on the sample data of the feature variables. Specifically, after the feature variables are obtained, the distribution characteristics of their data are analyzed, including but not limited to data saturation, presence of outliers, maximum value, minimum value, mean value and distribution type; the data is then cleaned according to these characteristics, handling dirty data, missing values, outliers and the like. For example, when processing missing values, feature variables whose missing rate exceeds a preset threshold (set according to the situation, for example 50%, 70% or 90%) can be deleted, so that they are excluded from the model input features.
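The missing-value rule described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the table layout (a dict of columns with `None` marking a missing value), the function names and the 70% default are assumptions, with the default taken from the example thresholds in the text.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_rate(values: Column) -> float:
    """Share of samples in one feature column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_features(table: Dict[str, Column],
                         max_missing_rate: float = 0.7) -> Dict[str, Column]:
    """Keep only the feature variables whose missing rate does not exceed the threshold."""
    return {name: col for name, col in table.items()
            if missing_rate(col) <= max_missing_rate}


# Hypothetical sample data: "quiz" is missing for 3 of 4 agents (75%).
samples = {"age": [30, None, 41, 25], "quiz": [None, None, None, 80]}
kept = drop_sparse_features(samples, max_missing_rate=0.7)
```

With these made-up samples, only `age` remains as a model input feature.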
- the multiple sample values of each feature variable are binned and encoded.
- multiple sample values are binned by equal-frequency division to obtain several bins, and the WOE (weight of evidence) value is then calculated and used to encode each bin of the feature variable.
- the missing values of a continuous variable can be replaced by a certain maximum value.
- each sample value is a bin.
- samples with missing values form a bin of their own.
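Equal-frequency binning followed by WOE encoding can be sketched as below. The sketch assumes binary labels (1 for one outcome, 0 for the other), puts missing values in their own bin labelled `-1`, and adds a small smoothing constant `eps` so that empty cells do not produce a division by zero; the smoothing, the sign convention of WOE and all function names are assumptions, not details given in the source.

```python
import math


def equal_frequency_edges(values, n_bins):
    """Cut points chosen so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def assign_bin(value, edges):
    """Missing values (None) go to a separate bin, labelled -1."""
    if value is None:
        return -1
    return sum(value >= edge for edge in edges)


def woe_table(values, labels, n_bins=4, eps=0.5):
    """Return (edges, {bin: WOE}) with WOE = ln(positive share / negative share)."""
    edges = equal_frequency_edges([v for v in values if v is not None], n_bins)
    counts = {}
    for value, label in zip(values, labels):
        b = assign_bin(value, edges)
        pos, neg = counts.get(b, (0, 0))
        counts[b] = (pos + label, neg + (1 - label))
    k = len(counts)
    total_pos = sum(p for p, _ in counts.values()) + eps * k
    total_neg = sum(n for _, n in counts.values()) + eps * k
    woe = {b: math.log(((p + eps) / total_pos) / ((n + eps) / total_neg))
           for b, (p, n) in counts.items()}
    return edges, woe


# Hypothetical feature values with two missing samples.
values = [1, 2, 3, 4, 5, 6, 7, 8, None, None]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
edges, woe = woe_table(values, labels, n_bins=2)
```

Each sample is then encoded by replacing its raw value with the WOE of the bin it falls into.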
- the IV value (information value, or amount of information, used to evaluate the contribution of a feature variable to the model) and the PSI value (Population Stability Index, used to evaluate the stability of a feature variable) are used to screen and group the originally obtained feature variables: a first feature variable set with strong prediction ability, stable distribution across time and stable prediction ability across time is screened out, and a second feature variable set with strong prediction ability and stable distribution across time but unstable prediction ability is screened out. That is, the prediction stability across time of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set.
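The two indicators can be computed from per-bin counts with their commonly used definitions, sketched below. The source does not spell out the formulas, so the definitions used here (IV as the sum over bins of the difference in class shares times the log of their ratio, PSI as the same form applied to two sample distributions), the `eps` floor and the function names are assumptions.

```python
import math


def information_value(bin_counts, eps=1e-6):
    """IV from a list of (positives, negatives) counts, one pair per bin."""
    total_pos = sum(p for p, _ in bin_counts)
    total_neg = sum(n for _, n in bin_counts)
    iv = 0.0
    for pos, neg in bin_counts:
        pos_share = max(pos / total_pos, eps)
        neg_share = max(neg / total_neg, eps)
        iv += (pos_share - neg_share) * math.log(pos_share / neg_share)
    return iv


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI of the actual bin distribution relative to the expected one."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi
```

A feature whose bins separate the two classes well has a high IV, and a feature whose bin distribution barely moves between two periods has a PSI close to zero.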
- performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set includes: using the sample data of the feature variables in multiple time periods as a training sample set, and obtaining the sample data of the feature variables in the target time period as a prediction sample set; calculating the IV value and the PSI value of each feature variable based on the training sample set and the prediction sample set; screening out, from the original feature variable set, the feature variables whose IV value and PSI value satisfy the first threshold group to generate the first feature variable set; and screening out, from the remaining feature variables of the original feature variable set, the feature variables whose IV value and PSI value satisfy the second threshold group to generate the second feature variable set.
- the multiple time periods refer to multiple historical time periods; for example, taking the past six months with each month as one time period gives six historical time periods. The target time period refers to the time period to be predicted.
- the IV value includes the overall IV value, the monthly IV value and/or the monthly IV coefficient of variation value, and the PSI value includes the monthly PSI value and the prediction-training PSI value.
- the overall IV refers to the IV value of the 6-month samples taken as a whole, and the monthly IV is the IV value of each monthly sample. The monthly IV value evaluates the monthly predictive ability of the feature, and the monthly IV coefficient of variation value judges the stability of the predictive ability of each feature variable; unlike the monthly IV value, the overall IV evaluates the overall predictive ability of the feature.
- the monthly PSI value is the PSI value of each monthly sample set relative to the distribution of the previous month, and the prediction-training PSI value is the PSI value of the prediction sample set relative to the distribution of the training sample set.
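The per-feature statistics named above can be gathered in one pass, as in the sketch below. It assumes the monthly IV values are already computed, that each month's distribution is given as a list of bin counts, that the training distribution is the sum of the monthly counts, and that the coefficient of variation uses the population standard deviation; these choices and all names are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def psi(expected_counts, actual_counts, eps=1e-6):
    """PSI of the actual bin distribution relative to the expected one."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


def stability_summary(monthly_iv, monthly_bins, predict_bins):
    """Monthly IV mean and coefficient of variation, monthly PSI mean, prediction-training PSI."""
    iv_mean = mean(monthly_iv)
    # Each month is compared with the month before it.
    monthly_psi = [psi(prev, cur) for prev, cur in zip(monthly_bins, monthly_bins[1:])]
    # The training distribution pools all historical months.
    train_bins = [sum(column) for column in zip(*monthly_bins)]
    return {
        "monthly_iv_mean": iv_mean,
        "monthly_iv_cv": pstdev(monthly_iv) / iv_mean,
        "monthly_psi_mean": mean(monthly_psi),
        "predict_train_psi": psi(train_bins, predict_bins),
    }


# Hypothetical six-month history for one feature with two bins.
summary = stability_summary(
    monthly_iv=[0.2, 0.3, 0.2, 0.3, 0.2, 0.3],
    monthly_bins=[[50, 50]] * 6,
    predict_bins=[60, 40],
)
```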
- the first threshold group includes a first overall IV threshold, a first monthly IV mean threshold, a first monthly IV coefficient of variation threshold, a first monthly PSI mean threshold, and a first predictive training PSI threshold
- the second threshold group includes a second overall IV threshold, a second monthly IV mean threshold, a second monthly PSI mean threshold, and a second predictive training PSI threshold.
- some thresholds in the first threshold group and the second threshold group may be the same; for example, the first threshold group is (0.1, 0.1, 1, 0.25, 0.25), and the second threshold group is (0.5, 0.5, 0.25, 0.25).
- the first feature variable set S1 is obtained by selecting, from the original feature variable set, the feature variables that simultaneously satisfy the following: the overall IV value and the monthly IV mean value are greater than or equal to the limit value a (corresponding to the first overall IV threshold and the first monthly IV mean threshold); the monthly IV coefficient of variation value is less than or equal to the limit value b (corresponding to the first monthly IV coefficient of variation threshold); and the monthly PSI mean and the prediction-training PSI are less than or equal to the limit value c (corresponding to the first monthly PSI mean threshold and the first prediction-training PSI threshold).
- the second feature variable set S2 is obtained by selecting, from the complement of S1 in the original feature variable set, the feature variables that simultaneously satisfy the following: the overall IV value and the monthly IV mean value are greater than or equal to the limit value d (corresponding to the second overall IV threshold and the second monthly IV mean threshold); and the monthly PSI mean and the prediction-training PSI are less than or equal to the limit value c (corresponding to the second monthly PSI mean threshold and the second prediction-training PSI threshold).
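The two screening rules can be written as a single pass over per-feature statistics, as in this sketch. The defaults for a, b, c and d are the example threshold groups from the text; the `FeatureStats` record, its field names and the sample features are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureStats:
    overall_iv: float
    monthly_iv_mean: float
    monthly_iv_cv: float
    monthly_psi_mean: float
    predict_train_psi: float


def split_feature_sets(stats, a=0.1, b=1.0, c=0.25, d=0.5):
    """Return (S1, S2): stable predictors, and strong predictors with unstable predictive ability."""
    s1, s2 = [], []
    for name, f in stats.items():
        stable_distribution = f.monthly_psi_mean <= c and f.predict_train_psi <= c
        if (f.overall_iv >= a and f.monthly_iv_mean >= a
                and f.monthly_iv_cv <= b and stable_distribution):
            s1.append(name)
        elif f.overall_iv >= d and f.monthly_iv_mean >= d and stable_distribution:
            # Only features outside S1 reach this branch (complement of S1).
            s2.append(name)
    return s1, s2


# Hypothetical statistics for three features.
stats = {
    "attendance": FeatureStats(0.30, 0.28, 0.4, 0.05, 0.08),  # stable everywhere
    "quiz_score": FeatureStats(0.80, 0.70, 1.6, 0.10, 0.12),  # strong, but IV fluctuates
    "region":     FeatureStats(0.60, 0.55, 0.3, 0.40, 0.10),  # distribution drifts
}
s1, s2 = split_feature_sets(stats)
```

In this made-up example `attendance` lands in S1, `quiz_score` in S2, and `region` in neither because its monthly PSI mean exceeds c.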
- the weight in this step is the initial weight of attention learning. The purpose of performing secondary screening on the second feature variable set in this embodiment is to eliminate the feature variables that become unpredictable because of the time span.
- performing secondary screening on the second feature variable set to obtain a third feature variable set includes: performing curve fitting on the monthly IV values of each feature variable in the second feature variable set based on a plurality of fitting functions, generating multiple prediction ability fluctuation curves for each feature variable; taking each feature variable in turn as the current feature variable and comparing the fitting root mean square errors of its prediction ability fluctuation curves; determining whether the ratio of the smallest fitting root mean square error to the monthly IV mean value of the current feature variable is greater than a preset threshold; if it is greater, further determining whether the monthly IV values of the current feature variable are monotonic; and eliminating the current feature variable when they are not monotonic.
- in this way, feature variables that are not predictable can be eliminated, the stability of model prediction can be improved, and the prediction risk can be reduced.
- selecting the fitting curve with the smallest fitting root mean square error can ensure that the prediction has higher accuracy.
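The secondary screening can be sketched with a few least-squares fits in plain Python. The source speaks only of "a plurality of fitting functions", so the three bases used here (linear, logarithmic and square-root in the month index), the non-strict monotonicity test and all names are assumptions; 0.2 is the example value the text gives for the preset threshold.

```python
import math

# Hypothetical family of fitting functions: y = intercept + slope * basis(month).
BASES = {
    "linear": lambda t: float(t),
    "log": lambda t: math.log(t),
    "sqrt": lambda t: math.sqrt(t),
}


def fit_curve(monthly_iv, basis):
    """Least-squares fit of the monthly IV series; returns (intercept, slope, rmse)."""
    xs = [basis(m) for m in range(1, len(monthly_iv) + 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(monthly_iv) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_iv)) / sxx
    intercept = mean_y - slope * mean_x
    squared_error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, monthly_iv))
    return intercept, slope, math.sqrt(squared_error / n)


def is_monotonic(values):
    """True if the series never decreases or never increases (non-strict, an assumption)."""
    pairs = list(zip(values, values[1:]))
    return all(p <= q for p, q in pairs) or all(p >= q for p, q in pairs)


def secondary_screen(monthly_iv, threshold=0.2):
    """Return (keep, best_curve_name, rmse_to_mean_iv_ratio) for one feature variable."""
    fits = {name: fit_curve(monthly_iv, basis) for name, basis in BASES.items()}
    best = min(fits, key=lambda name: fits[name][2])
    ratio = fits[best][2] / (sum(monthly_iv) / len(monthly_iv))
    keep = ratio <= threshold or is_monotonic(monthly_iv)
    return keep, best, ratio
```

A feature is dropped only when even its best-fitting curve misses by more than the threshold and its monthly IV values also zigzag.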
- performing weight assignment on the feature variables in the third feature variable set and the first feature variable set using different assignment methods includes:
- assigning a preset fixed weight to each feature variable in the first feature variable set;
- for each feature variable in the third feature variable set whose ratio is not greater than the preset threshold, obtaining its IV value in the target time period from the prediction ability fluctuation curve with the smallest fitting root mean square error, and assigning a weight based on the obtained IV value and the overall IV value;
- for each feature variable whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtaining its IV value in the target time period from the IV values of the two time periods nearest to the target time period, and assigning a weight based on the obtained IV value and the overall IV value.
- the preset fixed weight in this embodiment is 1, and the preset threshold may be 0.2.
- the weight assignment based on the obtained IV value and the overall IV value specifically multiplies the ratio of the obtained IV value to the overall IV value by a weight coefficient, whose value range is 0 to 1 and whose initial value is 1.
- the fitting curve with the smallest fitting root mean square error (RMSE) is selected. When the ratio of its root mean square error to the monthly IV mean value is less than or equal to the limit value e (that is, the preset threshold), the IV value of the forecast month (that is, the target time period) is calculated from the selected fitting curve and denoted IV_te.
- if none of the fitting curves has a ratio of root mean square error to monthly IV mean value less than or equal to the limit value e, the absolute monotonicity of the monthly IV is judged, that is, whether IV_1, IV_2, IV_3, IV_4, IV_5, IV_6 are monotonically increasing or monotonically decreasing. If the monthly IV satisfies absolute monotonicity, the IV value of the forecast month IV_te is taken as the mean of IV_5 and IV_6; otherwise the feature variable X_1 is eliminated from the second feature variable set S2.
- the above is applied to each feature variable in the second feature variable set S2, and a new feature variable set S3, that is, the third feature variable set, is obtained by screening S2. Each feature variable in S3 corresponds to a forecast-month IV value IV_te, and the learning weight of each feature variable in S3 is α × IV_te / (overall IV value), where α is the weight coefficient (0 ≤ α ≤ 1) and its initial default value is 1.
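The weighting rules can be sketched as below, reading the learning weight as the weight coefficient multiplied by the ratio of the forecast-month IV to the overall IV, as described in the text. The helper names, the way a fitted curve is passed in (a callable taking the month index) and the sample numbers are assumptions.

```python
def target_period_iv(monthly_iv, fitted_curve=None):
    """IV in the forecast month: from the accepted curve, else the mean of the last two months."""
    if fitted_curve is not None:
        return fitted_curve(len(monthly_iv) + 1)
    return (monthly_iv[-2] + monthly_iv[-1]) / 2


def assign_weights(s1, s3_iv_te, overall_iv, alpha=1.0, fixed_weight=1.0):
    """Fixed weight for S1 features; alpha * IV_te / overall IV for S3 features."""
    weights = {name: fixed_weight for name in s1}
    for name, iv_te in s3_iv_te.items():
        weights[name] = alpha * iv_te / overall_iv[name]
    return weights


# Hypothetical feature with monotonic monthly IV and no accepted curve.
quiz_monthly_iv = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2]
iv_te = target_period_iv(quiz_monthly_iv)  # mean of the last two months
weights = assign_weights(["attendance"], {"quiz_score": iv_te}, {"quiz_score": 0.6})
```

Here `quiz_score` receives a weight of about 0.5 because its forecast-month IV is about half of its overall IV, while `attendance` keeps the fixed weight of 1.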
- the method further includes:
- when the comparison result between the second training result and the first training result does not satisfy the preset condition, adjusting the weight coefficient corresponding to the feature variables in the third feature variable set, obtaining new weights for those feature variables based on the adjusted weight coefficient, performing model training again based on the new weights, and comparing the first training result with the second training result again, until the comparison result satisfies the preset condition.
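The adjust-and-retrain loop can be sketched as follows. The source does not state the preset condition or how the weight coefficient is adjusted, so this sketch assumes the condition is "the second result exceeds the first" and that the coefficient is lowered from 1 in fixed steps; the training itself is left to a caller-supplied function (`train_and_score`, hypothetical) that returns the second training result for a given weight assignment.

```python
def tune_weight_coefficient(train_and_score, fixed_weights, iv_ratios, auc0, step=0.1):
    """Lower alpha from 1 towards 0 until the retrained model beats auc0.

    train_and_score: callable taking {feature: weight} and returning the AUC (assumed).
    fixed_weights:   weights of the first-set features, left unchanged.
    iv_ratios:       IV_te / overall IV for each third-set feature.
    Returns (alpha, auc1, weights), or None if no tried alpha satisfies the condition.
    """
    n_steps = round(1 / step)
    for k in range(n_steps + 1):
        alpha = max(0.0, 1 - k * step)
        weights = dict(fixed_weights)
        weights.update({name: alpha * ratio for name, ratio in iv_ratios.items()})
        auc1 = train_and_score(weights)
        if auc1 > auc0:  # assumed form of the preset condition
            return alpha, auc1, weights
    return None


# Stand-in for real training, used only to exercise the loop.
def fake_train_and_score(weights):
    return 0.80 if weights["quiz_score"] < 0.42 else 0.70


result = tune_weight_coefficient(
    fake_train_and_score, {"attendance": 1.0}, {"quiz_score": 0.5}, auc0=0.75)
```

With the stand-in scorer, the loop stops at a weight coefficient of 0.8, the first value for which the retrained score exceeds the first training result.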
- in this way, a tree model attention learning mechanism based on feature-weighted learning is realized, and the obtained model can incorporate feature variables that have strong predictive ability but are subject to data offset across time.
- model training is performed based on the filtered feature variables and corresponding weights input to the LightGBM tree model.
- first, the LightGBM tree model is trained based on the first feature variable set S1 and the corresponding weights to obtain the first training model M0; the first training result is the accuracy value on the prediction set, that is, the AUC value, denoted AUC0. Then, the LightGBM tree model is trained based on the first feature variable set S1, the third feature variable set S3 and the corresponding weights to obtain the second training model M1; the output second training result is the accuracy value on the prediction set, that is, the AUC value, denoted AUC1.
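The comparison of AUC0 and AUC1 can be sketched without any modelling library: AUC is computed directly from labels and predicted scores as the share of positive/negative pairs ranked correctly, and the second model is preferred when it improves on the first. The selection rule and the `min_gain` margin are assumptions, since the source only refers to a preset condition. How the feature weights enter LightGBM is also not stated; one candidate is its `feature_contri` parameter, which scales each feature's split gain, but that mapping is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of correctly ordered positive/negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def select_model(auc0, auc1, min_gain=0.0):
    """Prefer the second model M1 only if it improves on M0 by more than min_gain (assumed rule)."""
    return "M1" if auc1 - auc0 > min_gain else "M0"


# Hypothetical prediction-set labels (1 = resignation) and model scores.
labels = [0, 0, 1, 1]
auc0 = auc(labels, [0.1, 0.4, 0.35, 0.8])  # scores from M0
auc1 = auc(labels, [0.1, 0.3, 0.6, 0.8])   # scores from M1
chosen = select_model(auc0, auc1)
```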
- the method for predicting the behavior of a target object for data offset divides the feature variables affected by data offset across time according to differences in prediction stability and assigns weights to them, and then inputs the feature variables and their weight assignments into the LightGBM tree model for training. Since the training process adopts the tree model attention learning mechanism based on feature-weighted learning, the obtained model can incorporate feature variables that have strong predictive ability but are subject to data offset across time, which addresses the instability of the feature variables. Since the feature variables with data offset are retained, the richness of the feature variable set is guaranteed, and the model prediction accuracy is improved while the model risk is reduced.
- the behavior prediction result can also be stored in a node of a blockchain.
- the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- Blockchain essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify its Validity of information (anti-counterfeiting) and generation of the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- This application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
- The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer.
- Generally, program modules include routines, computer-readable instructions, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types.
- The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network.
- In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
- The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (ROM), or a random access memory (RAM) or the like.
- This application provides an embodiment of an apparatus for predicting the behavior of a target object under data shift; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
- The apparatus can be applied in various electronic devices.
- The apparatus for predicting the behavior of a target object under data shift described in this embodiment includes: a feature acquisition module 301, a feature set generation module 302, an assignment module 303, a training module 304 and a prediction module 305.
- The feature acquisition module 301 is configured to acquire historical data related to the behavior of the target object, extract feature variables of multiple dimensions from the historical data, and preprocess the feature variables.
- The feature set generation module 302 is configured to perform feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set.
- The assignment module 303 is configured to perform secondary screening on the second feature variable set to obtain a third feature variable set, and to assign weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods.
- The training module 304 is configured to train the preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and to train the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result.
- The prediction module 305 is configured to output the second training model when the comparison between the second training result and the first training result satisfies a preset condition, and to predict the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
- When preprocessing the feature variables, the feature acquisition module 301 is specifically configured to perform data cleaning, variable binning and numerical encoding, in that order, on the sample data of the feature variables.
- For how the feature acquisition module 301 obtains the original feature variable set and carries out preprocessing, refer to the above method embodiments; the details are not repeated here.
- The feature set generation module 302 screens and groups the originally obtained feature variables based on IV values and PSI values. When performing feature screening on the preprocessed feature variables to generate the first feature variable set and the second feature variable set, it is specifically configured to: take the sample data of the feature variables in multiple time periods as a training sample set, and acquire the sample data of the feature variables in the target time period as a prediction sample set; compute the IV values and PSI values of the feature variables based on the training sample set and the prediction sample set; select from the original feature variable set the feature variables whose IV values and PSI values satisfy the first threshold group to generate the first feature variable set; and select from the remaining feature variables of the original feature variable set those whose IV values and PSI values satisfy the second threshold group to generate the second feature variable set. For details, refer to the above method embodiments, which are not repeated here.
- The weight assigned by the assignment module 303 is the initial weight for attention learning. The purpose of the secondary screening of the second feature variable set in this embodiment is to remove feature variables that have lost their predictive value because of the time span.
- When performing secondary screening on the second feature variable set to obtain the third feature variable set, the assignment module 303 is specifically configured to: fit curves to the monthly IV values of each feature variable in the second feature variable set using multiple fitting functions, generating multiple predictive-power fluctuation curves for each feature variable; take each feature variable in turn as the current feature variable and compare the fitting root mean square errors of its predictive-power fluctuation curves; determine whether the ratio of the smallest fitting root mean square error to the monthly IV mean of the current feature variable is greater than a preset threshold; if it is greater, further determine whether the monthly IV values of the current feature variable are monotonic; and remove the current feature variable when they are not monotonic.
- When assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods, the assignment module 303 is specifically configured to: assign a preset fixed weight to the feature variables in the first feature variable set; for feature variables in the third feature variable set whose ratio is not greater than the preset threshold, obtain the IV value in the target time period from the predictive-power fluctuation curve with the smallest fitting root mean square error, and assign the weight based on the obtained IV value and the overall IV value; and for feature variables in the third feature variable set whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtain the IV value in the target time period from the IV values of the two time periods nearest to the target time period, and assign the weight based on the obtained IV value and the overall IV value.
- The preset fixed weight in this embodiment is 1, and the preset threshold may be 0.2.
- Assigning a weight based on the obtained IV value and the overall IV value means multiplying the ratio of the obtained IV value to the overall IV value by a weight coefficient; the weight coefficient ranges from 0 to 1, and its initial value is 1.
- When the prediction module 305 determines that the comparison between the second training result and the first training result does not satisfy the preset condition, it causes the training module 304 to adjust the weight coefficients of the feature variables in the third feature variable set, obtain new weights for those feature variables from the weight coefficients, and retrain the model with the new weights; the prediction module 305 then compares the first training result and the second training result again, and this repeats until the comparison satisfies the preset condition.
- By adjusting the weights, a tree-model attention learning mechanism based on feature-weighted learning is realized, and the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time.
- The apparatus for predicting the behavior of a target object under data shift provided by this application partitions the feature variables that shift over time according to their predictive stability, assigns weights to them, and then feeds the feature variables and their weights into the LightGBM tree model for training. Because training uses a tree-model attention learning mechanism based on feature-weighted learning, the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time, which addresses the instability of those feature variables. Because the shifted feature variables are retained, the richness of the feature variable set is preserved, and model prediction accuracy improves while model risk is reduced.
- FIG. 4 is a block diagram of the basic structure of a computer device according to this embodiment.
- The computer device 4 includes a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other through a system bus.
- The memory 41 stores computer-readable instructions. When executing them, the processor 42 implements the steps of the method for predicting the behavior of a target object under data shift described in the above method embodiments, with the beneficial effects corresponding to that method, which are not repeated here.
- Only the computer device 4 having the memory 41, the processor 42 and the network interface 43 is shown in the figure, but it should be understood that implementing all of the components shown is not required; more or fewer components may be implemented instead.
- The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP) and embedded devices.
- the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
- the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
- The memory 41 includes at least one type of readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
- The readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk and the like.
- the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or a memory of the computer device 4 .
- the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
- the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
- The memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as the computer-readable instructions corresponding to the above method for predicting the behavior of a target object under data shift.
- The memory 41 can also be used to temporarily store various types of data that have been output or are to be output.
- In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions corresponding to the method for predicting the behavior of a target object under data shift.
- the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
- This application also provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions executable by at least one processor, so that the at least one processor performs the steps of the method for predicting the behavior of a target object under data shift described above, with the beneficial effects corresponding to that method, which are not repeated here.
- The methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
- On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product.
- The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- Artificial Intelligence (AREA)
- Economics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Business, Economics & Management (AREA)
- Evolutionary Biology (AREA)
- Marketing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Entrepreneurship & Innovation (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- Tourism & Hospitality (AREA)
- Development Economics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Game Theory and Decision Science (AREA)
- Technology Law (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
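The application belongs to the field of artificial intelligence and relates to a method for predicting the behavior of a target object under data shift, and to related devices. The method includes: acquiring feature variables, then preprocessing and screening them to generate a first feature variable set and a second feature variable set; assigning weights to the first feature variable set, deriving a third feature variable set from the second feature variable set, and assigning weights to it; training a model on the first feature variable set and the corresponding weights to output a training result, and training a model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training result; and, when the comparison of the two training results satisfies a preset condition, outputting the second training model for behavior prediction. The application also relates to blockchain technology: the behavior prediction results may be stored in a blockchain. The method can improve model prediction accuracy while reducing model risk.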
Description
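This application claims priority to Chinese patent application No. 202011487422.8, filed with the China Patent Office on December 16, 2020 and entitled "Method for predicting the behavior of a target object under data shift and related devices", the entire contents of which are incorporated herein by reference.
This application relates to the technical field of artificial intelligence, and in particular to a method, an apparatus, a computer device and a storage medium for predicting the behavior of a target object under data shift.
In prediction scenarios where a long time interval separates the training set and the prediction set of a target object, the distribution and the predictive power of the target object's feature variables fluctuate over time. Such unstable feature variables increase the uncertainty of model predictions and therefore the prediction risk. To reduce that risk, current practice either removes unstable feature variables through feature selection or smooths the information in the feature variables. The inventors realized that these unstable feature variables still carry information useful for prediction, so existing solutions lose information diversity while reducing model risk, which lowers the prediction accuracy of the model.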
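Summary
The purpose of the embodiments of this application is to provide a method, an apparatus, a computer device and a storage medium for predicting the behavior of a target object under data shift, so as to solve the prior-art problem that information diversity is lost while reducing model risk, which lowers the prediction accuracy of the model.
To solve the above technical problem, an embodiment of this application provides a method for predicting the behavior of a target object under data shift, which adopts the following technical solution:
A method for predicting the behavior of a target object under data shift, including the following steps:
acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables;
performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set;
performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods;
training a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
when the comparison between the second training result and the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.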
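To solve the above technical problem, an embodiment of this application also provides an apparatus for predicting the behavior of a target object under data shift, which adopts the following technical solution:
An apparatus for predicting the behavior of a target object under data shift, including:
a feature acquisition module, configured to acquire historical data related to the behavior of a target object, extract feature variables of multiple dimensions from the historical data, and preprocess the feature variables;
a feature set generation module, configured to perform feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set;
an assignment module, configured to perform secondary screening on the second feature variable set to obtain a third feature variable set, and to assign weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods;
a training module, configured to train a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and to train the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
a prediction module, configured to output the second training model when the comparison between the second training result and the first training result satisfies a preset condition, and to predict the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.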
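To solve the above technical problem, an embodiment of this application also provides a computer device, which adopts the following technical solution:
A computer device, including a memory and a processor, the memory storing computer-readable instructions, wherein the processor implements the following steps when executing the computer-readable instructions:
acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables;
performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set;
performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods;
training a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
when the comparison between the second training result and the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.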
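To solve the above technical problem, an embodiment of this application also provides a computer-readable storage medium, which adopts the following technical solution:
A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to perform the following steps:
acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables;
performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set;
performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods;
training a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
when the comparison between the second training result and the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.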
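Compared with the prior art, the method, apparatus, computer device and storage medium for predicting the behavior of a target object under data shift provided by the embodiments of this application mainly have the following beneficial effects:
The feature variables that shift over time are partitioned according to their predictive stability and assigned weights, and the feature variables and their weights are then fed into the LightGBM tree model for training. The resulting model can take in feature variables that have strong predictive ability but undergo data shift over time, which addresses the instability of those feature variables. Because the shifted feature variables are retained, the richness of the feature variable set is preserved, and model prediction accuracy improves while model risk is reduced.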
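To explain the solutions in this application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below correspond to some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of an exemplary system architecture to which this application can be applied;
FIG. 2 is a flowchart of an embodiment of the method for predicting the behavior of a target object under data shift according to this application;
FIG. 3 is a schematic structural diagram of an embodiment of the apparatus for predicting the behavior of a target object under data shift according to this application;
FIG. 4 is a schematic structural diagram of an embodiment of the computer device according to this application.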
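Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "including" and "having", and any variations of them, in the specification, the claims and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification, the claims or the above drawings are used to distinguish different objects, not to describe a particular order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. Occurrences of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.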
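As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links or fiber-optic cables.
Users can use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browsers, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers.
The server 105 may be a server that provides various services, for example a background server that supports the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the method for predicting the behavior of a target object under data shift provided by the embodiments of this application is generally executed by the server; accordingly, the apparatus for predicting the behavior of a target object under data shift is generally deployed in the server.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires.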
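With continued reference to FIG. 2, a flowchart of an embodiment of the method for predicting the behavior of a target object under data shift according to this application is shown. The method includes the following steps:
S201: acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables;
S202: performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set;
S203: performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods;
S204: training a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result;
S205: when the comparison between the second training result and the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
The above steps are explained in detail below.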
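For step S201, the target object in this embodiment is an object for which a prediction is needed, specifically an object whose behavior is to be predicted, such as an insurance agent in an insurance agent recruitment scenario. When acquiring feature variables of multiple dimensions related to the target object's behavior, attribute information of the target object and associated information related to the target object are extracted from the historical data. In the insurance agent recruitment scenario, for example, the behavior prediction may be a prediction of whether an insurance agent is retained after a certain period (such as three months), that is, whether the agent will leave. When predicting retention, the insurance agent is the target object. The attribute information includes the agent's basic information, such as gender, age and base income. The associated information includes performance in the pre-job recruitment class (such as attendance, quiz scores and activity participation), activity on the insurance agent platform, and historical policy purchases. Feature variables of multiple dimensions can be extracted from this information to obtain the original feature variable set used to predict insurance agent retention.
In a specific embodiment, after the data preprocessing of this step and the feature screening of steps S202 and S203, model training in step S204 trains the second training model on the training sample set of the first and third feature variable sets. Every training sample is labeled: the target variable of some samples is "retained" and that of the others is "left". Once the second training model has been trained, the prediction sample set (containing data of multiple target objects) for the feature variables in the union of the first and third feature variable sets is input into the second training model, which outputs the probability that each target object's behavior after the period is "retained" or "left". The outcome with the larger probability is taken as the target object's likely behavior, which completes the behavior prediction.
Further, preprocessing the feature variables includes performing data cleaning, variable binning and numerical encoding, in that order, on the sample data of the feature variables. Specifically, after the feature variables are obtained, the distribution characteristics of their data are analyzed, including but not limited to data saturation, presence of outliers, maximum, minimum, mean and distribution type. Data cleaning then follows from these characteristics and handles dirty data, missing values, outliers and so on. For missing values, for example, feature variables whose missing rate exceeds a preset threshold (set as appropriate, e.g. 50%, 70% or 90%) may be deleted and excluded from the model features. After data cleaning, the sample values of each feature variable are binned and encoded. For a continuous variable, the sample values are divided into bins by equal frequency, the WOE value of each bin is computed, and each bin is encoded with its WOE value; during encoding, missing values of a continuous variable may be replaced by some very large value. For a discrete variable, each distinct sample value forms one bin, samples with missing values form a bin of their own, and each bin may be numerically encoded with its target hit rate.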
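The equal-frequency binning and WOE encoding of a continuous variable described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the additive smoothing constant `eps`, and the convention that label 1 marks the event (for example "left") are not taken from the application.

```python
import math
from bisect import bisect_right


def equal_frequency_edges(values, n_bins):
    """Cut points that put roughly the same number of samples in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[k * n // n_bins] for k in range(1, n_bins)]


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


def woe_per_bin(bins, labels, n_bins, eps=0.5):
    """WOE of each bin; label 1 marks the event, 0 the non-event.

    eps is an additive smoothing constant (an assumption) that avoids log(0).
    """
    pos = [eps] * n_bins
    neg = [eps] * n_bins
    for b, y in zip(bins, labels):
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    pos_total, neg_total = sum(pos), sum(neg)
    return [math.log((pos[b] / pos_total) / (neg[b] / neg_total))
            for b in range(n_bins)]


def woe_encode(values, labels, n_bins):
    """Replace every raw value with the WOE of its bin."""
    edges = equal_frequency_edges(values, n_bins)
    bins = assign_bins(values, edges)
    woe = woe_per_bin(bins, labels, n_bins)
    return [woe[b] for b in bins]
```

Values equal to a cut point fall into the upper bin, so tied values always share a bin; with heavily tied data the bins may therefore be less balanced than a strict equal-frequency split.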
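For step S202, in this embodiment the originally obtained feature variables are screened and grouped based on IV values (IV stands for information value, which measures a feature variable's contribution to the model) and PSI values (PSI stands for Population Stability Index, which measures a feature variable's stability). The screening yields a first feature variable set whose variables have strong predictive power, a stable distribution over time and stable predictive power over time, and a second feature variable set whose variables have strong predictive power and a stable distribution over time but unstable predictive power. That is, the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set.
In some embodiments, performing feature screening on the preprocessed feature variables to generate the first feature variable set and the second feature variable set includes: taking the sample data of the feature variables in multiple time periods as a training sample set, and acquiring the sample data of the feature variables in the target time period as a prediction sample set; computing the IV values and PSI values of the feature variables based on the training sample set and the prediction sample set; selecting from the original feature variable set the feature variables whose IV values and PSI values satisfy a first threshold group to generate the first feature variable set; and selecting from the remaining feature variables of the original feature variable set those whose IV values and PSI values satisfy a second threshold group to generate the second feature variable set.
Specifically, the multiple time periods are historical time periods, for example the past six months with each month as one period, giving six historical periods; the target time period is the period to be predicted. The IV values include the overall IV value, the monthly IV values and/or the coefficient of variation of the monthly IV values; the PSI values include the monthly PSI values and the prediction-training PSI value. The overall IV is the IV value over the whole six-month sample, and the monthly IV is the IV value of each month's sample. The monthly IV values assess a feature's predictive power in a single month, and the coefficient of variation of the monthly IV values indicates how stable that predictive power is; in contrast, the overall IV assesses the feature's overall predictive power. A monthly PSI value is the PSI of one month's sample set relative to the distribution of the previous month, and the prediction-training PSI is the PSI of the prediction sample set relative to the distribution of the training sample set.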
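As a sketch of the quantities defined above, the following computes IV, PSI and the coefficient of variation from per-bin counts using their standard textbook definitions. The application does not spell out its exact formulas, so these definitions, the helper names and the `eps` floor for empty bins are all assumptions.

```python
import math


def _shares(counts, eps=1e-6):
    """Turn per-bin counts into proportions, floored at eps to avoid log(0)."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def psi(reference_counts, current_counts):
    """PSI of the current per-bin distribution relative to a reference one."""
    ref = _shares(reference_counts)
    cur = _shares(current_counts)
    return sum((c - r) * math.log(c / r) for c, r in zip(cur, ref))


def information_value(event_counts, non_event_counts):
    """IV of a binned feature from per-bin event and non-event counts."""
    p = _shares(event_counts)
    q = _shares(non_event_counts)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def coefficient_of_variation(xs):
    """Population standard deviation divided by the mean."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.sqrt(var) / mean
```

Under this reading, a monthly PSI such as the one for month 2 against month 1 is `psi(month1_counts, month2_counts)`, and the prediction-training PSI is `psi(train_counts, prediction_counts)`.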
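In this embodiment, the first threshold group includes a first overall-IV threshold, a first monthly-IV-mean threshold, a first monthly-IV coefficient-of-variation threshold, a first monthly-PSI-mean threshold and a first prediction-training PSI threshold; the second threshold group includes a second overall-IV threshold, a second monthly-IV-mean threshold, a second monthly-PSI-mean threshold and a second prediction-training PSI threshold. Some thresholds in the two groups may be equal; for example, the first threshold group may be (0.1, 0.1, 1, 0.25, 0.25) and the second threshold group (0.5, 0.5, 0.25, 0.25).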
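The feature screening process is illustrated below using a smart agent recruitment scenario:
First, the people who joined the company over six months are taken as the training sample set, and the people who joined in one month as the prediction sample set. For each feature, the overall IV value (denoted $IV_{ALL}$), the monthly IV values (denoted $IV_1$, $IV_2$, $IV_3$, $IV_4$, $IV_5$, $IV_6$), the monthly PSI values (denoted $PSI_{21}$, $PSI_{32}$, $PSI_{43}$, $PSI_{54}$, $PSI_{65}$) and the prediction-training PSI (denoted $PSI_{te\text{-}tr}$) are computed.
Second, the first feature variable set S1 is obtained by selecting, from the original feature variable set, the feature variables whose overall IV value and monthly IV mean are both greater than or equal to a limit a (corresponding to the first overall-IV threshold and the first monthly-IV-mean threshold), whose monthly IV coefficient of variation is less than or equal to a limit b (corresponding to the first monthly-IV coefficient-of-variation threshold), and whose monthly PSI mean and prediction-training PSI are both less than or equal to a limit c (corresponding to the first monthly-PSI-mean threshold and the first prediction-training PSI threshold). Expressed as a formula, with $S$ denoting the original feature variable set, $\overline{IV}$ and $\overline{PSI}$ the monthly means, and $CV_{IV}$ the coefficient of variation of the monthly IV values:
$$S_1 = \{\, x \in S \mid IV_{ALL} \ge a,\ \overline{IV} \ge a,\ CV_{IV} \le b,\ \overline{PSI} \le c,\ PSI_{te\text{-}tr} \le c \,\}$$
Finally, the second feature variable set S2 is obtained by selecting, from the complement of S1 in the original feature variable set, the feature variables whose overall IV value and monthly IV mean are both greater than or equal to a limit d (corresponding to the second overall-IV threshold and the second monthly-IV-mean threshold), and whose monthly PSI mean and prediction-training PSI are both less than or equal to the limit c (corresponding to the second monthly-PSI-mean threshold and the second prediction-training PSI threshold). Expressed as a formula:
$$S_2 = \{\, x \in S \setminus S_1 \mid IV_{ALL} \ge d,\ \overline{IV} \ge d,\ \overline{PSI} \le c,\ PSI_{te\text{-}tr} \le c \,\}$$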
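The two-stage selection of S1 and S2 can be sketched as a single pass over per-feature statistics. The threshold values below are the example groups (0.1, 0.1, 1, 0.25, 0.25) and (0.5, 0.5, 0.25, 0.25) quoted in the description; the dictionary layout, key names and function name are hypothetical.

```python
from statistics import mean, pstdev

# Example threshold groups from the description; key names are assumptions.
FIRST_THRESHOLDS = {"iv_all": 0.1, "iv_mean": 0.1, "iv_cv": 1.0,
                    "psi_mean": 0.25, "psi_te_tr": 0.25}
SECOND_THRESHOLDS = {"iv_all": 0.5, "iv_mean": 0.5,
                     "psi_mean": 0.25, "psi_te_tr": 0.25}


def split_feature_sets(stats, first=FIRST_THRESHOLDS, second=SECOND_THRESHOLDS):
    """Return (S1, S2) as lists of feature names.

    stats maps a feature name to a dict with keys "iv_all", "iv_monthly",
    "psi_monthly" and "psi_te_tr".
    """
    s1, s2 = [], []
    for name, s in stats.items():
        iv_mean = mean(s["iv_monthly"])
        iv_cv = pstdev(s["iv_monthly"]) / iv_mean if iv_mean else float("inf")
        psi_mean = mean(s["psi_monthly"])

        def distribution_stable(group):
            return (psi_mean <= group["psi_mean"]
                    and s["psi_te_tr"] <= group["psi_te_tr"])

        def informative(group):
            return s["iv_all"] >= group["iv_all"] and iv_mean >= group["iv_mean"]

        if informative(first) and iv_cv <= first["iv_cv"] and distribution_stable(first):
            s1.append(name)            # strong, stable distribution, stable power
        elif informative(second) and distribution_stable(second):
            s2.append(name)            # strong, stable distribution, unstable power
    return s1, s2
```

Because S2 is drawn only from features that failed the S1 test, the `elif` branch plays the role of the complement of S1 in the original feature variable set.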
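For step S203, the weight in this step is the initial weight for attention learning. The purpose of the secondary screening of the second feature variable set in this embodiment is to remove feature variables that have lost their predictive value because of the time span.
In some embodiments, performing secondary screening on the second feature variable set to obtain the third feature variable set includes: fitting curves to the monthly IV values of each feature variable in the second feature variable set using multiple fitting functions, generating multiple predictive-power fluctuation curves for each feature variable; taking each feature variable in turn as the current feature variable and comparing the fitting root mean square errors of its predictive-power fluctuation curves; determining whether the ratio of the smallest fitting root mean square error to the monthly IV mean of the current feature variable is greater than a preset threshold; if it is greater, further determining whether the monthly IV values of the current feature variable are monotonic; and removing the current feature variable when they are not monotonic. Checking whether this ratio exceeds the preset threshold and whether the monthly IV values are monotonic removes feature variables without predictive value, which improves the stability of model predictions and reduces prediction risk; selecting the fitted curve with the smallest fitting root mean square error gives the prediction higher accuracy.
Further, assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods includes:
assigning a preset fixed weight to the feature variables in the first feature variable set; for feature variables in the third feature variable set whose ratio is not greater than the preset threshold, obtaining the IV value in the target time period from the predictive-power fluctuation curve with the smallest fitting root mean square error, and assigning the weight based on the obtained IV value and the overall IV value; and for feature variables in the third feature variable set whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtaining the IV value in the target time period from the IV values of the two time periods nearest to the target time period, and assigning the weight based on the obtained IV value and the overall IV value.
In this embodiment the preset fixed weight is 1 and the preset threshold may be 0.2. Assigning a weight based on the obtained IV value and the overall IV value means multiplying the ratio of the obtained IV value to the overall IV value by a weight coefficient; the weight coefficient ranges from 0 to 1, and its initial value is 1.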
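The secondary screening and weight assignment are illustrated below using the smart agent recruitment scenario:
Take a feature variable $X_1$ in the second feature variable set S2 as an example. Curves are fitted to the monthly IV values of $X_1$; the fitting curve types may include $y = a x + b$, $y = a \ln(x) + b$, $y = a x^2 + b$, $y = a \sin(x) + b$ and so on, which yields a corresponding number of fitted curves, that is, predictive-power fluctuation curves. The fitted curve with the smallest fitting root mean square error (RMSE) is selected. When the ratio of its RMSE to the monthly IV mean is less than or equal to a limit e (the preset threshold), the IV value for the prediction month (the target time period) is computed from the selected curve and denoted $IV_{te}$. If none of the fitted curves has a ratio of RMSE to monthly IV mean less than or equal to e, the monotonicity of the monthly IV values is checked, that is, whether $IV_1 \le IV_2 \le IV_3 \le IV_4 \le IV_5 \le IV_6$ or $IV_1 \ge IV_2 \ge IV_3 \ge IV_4 \ge IV_5 \ge IV_6$. If the monthly IV values are monotonic, the prediction-month value $IV_{te}$ is set to the mean of $IV_5$ and $IV_6$; otherwise $X_1$ is removed from the second feature variable set S2.
Repeating this process for every feature variable in S2 screens S2 down to a new feature variable set S3, the third feature variable set. Each feature variable in S3 has a prediction-month IV value $IV_{te}$, and its learning weight is
$$w = \gamma \cdot \frac{IV_{te}}{IV_{ALL}}$$
where $\gamma$ is the weight coefficient ($0 \le \gamma \le 1$), with an initial default value of 1.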
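The secondary screening and weight assignment can be sketched in plain Python. All four curve families named in the description have the form y = a·g(x) + b, so each reduces to a simple linear regression on g(x). The months being indexed x = 1, 2, ... and the prediction month being the next index are assumptions, as are the function names.

```python
import math

# Basis functions g(x) for the four curve families y = a*g(x) + b.
BASIS = {
    "linear": lambda x: float(x),
    "log": math.log,
    "quadratic": lambda x: float(x * x),
    "sine": math.sin,
}


def fit_basis(g, ys):
    """Least-squares fit of y = a*g(x) + b for x = 1..len(ys); returns (a, b, rmse)."""
    us = [g(x) for x in range(1, len(ys) + 1)]
    u_mean = sum(us) / len(us)
    y_mean = sum(ys) / len(ys)
    suu = sum((u - u_mean) ** 2 for u in us)
    a = sum((u - u_mean) * (y - y_mean) for u, y in zip(us, ys)) / suu
    b = y_mean - a * u_mean
    rmse = math.sqrt(sum((a * u + b - y) ** 2 for u, y in zip(us, ys)) / len(ys))
    return a, b, rmse


def is_monotonic(ys):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(ys, ys[1:]))
    return all(p <= q for p, q in pairs) or all(p >= q for p, q in pairs)


def screen_and_weight(iv_monthly, iv_all, e=0.2, gamma=1.0):
    """Learning weight gamma * IV_te / IV_ALL, or None if the feature is dropped."""
    fits = {name: fit_basis(g, iv_monthly) for name, g in BASIS.items()}
    best = min(fits, key=lambda name: fits[name][2])
    a, b, rmse = fits[best]
    iv_mean = sum(iv_monthly) / len(iv_monthly)
    if rmse / iv_mean <= e:
        # Good fit: extrapolate the best curve to the prediction month.
        iv_te = a * BASIS[best](len(iv_monthly) + 1) + b
    elif is_monotonic(iv_monthly):
        # Poor fit but monotonic: average the two most recent months.
        iv_te = (iv_monthly[-2] + iv_monthly[-1]) / 2
    else:
        return None
    return gamma * iv_te / iv_all
```

A feature whose predictive power trends upward can receive a weight above 1 under this formula, since the extrapolated $IV_{te}$ may exceed $IV_{ALL}$; the description does not say whether such weights are capped.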
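For steps S204 and S205, in this embodiment the method further includes:
When the comparison between the second training result and the first training result does not satisfy the preset condition, adjusting the weight coefficients of the feature variables in the third feature variable set, obtaining new weights for those feature variables from the weight coefficients, retraining the model with the new weights, and then comparing the first training result and the second training result again, until the comparison satisfies the preset condition. By adjusting the weights, a tree-model attention learning mechanism based on feature-weighted learning is realized, and the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time.
Specifically, model training feeds the screened feature variables and their corresponding weights into the LightGBM tree model. First, the LightGBM tree model is trained on the first feature variable set S1 and the corresponding weights to obtain the first training model M0; the first training result it outputs is the accuracy on the prediction set, that is, the AUC value, denoted AUC0. Then the LightGBM tree model is trained on the first feature variable set S1, the third feature variable set S3 and the corresponding weights to obtain the second training model M1; the second training result it outputs is likewise the AUC value on the prediction set, denoted AUC1.
AUC1 and AUC0 are then compared. If AUC1 is greater than or equal to AUC0, model M1 is output and the target object is predicted with M1. If AUC1 is less than AUC0, the weight coefficient $\gamma$ in the learning weight $\gamma \cdot IV_{te} / IV_{ALL}$ needs to be adjusted:
specifically, $\gamma$ is decreased and steps S204 and S205 are repeated until AUC1 is greater than or equal to AUC0.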
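The adjustment loop for the weight coefficient can be sketched as below. The description only says that γ is decreased until AUC1 reaches AUC0; the multiplicative step `shrink`, the lower bound `min_gamma`, the callback `train_and_score` and the function name are all assumptions made for illustration. How the weights enter LightGBM is also not stated; LightGBM's `feature_contri` parameter, which scales each feature's split gain, is one candidate, but that mapping is an assumption too.

```python
def tune_gamma(train_and_score, base_weights, auc0,
               gamma=1.0, shrink=0.8, min_gamma=0.05):
    """Shrink gamma until the S1+S3 model is at least as good as the S1 model.

    train_and_score(weights) is a caller-supplied function that trains the
    second model with the given S3 feature weights and returns AUC1.
    base_weights maps each S3 feature to IV_te / IV_ALL.
    Returns (gamma, auc1), or None if no tried gamma satisfies AUC1 >= AUC0.
    """
    while gamma >= min_gamma:
        weights = {f: gamma * w for f, w in base_weights.items()}
        auc1 = train_and_score(weights)
        if auc1 >= auc0:          # preset condition met: keep model M1
            return gamma, auc1
        gamma *= shrink           # otherwise down-weight the shifted features
    return None
```

Returning `None` when the bound is reached makes the failure case explicit; in that situation a caller might fall back to the first training model M0, although the description does not cover this case.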
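The method for predicting the behavior of a target object under data shift provided by this application partitions the feature variables that shift over time according to their predictive stability, assigns weights to them, and then feeds the feature variables and their weights into the LightGBM tree model for training. Because training uses a tree-model attention learning mechanism based on feature-weighted learning, the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time, which addresses the instability of those feature variables. Because the shifted feature variables are retained, the richness of the feature variable set is preserved, and model prediction accuracy improves while model risk is reduced.
To further ensure the privacy and security of the information, after the step of predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model, the behavior prediction result may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.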
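The hash linking between data blocks can be illustrated with a toy example that stores prediction results. This is not a blockchain implementation: it has no network, consensus or signatures, and the block layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _digest(prev_hash, payload):
    """SHA-256 over the previous block's hash followed by this block's payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def make_block(prev_hash, predictions):
    """Wrap a batch of prediction results in a block linked to its predecessor."""
    payload = json.dumps(predictions, sort_keys=True)
    return {"prev_hash": prev_hash, "payload": payload,
            "hash": _digest(prev_hash, payload)}


def verify_chain(blocks):
    """Check that every block's hash is valid and points at the previous block."""
    prev = GENESIS_HASH
    for block in blocks:
        if block["prev_hash"] != prev:
            return False
        if block["hash"] != _digest(block["prev_hash"], block["payload"]):
            return False
        prev = block["hash"]
    return True
```

Because each hash covers the previous block's hash, altering a stored prediction invalidates that block and every block after it, which is the anti-counterfeiting property the description refers to.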
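This application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, computer-readable instructions, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (ROM), or a random access memory (RAM) or the like.
It should be understood that although the steps in the flowcharts of the drawings are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and not necessarily in sequence; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.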
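With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, this application provides an embodiment of an apparatus for predicting the behavior of a target object under data shift. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied in various electronic devices.
As shown in FIG. 3, the apparatus for predicting the behavior of a target object under data shift described in this embodiment includes: a feature acquisition module 301, a feature set generation module 302, an assignment module 303, a training module 304 and a prediction module 305. The feature acquisition module 301 is configured to acquire historical data related to the behavior of the target object, extract feature variables of multiple dimensions from the historical data, and preprocess the feature variables. The feature set generation module 302 is configured to perform feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set. The assignment module 303 is configured to perform secondary screening on the second feature variable set to obtain a third feature variable set, and to assign weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods. The training module 304 is configured to train a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and to train the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result. The prediction module 305 is configured to output the second training model when the comparison between the second training result and the first training result satisfies a preset condition, and to predict the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.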
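In this embodiment, when preprocessing the feature variables, the feature acquisition module 301 is specifically configured to perform data cleaning, variable binning and numerical encoding, in that order, on the sample data of the feature variables. For how the feature acquisition module 301 obtains the original feature variable set and carries out preprocessing, refer to the above method embodiments; the details are not repeated here.
Further, the feature set generation module 302 screens and groups the originally obtained feature variables based on IV values and PSI values. When performing feature screening on the preprocessed feature variables to generate the first feature variable set and the second feature variable set, it is specifically configured to: take the sample data of the feature variables in multiple time periods as a training sample set, and acquire the sample data of the feature variables in the target time period as a prediction sample set; compute the IV values and PSI values of the feature variables based on the training sample set and the prediction sample set; select from the original feature variable set the feature variables whose IV values and PSI values satisfy the first threshold group to generate the first feature variable set; and select from the remaining feature variables of the original feature variable set those whose IV values and PSI values satisfy the second threshold group to generate the second feature variable set. For details, refer to the above method embodiments, which are not repeated here.
In this embodiment, the weight assigned by the assignment module 303 is the initial weight for attention learning, and the purpose of the secondary screening of the second feature variable set is to remove feature variables that have lost their predictive value because of the time span. When performing secondary screening on the second feature variable set to obtain the third feature variable set, the assignment module 303 is specifically configured to: fit curves to the monthly IV values of each feature variable in the second feature variable set using multiple fitting functions, generating multiple predictive-power fluctuation curves for each feature variable; take each feature variable in turn as the current feature variable and compare the fitting root mean square errors of its predictive-power fluctuation curves; determine whether the ratio of the smallest fitting root mean square error to the monthly IV mean of the current feature variable is greater than a preset threshold; if it is greater, further determine whether the monthly IV values of the current feature variable are monotonic; and remove the current feature variable when they are not monotonic. Checking whether this ratio exceeds the preset threshold and whether the monthly IV values are monotonic removes feature variables without predictive value, which improves the stability of model predictions and reduces prediction risk; selecting the fitted curve with the smallest fitting root mean square error gives the prediction higher accuracy.
Further, when assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods, the assignment module 303 is specifically configured to: assign a preset fixed weight to the feature variables in the first feature variable set; for feature variables in the third feature variable set whose ratio is not greater than the preset threshold, obtain the IV value in the target time period from the predictive-power fluctuation curve with the smallest fitting root mean square error, and assign the weight based on the obtained IV value and the overall IV value; and for feature variables in the third feature variable set whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtain the IV value in the target time period from the IV values of the two time periods nearest to the target time period, and assign the weight based on the obtained IV value and the overall IV value.
In this embodiment the preset fixed weight is 1 and the preset threshold may be 0.2. Assigning a weight based on the obtained IV value and the overall IV value means multiplying the ratio of the obtained IV value to the overall IV value by a weight coefficient; the weight coefficient ranges from 0 to 1, and its initial value is 1.
For an illustration of how the above modules operate in the smart agent recruitment scenario, refer to the above method embodiments; the details are not repeated here.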
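Further, in this embodiment, when the prediction module 305 determines that the comparison between the second training result and the first training result does not satisfy the preset condition, it causes the training module 304 to adjust the weight coefficients of the feature variables in the third feature variable set, obtain new weights for those feature variables from the weight coefficients, and retrain the model with the new weights; the prediction module 305 then compares the first training result and the second training result again, and this repeats until the comparison satisfies the preset condition. For details, refer to the above method embodiments, which are not repeated here. By adjusting the weights, a tree-model attention learning mechanism based on feature-weighted learning is realized, and the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time.
The apparatus for predicting the behavior of a target object under data shift provided by this application partitions the feature variables that shift over time according to their predictive stability, assigns weights to them, and then feeds the feature variables and their weights into the LightGBM tree model for training. Because training uses a tree-model attention learning mechanism based on feature-weighted learning, the resulting model can take in feature variables that have strong predictive ability but undergo data shift over time, which addresses the instability of those feature variables. Because the shifted feature variables are retained, the richness of the feature variable set is preserved, and model prediction accuracy improves while model risk is reduced.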
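To solve the above technical problem, an embodiment of this application also provides a computer device. Refer to FIG. 4, which is a block diagram of the basic structure of the computer device of this embodiment. The computer device 4 includes a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other through a system bus. The memory 41 stores computer-readable instructions. When executing them, the processor 42 implements the steps of the method for predicting the behavior of a target object under data shift described in the above method embodiments, with the beneficial effects corresponding to that method, which are not repeated here.
It should be pointed out that only the computer device 4 having the memory 41, the processor 42 and the network interface 43 is shown in the figure, but it should be understood that implementing all of the components shown is not required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP) and embedded devices.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch pad, a voice control device or the like.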
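In this embodiment, the memory 41 includes at least one type of readable storage medium, and the computer-readable storage medium may be non-volatile or volatile. Specifically, the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as its hard disk or main memory. In other embodiments, the memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the computer device 4. The memory 41 may of course also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as the computer-readable instructions corresponding to the above method for predicting the behavior of a target object under data shift. In addition, the memory 41 can be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions corresponding to the method for predicting the behavior of a target object under data shift.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.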
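This application also provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions executable by at least one processor, so that the at least one processor performs the steps of the method for predicting the behavior of a target object under data shift described above, with the beneficial effects corresponding to that method, which are not repeated here.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
Obviously, the embodiments described above are only some of the embodiments of this application, not all of them. The drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; these embodiments are provided so that the disclosure of this application is understood more thoroughly and completely. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific implementations or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of this application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.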
Claims (20)
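- A method for predicting the behavior of a target object under data shift, including the following steps: acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables; performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the predictive stability over time of every feature variable in the first feature variable set is higher than that of every feature variable in the second feature variable set; performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods; training a preset LightGBM tree model on the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training the preset LightGBM tree model on the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result; and, when the comparison between the second training result and the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
- The method for predicting the behavior of a target object under data shift according to claim 1, wherein the method further includes: when the comparison between the second training result and the first training result does not satisfy the preset condition, adjusting the weight coefficients of the feature variables in the third feature variable set, obtaining new weights for the feature variables in the third feature variable set from the weight coefficients, retraining the model with the new weights, and then comparing the first training result and the second training result, until the comparison satisfies the preset condition.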
- 根据权利要求2所述的针对数据偏移的目标对象行为预测方法,其中,所述对预处理后的所述特征变量进行特征筛选,生成第一特征变量集合和第二特征变量集合包括:将所述特征变量在多个时间段的样本数据作为训练样本集,并获取所述特征变量在目标时间段的样本数据作为预测样本集,基于所述训练样本集和所述预测样本集计算所述特征变量的IV值和PSI值,从所述原始特征变量集中筛选出所述IV值和所述PSI值满足第一阈值组的特征变量,生成第一特征变量集,并从所述原始特征变量集剩余的特征变量中筛选出所述IV值和所述PSI值满足第二阈值组的特征变量,生成第二特征变量集。
- 根据权利要求2或3所述的针对数据偏移的目标对象行为预测方法,其中,所述IV值包括逐月IV值和逐月IV均值,所述对所述第二特征变量集合进行二次筛选得到第三特征变量集合包括:基于多个拟合函数对所述第二特征变量集合中的各特征变量的逐月IV进行曲线拟合,对每个特征变量生成多条预测能力波动曲线;依次将每个特征变量作为当前特征变量,对所述当前特征变量的多条预测能力波动曲线的拟合均方根误差进行对比,判断最小的拟合均方根误差与当前特征变量的逐月IV均值的比值是否大于预设阈值,若大于则进一步判断所述当前特征变量的各逐月IV值是否单调,并在不单调时将所述当前特征变量剔除。
- 根据权利要求4所述的针对数据偏移的目标对象行为预测方法,其中,所述IV值还包括整体IV值,所述对所述第三特征变量集合和所述第一特征变量集合中的特征变量采用不同的赋值方式分别进行权重赋值包括:对所述第一特征变量集合中的特征变量赋予预设的固定权重;对所述第三特征变量集合中所述比值不大于所述预设阈值的特征变量,根据所述最小的拟合均方根误差对应的预测能力波动曲线,求取其在所述目标时间段的IV值,基于得到的IV值和所述整体IV值进行权重赋值;对第三特征变量集合中所述比值大于所述预设阈值、且对应的所述逐月IV值单调的特征变量,根据其距所述目标时间段最近的两个时间段的IV值求取其在所述目标时间段的IV值,基于得到的IV值和所述整体IV值进行权重赋值。
- 根据权利要求1至3任一项所述的针对数据偏移的目标对象行为预测方法,其中,对所述特征变量进行预处理包括:对所述特征变量的样本数据依次进行数据清洗、数据变量分箱和数值化编码操作。
- 根据权利要求1至3任一项所述的针对数据偏移的目标对象行为预测方法,其中,在所述基于所述第一特征变量集合、所述第三特征变量集和所述第二训练模型对目标对象的行为进行预测的步骤之后还包括:将行为预测结果存储至区块链中。
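- A target object behavior prediction apparatus for data offset, comprising: a feature acquisition module, configured to acquire historical data related to the behavior of a target object, extract feature variables of multiple dimensions from the historical data, and preprocess the feature variables; a feature set generation module, configured to perform feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the cross-time prediction stability of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set; an assignment module, configured to perform secondary screening on the second feature variable set to obtain a third feature variable set, and to assign weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods; a training module, configured to train a preset LightGBM tree model according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and to train a preset LightGBM tree model according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result; and a prediction module, configured to output the second training model when the result of comparing the second training result with the first training result satisfies a preset condition, and to predict the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.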
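- A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the following steps: acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables; performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the cross-time prediction stability of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set; performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods; training a preset LightGBM tree model according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training a preset LightGBM tree model according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result; and when the result of comparing the second training result with the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
- The computer device according to claim 9, wherein the processor, when executing the computer-readable instructions, further implements the following steps: when the result of comparing the second training result with the first training result does not satisfy the preset condition, adjusting the weight coefficients corresponding to the feature variables in the third feature variable set, obtaining new weights for the feature variables in the third feature variable set based on the weight coefficients, performing model training again based on the new weights, and then comparing the first training result with the second training result, until the comparison result satisfies the preset condition.
- The computer device according to claim 10, wherein when the processor executes the computer-readable instructions to implement the step of performing feature screening on the preprocessed feature variables to generate the first feature variable set and the second feature variable set, the following steps are specifically implemented: taking sample data of the feature variables in multiple time periods as a training sample set, and acquiring sample data of the feature variables in a target time period as a prediction sample set; calculating an IV value and a PSI value of each feature variable based on the training sample set and the prediction sample set; screening out, from the original feature variable set, the feature variables whose IV value and PSI value satisfy a first threshold group to generate the first feature variable set; and screening out, from the remaining feature variables of the original feature variable set, the feature variables whose IV value and PSI value satisfy a second threshold group to generate the second feature variable set.
- The computer device according to claim 10 or 11, wherein the IV value includes monthly IV values and a monthly IV mean, and when the processor executes the computer-readable instructions to implement the step of performing secondary screening on the second feature variable set to obtain the third feature variable set, the following steps are specifically implemented: performing curve fitting on the monthly IV values of each feature variable in the second feature variable set based on multiple fitting functions, so as to generate multiple prediction-ability fluctuation curves for each feature variable; and taking each feature variable in turn as the current feature variable, comparing the fitting root-mean-square errors of the multiple prediction-ability fluctuation curves of the current feature variable, determining whether the ratio of the smallest fitting root-mean-square error to the monthly IV mean of the current feature variable is greater than a preset threshold, and if so, further determining whether the monthly IV values of the current feature variable are monotonic, and eliminating the current feature variable when they are not monotonic.
- The computer device according to claim 12, wherein the IV value further includes an overall IV value, and when the processor executes the computer-readable instructions to implement the step of assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods, the following steps are specifically implemented: assigning a preset fixed weight to the feature variables in the first feature variable set; for each feature variable in the third feature variable set whose ratio is not greater than the preset threshold, obtaining its IV value in the target time period from the prediction-ability fluctuation curve corresponding to the smallest fitting root-mean-square error, and assigning a weight based on the obtained IV value and the overall IV value; and for each feature variable in the third feature variable set whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtaining its IV value in the target time period from the IV values of the two time periods closest to the target time period, and assigning a weight based on the obtained IV value and the overall IV value.
- The computer device according to any one of claims 9 to 11, wherein when the processor executes the computer-readable instructions to implement the step of preprocessing the feature variables, the following steps are specifically implemented: sequentially performing data cleaning, data variable binning and numerical encoding on the sample data of the feature variables.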
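- A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to perform the following steps: acquiring historical data related to the behavior of a target object, extracting feature variables of multiple dimensions from the historical data, and preprocessing the feature variables; performing feature screening on the preprocessed feature variables to generate a first feature variable set and a second feature variable set, wherein the cross-time prediction stability of each feature variable in the first feature variable set is higher than that of each feature variable in the second feature variable set; performing secondary screening on the second feature variable set to obtain a third feature variable set, and assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods; training a preset LightGBM tree model according to the first feature variable set and the corresponding weights to obtain a first training model and output a first training result, and training a preset LightGBM tree model according to the first feature variable set, the third feature variable set and the corresponding weights to obtain a second training model and output a second training result; and when the result of comparing the second training result with the first training result satisfies a preset condition, outputting the second training model, and predicting the behavior of the target object based on the first feature variable set, the third feature variable set and the second training model.
- The computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps: when the result of comparing the second training result with the first training result does not satisfy the preset condition, adjusting the weight coefficients corresponding to the feature variables in the third feature variable set, obtaining new weights for the feature variables in the third feature variable set based on the weight coefficients, performing model training again based on the new weights, and then comparing the first training result with the second training result, until the comparison result satisfies the preset condition.
- The computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by the processor so that the processor performs the step of performing feature screening on the preprocessed feature variables to generate the first feature variable set and the second feature variable set, the following steps are specifically performed: taking sample data of the feature variables in multiple time periods as a training sample set, and acquiring sample data of the feature variables in a target time period as a prediction sample set; calculating an IV value and a PSI value of each feature variable based on the training sample set and the prediction sample set; screening out, from the original feature variable set, the feature variables whose IV value and PSI value satisfy a first threshold group to generate the first feature variable set; and screening out, from the remaining feature variables of the original feature variable set, the feature variables whose IV value and PSI value satisfy a second threshold group to generate the second feature variable set.
- The computer-readable storage medium according to claim 16 or 17, wherein the IV value includes monthly IV values and a monthly IV mean, and when the computer-readable instructions are executed by the processor so that the processor performs the step of performing secondary screening on the second feature variable set to obtain the third feature variable set, the following steps are specifically performed: performing curve fitting on the monthly IV values of each feature variable in the second feature variable set based on multiple fitting functions, so as to generate multiple prediction-ability fluctuation curves for each feature variable; and taking each feature variable in turn as the current feature variable, comparing the fitting root-mean-square errors of the multiple prediction-ability fluctuation curves of the current feature variable, determining whether the ratio of the smallest fitting root-mean-square error to the monthly IV mean of the current feature variable is greater than a preset threshold, and if so, further determining whether the monthly IV values of the current feature variable are monotonic, and eliminating the current feature variable when they are not monotonic.
- The computer-readable storage medium according to claim 18, wherein the IV value further includes an overall IV value, and when the computer-readable instructions are executed by the processor so that the processor performs the step of assigning weights to the feature variables in the third feature variable set and in the first feature variable set using different assignment methods, the following steps are specifically performed: assigning a preset fixed weight to the feature variables in the first feature variable set; for each feature variable in the third feature variable set whose ratio is not greater than the preset threshold, obtaining its IV value in the target time period from the prediction-ability fluctuation curve corresponding to the smallest fitting root-mean-square error, and assigning a weight based on the obtained IV value and the overall IV value; and for each feature variable in the third feature variable set whose ratio is greater than the preset threshold and whose monthly IV values are monotonic, obtaining its IV value in the target time period from the IV values of the two time periods closest to the target time period, and assigning a weight based on the obtained IV value and the overall IV value.
- The computer-readable storage medium according to any one of claims 15 to 17, wherein when the computer-readable instructions are executed by the processor so that the processor performs the step of preprocessing the feature variables, the following steps are specifically performed: sequentially performing data cleaning, data variable binning and numerical encoding on the sample data of the feature variables.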
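As a rough illustration of the screening steps recited in the claims above, the following Python sketch computes IV and PSI for a binned feature and applies the secondary screening to a feature's monthly IV series. It is a sketch under stated assumptions, not the applicant's implementation: the claims do not name the fitting functions, the threshold value, or how the target-period IV is derived from the two most recent periods, so the linear and exponential candidate curves, the `ratio_threshold=0.2` default, the two-point linear extrapolation, the `eps` clipping and all function names are hypothetical choices. At least two monthly values are required, and the exponential fit assumes they are strictly positive.

```python
import math


def divergence(p_counts, q_counts, eps=1e-6):
    """Sum over bins of (p - q) * ln(p / q) for two binned distributions."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    total = 0.0
    for p_n, q_n in zip(p_counts, q_counts):
        p = max(p_n / p_total, eps)  # clip empty bins to avoid log(0)
        q = max(q_n / q_total, eps)
        total += (p - q) * math.log(p / q)
    return total


def iv_value(positive_counts, negative_counts):
    """Information value of one binned feature from per-bin label counts."""
    return divergence(positive_counts, negative_counts)


def psi_value(train_counts, target_counts):
    """Population stability index between training and target-period bins."""
    return divergence(target_counts, train_counts)


def fit_linear(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean, y_mean = (n - 1) / 2, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def fit_exponential(ys):
    """Exponential curve obtained by fitting a line to ln(y); needs y > 0."""
    log_line = fit_linear([math.log(y) for y in ys])
    return lambda x: math.exp(log_line(x))


def rmse(curve, ys):
    """Root-mean-square error of a fitted curve against the monthly values."""
    return math.sqrt(sum((curve(x) - y) ** 2 for x, y in enumerate(ys)) / len(ys))


def is_monotonic(ys):
    """True if the series is non-decreasing or non-increasing."""
    pairs = list(zip(ys, ys[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def secondary_screen(monthly_iv, ratio_threshold=0.2):
    """Return (keep, target_period_iv) for one feature of the second set."""
    curves = [fit(monthly_iv) for fit in (fit_linear, fit_exponential)]
    best = min(curves, key=lambda curve: rmse(curve, monthly_iv))
    mean_iv = sum(monthly_iv) / len(monthly_iv)
    if rmse(best, monthly_iv) / mean_iv <= ratio_threshold:
        # Well fitted: read the target-period IV off the best curve.
        return True, best(len(monthly_iv))
    if is_monotonic(monthly_iv):
        # Poorly fitted but monotonic: assumed linear extrapolation
        # from the two most recent periods.
        return True, 2 * monthly_iv[-1] - monthly_iv[-2]
    return False, None  # poorly fitted and not monotonic: eliminate
```

A feature kept by `secondary_screen` would then receive a weight derived from the returned target-period IV and its overall IV; the claims leave that formula open, so it is not sketched here.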
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011487422.8A CN112508118B (zh) | 2020-12-16 | 2020-12-16 | Target object behavior prediction method for data offset and related device |
CN202011487422.8 | 2020-12-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022126961A1 (zh) | 2022-06-23 |
Family
ID=74972721
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/090162 WO2022126961A1 (zh) | Target object behavior prediction method for data offset and related device | 2020-12-16 | 2021-04-27 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112508118B (zh) |
WO (1) | WO2022126961A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
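CN115964942A (zh) * | 2022-12-19 | 2023-04-14 | 广东邦普循环科技有限公司 | Method and system for predicting aging of heating components in a power battery material firing system |
CN116961729A (zh) * | 2023-08-09 | 2023-10-27 | 深圳市恩斯仪器设备有限公司 | BeiDou-based voice communication method, system, device and medium |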
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
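CN112508118B (zh) * | 2020-12-16 | 2023-08-29 | 平安科技(深圳)有限公司 | Target object behavior prediction method for data offset and related device |
CN112990583B (zh) * | 2021-03-19 | 2023-07-25 | 中国平安人寿保险股份有限公司 | Method and device for determining model-entry features of a data prediction model |
CN113191806B (zh) * | 2021-04-30 | 2024-07-19 | 北京沃东天骏信息技术有限公司 | Method and apparatus for determining a traffic regulation target |
CN113240213B (zh) * | 2021-07-09 | 2021-10-08 | 平安科技(深圳)有限公司 | Personnel selection method, apparatus and device based on neural network and tree model |
CN114511022B (zh) * | 2022-01-24 | 2022-12-27 | 百度在线网络技术(北京)有限公司 | Feature screening, behavior recognition model training, and abnormal behavior recognition methods and apparatuses |
CN118173278A (zh) * | 2024-03-05 | 2024-06-11 | 中山大学附属第六医院 | Method and apparatus for mutual recognition of sex hormone test results, electronic device and storage medium |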
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180348717A1 (en) * | 2017-06-02 | 2018-12-06 | Aspen Technology, Inc. | Computer System And Method For Building And Deploying Predictive Inferential Models Online |
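CN109784373A (zh) * | 2018-12-17 | 2019-05-21 | 深圳魔数智擎科技有限公司 | Feature variable screening method, computer-readable storage medium and computer device |
CN111178639A (zh) * | 2019-12-31 | 2020-05-19 | 北京明略软件系统有限公司 | Method and apparatus for prediction based on multi-model fusion |
CN112508118A (zh) * | 2020-12-16 | 2021-03-16 | 平安科技(深圳)有限公司 | Target object behavior prediction method for data offset and related device |
CN112990583A (zh) * | 2021-03-19 | 2021-06-18 | 中国平安人寿保险股份有限公司 | Method and device for determining model-entry features of a data prediction model |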
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
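CN107633254A (zh) * | 2017-07-25 | 2018-01-26 | 平安科技(深圳)有限公司 | Apparatus and method for building a prediction model, and computer-readable storage medium |
CN109858970B (zh) * | 2019-02-02 | 2021-07-02 | 中国银行股份有限公司 | User behavior prediction method, apparatus and storage medium |
CN111080360B (zh) * | 2019-12-13 | 2023-12-01 | 中诚信征信有限公司 | Behavior prediction method, model training method, apparatus, server and storage medium |
CN112036483B (zh) * | 2020-08-31 | 2024-03-15 | 中国平安人寿保险股份有限公司 | AutoML-based object prediction and classification method, apparatus, computer device and storage medium |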
- 2020-12-16: CN application CN202011487422.8A, patent CN112508118B (zh), status: Active
- 2021-04-27: WO application PCT/CN2021/090162, patent WO2022126961A1 (zh), status: Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
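CN115964942A (zh) * | 2022-12-19 | 2023-04-14 | 广东邦普循环科技有限公司 | Method and system for predicting aging of heating components in a power battery material firing system |
CN115964942B (zh) * | 2022-12-19 | 2023-12-12 | 广东邦普循环科技有限公司 | Method and system for predicting aging of heating components in a power battery material firing system |
CN116961729A (zh) * | 2023-08-09 | 2023-10-27 | 深圳市恩斯仪器设备有限公司 | BeiDou-based voice communication method, system, device and medium |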
Also Published As
Publication number | Publication date |
---|---|
CN112508118B (zh) | 2023-08-29 |
CN112508118A (zh) | 2021-03-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
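WO2022126961A1 (zh) | Target object behavior prediction method for data offset and related device | |
CN112148987B (zh) | Message pushing method based on target object activity and related device | |
WO2022142001A1 (zh) | Target object evaluation method based on multi-scorecard fusion and related device | |
CN112990583B (zh) | Method and device for determining model-entry features of a data prediction model | |
CN112287244A (zh) | Federated-learning-based product recommendation method, apparatus, computer device and medium | |
CN112035549B (zh) | Data mining method, apparatus, computer device and storage medium | |
US20220261591A1 (en) | Data processing method and apparatus | |
CN112036483B (zh) | AutoML-based object prediction and classification method, apparatus, computer device and storage medium | |
CN112085087B (zh) | Business rule generation method, apparatus, computer device and storage medium | |
CN112182118B (zh) | Target object prediction method based on multiple data sources and related device | |
CN112487021A (zh) | Association analysis method, apparatus and device for business data | |
CN114219664B (zh) | Product recommendation method, apparatus, computer device and storage medium | |
CN113298360B (zh) | Risk control method, apparatus and system for resource allocation | |
CN110197078B (zh) | Data processing method and apparatus, computer-readable medium and electronic device | |
CN112199374B (zh) | Data feature mining method for missing data and related device | |
CN113791909A (zh) | Server capacity adjustment method, apparatus, computer device and storage medium | |
CN115099875A (zh) | Data classification method based on decision tree model and related device | |
CN114925275A (zh) | Product recommendation method, apparatus, computer device and storage medium | |
CN113781247A (zh) | Protocol data recommendation method, apparatus, computer device and storage medium | |
CN111784377B (zh) | Method and apparatus for generating information | |
CN113936677A (zh) | Timbre conversion method, apparatus, computer device and storage medium | |
CN113760550A (zh) | Resource allocation method and resource allocation apparatus | |
US20230009941A1 (en) | Method of processing data for target model, electronic device, and storage medium | |
CN111327513B (zh) | Message data pushing method, apparatus, computer device and storage medium | |
CN115293266A (zh) | Credit rating method and apparatus, electronic device and computer-readable storage medium |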
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21904900; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 21904900; Country of ref document: EP; Kind code of ref document: A1 |