WO2021196843A1 - Derived variable selection method and apparatus for risk identification model

Info

Publication number: WO2021196843A1
Application number: PCT/CN2021/073963
Authority: WO
Inventors: 付大鹏; 赵闻飙
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2020-03-31
Filing date: 2021-01-27
Publication date: 2021-10-07
Also published as: CN111461892A; CN111461892B

Abstract

A derived variable selection method and apparatus for a risk identification model, an electronic device, and a storage medium, relating to the field of risk identification. The method comprises: determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of a derived variable generated by a seed pool thereof (S11), wherein the updated seed pool comprises a parent sample set of N derived variables having optimal quality generated by the seed pool before updating; then according to a first parent sample accumulation variable set and a second parent sample accumulation variable set, determining a target derived variable set in the variation direction of the derived variable having the optimal quality (S13); and outputting the derived variables in the target derived variable set as the sample features of the risk identification model (S17).

Description

Derivative variable selection method and device for risk identification model

Technical field

The embodiments of this specification relate to the technical field of risk identification, and in particular, to a method and device for selecting derivative variables for a risk identification model.

Background

Identifying risk features is a necessary function for protecting users' interests in many current wealth management applications, electronic payment applications, and other highly risk-sensitive scenarios. In these scenarios, risk control over user transactions and accounts is highly adversarial and must address various types of risk, such as embezzlement, fraud, cash-out, cheating, and money laundering. Groups and individuals such as black-industry gangs and "wool parties" study the existing risk control system and bypass its risk identification in order to embezzle money or conduct illegal transactions. The reason is that the risk features of the training samples in the sample database used to train the risk identification model are insufficient in number and diversity.

The existing way to increase the number and diversity of risk features is to derive risk features by brute force using exhaustive methods and then screen the features against preset screening conditions (for example, feature importance greater than a preset threshold). This requires substantial computing resources and time, and the quality of the resulting risk feature set is low.

Summary of the invention

The purpose of the embodiments of this specification is to provide a method and device for selecting a derivative variable for a risk identification model, so as to improve the selection efficiency and quality of the risk feature set.

In a first aspect, the embodiments of this specification provide a method for selecting derived variables for a risk identification model, including: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set meets the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parents in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

In a second aspect, the embodiments of this specification also provide a derived variable selection device for a risk identification model, including: a seed pool determining module, which determines an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; a derived variable determining module, which determines, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and an information output module, which, if the target derived variable set meets the quality convergence condition of the derived variables, outputs the derived variables in the target derived variable set as sample features of the risk identification model. The derived variable determining module specifically selects, according to the first parent cumulative variable set and through a derived variable parent matching model, from the second parent cumulative variable set the M second parents matched to the first parents in the first parent cumulative variable set, so as to generate a candidate derived variable set, and selects the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

In a third aspect, the embodiments of this specification also provide an electronic device, including: a memory on which a computer program is stored; and a processor used to execute the computer program in the memory to implement: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set meets the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parents in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

In a fourth aspect, the embodiments of this specification also provide a storage medium on which a computer program is stored. When executed by a processor, the program implements: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set meets the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parents in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

At least one of the technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects. The updated seed pool of the target genetic algorithm model is determined according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update. Then, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set is determined in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model. If the target derived variable set meets the quality convergence condition of the derived variables, the derived variables in the target derived variable set are output as sample features of the risk identification model. As a result, the sample features of the risk identification model can be generated directly by the model, saving substantial computing resources and time. Furthermore, because the seed pool is continuously optimized and the variation direction with the best derived variable quality is continuously refined, the final risk feature set is of high quality.

Description of the drawings

The drawings described here are used to provide a further understanding of the embodiments of this specification and constitute a part of the embodiments of this specification. They do not constitute an improper limitation of the embodiments. In the drawings:

FIG. 1 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;

FIG. 2 is a schematic diagram of interaction between a service terminal and an electronic device according to an embodiment of this specification;

FIG. 3 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;

FIG. 4 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;

FIG. 5 is a block diagram of functional modules of a device for selecting derived variables for a risk identification model provided by an embodiment of this specification;

FIG. 6 is a block diagram of functional modules of a device for selecting derived variables for a risk identification model provided by an embodiment of this specification;

FIG. 7 is a circuit connection block diagram of an electronic device provided by an embodiment of this specification.

Detailed description

In order to make the purpose, technical solutions, and advantages of the embodiments of this specification clearer, the technical solutions are described clearly and completely below in conjunction with specific embodiments and the corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, rather than all of them. Based on the embodiments of this specification, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the embodiments of this specification.

The following describes in detail the technical solutions provided by the embodiments of this specification with reference to the accompanying drawings.

Referring to FIG. 1, an embodiment of this specification provides a method for selecting derived variables for a risk identification model, which is applied to an electronic device 100. The electronic device 100 can be, but is not limited to, a server. As shown in FIG. 2, the electronic device 100 is in communication connection with a service terminal 200 for data interaction. The service terminal 200 is installed with risk-sensitive applications related to financial management, electronic payment, and the like. When a user conducts a transaction at the service terminal 200, the specific operation content of the transaction can be sent to the electronic device 100 and added to the seed pool. The method includes S11 to S17.

S11: Determine the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool.

Here, the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business. The updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update. It should be understood that N is a positive integer. For example, if a total of 10,000 derived variables are generated, the parent set of the top N = 1,000 derived variables is selected to construct the updated seed pool. As another example, if a total of 10,000 derived variables are generated and the full quality score is 100, the parent set of the N derived variables whose quality score is greater than 70 is used to construct the updated seed pool. In addition, the target business can be a business that is highly sensitive to risk, such as a payment business or a money transfer business.
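The top-N selection in S11 can be illustrated with a minimal sketch. It assumes each derived variable is identified by the pair of cumulative variables it was derived from and already has a quality score; the function name `update_seed_pool` and the sample scores are hypothetical and are not part of this specification.

```python
from typing import Dict, Set, Tuple

# Hypothetical key: (first parent, second parent) of one derived variable.
DerivedKey = Tuple[str, str]


def update_seed_pool(quality: Dict[DerivedKey, float], n: int) -> Set[str]:
    """Return the parents of the n best-quality derived variables."""
    # Rank derived variables by quality, best first, and keep the top n.
    best = sorted(quality, key=quality.get, reverse=True)[:n]
    pool: Set[str] = set()
    for first_parent, second_parent in best:
        # Both parents of a selected derived variable enter the new pool.
        pool.update((first_parent, second_parent))
    return pool


# Illustrative quality scores (assumed values).
scores = {("A1", "B1"): 0.9, ("A2", "B1"): 0.4, ("A3", "B2"): 0.7}
seed_pool = update_seed_pool(scores, n=2)
```

With these assumed scores, the two best derived variables are (A1, B1) and (A3, B2), so the updated seed pool holds their four parents.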

Optionally, the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is equal to the number of cumulative variables in the seed pool before the target genetic algorithm model is updated. For example, the seed pool before the update has 1000 cumulative variables, and the seed pool after the update still has 1000 cumulative variables.

Or, optionally, the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is smaller than the number of cumulative variables in the seed pool before the target genetic algorithm model is updated. For example, the seed pool before the update has 1000 cumulative variables, and the seed pool after the update can be 500 cumulative variables, or 200 cumulative variables, or other integer values less than 1000.

Specifically, the process of generating derived variables according to the target genetic algorithm model and its seed pool may include: taking the cumulative variable set of the target business as the initial seed pool of the target genetic algorithm model; taking the preset derivation strategy as the crossover operation of the target genetic algorithm model; taking the derived variables as the children of the target genetic algorithm model; taking the quality of a derived variable as the fitness of the child in the target genetic algorithm model; taking, as the mutation operation, the selection from the generated derived variable set of the parent cumulative variables of the derived variables whose quality is greater than a preset threshold, in order to construct the updated seed pool; taking the generation of a new derived variable set based on the updated seed pool as the iterative operation of the target genetic algorithm model; and taking the condition that the quality difference between the derived variable sets obtained by two adjacent iterations is less than a preset threshold as the convergence condition of the target genetic algorithm model.

Through continuous iteration and updating, the overall quality of the seed pool can be continuously improved. For example, seeds with a quality score greater than the preset threshold account for 20% of the initial seed pool, 40% after the next update, and 55% after the update following that. In this way, the quality of the seed pool is gradually improved.
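The mapping above can be sketched as a small loop. This is only an illustration under stated assumptions: crossover is modeled as forming every ordered pair of seeds, convergence is measured on the mean quality of a generation, and `quality_fn`, `keep_threshold`, `tol`, and `max_iters` are hypothetical parameters that the specification does not define.

```python
import itertools
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def iterate_seed_pool(
    initial_pool: Iterable[str],
    quality_fn: Callable[[Pair], float],
    keep_threshold: float,
    tol: float,
    max_iters: int = 20,
) -> Tuple[List[str], List[Pair]]:
    """Iterate seed-pool updates until generation quality stops changing."""
    pool = sorted(initial_pool)
    derived: List[Pair] = []
    prev_mean = None
    for _ in range(max_iters):
        # Crossover (assumed): every ordered pair of seeds yields one child.
        derived = list(itertools.permutations(pool, 2))
        if not derived:
            break
        # Fitness: the quality of each derived variable.
        scores = {pair: quality_fn(pair) for pair in derived}
        mean_quality = sum(scores.values()) / len(scores)
        # Convergence: quality difference of adjacent iterations below tol.
        if prev_mean is not None and abs(mean_quality - prev_mean) < tol:
            break
        prev_mean = mean_quality
        # Seed pool update: keep parents of children above the threshold.
        kept = {p for pair, s in scores.items() if s > keep_threshold for p in pair}
        if len(kept) < 2:
            break
        pool = sorted(kept)
    return pool, derived
```

A caller would supply its own `quality_fn`, for example a feature-importance score computed against the risk identification model.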

The structure of a cumulative variable can include, but is not limited to, five dimensions: subject + object + function + time window + condition. For example, for the cumulative variable "the number of times the user performs X operations in T days", the subject is the user ID, the object is the operation event ID, the function is count, the time window is T days, and the condition is operation type = X. Specifically, the cumulative variable of the target business may be the number of times the user performs the target business within a set time, for example, the number of times the user performs the transfer business within 3 days, or the number of times the user performs the transfer business within 1 month. Understandably, cumulative variables have good discriminative power and business interpretability for risk identification.
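The five-dimension structure can be written down as a small record type. This is a sketch only: the class name, field names, and the use of 30 days to stand for "1 month" are assumptions made for illustration.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class CumulativeVariable:
    """One cumulative variable: subject + object + function + window + condition."""

    subject: str      # e.g. the user ID
    obj: str          # e.g. the operation event ID
    function: str     # e.g. "count"
    window_days: int  # time window, in days (assumed unit)
    condition: str    # e.g. "operation_type=transfer"

    def differing_dimensions(self, other: "CumulativeVariable") -> int:
        """Count how many of the five dimensions differ between two variables."""
        return sum(a != b for a, b in zip(astuple(self), astuple(other)))


# Transfers in 3 days versus transfers in 1 month (assumed to be 30 days).
transfers_3d = CumulativeVariable(
    "user_id", "operation_event_id", "count", 3, "operation_type=transfer"
)
transfers_30d = CumulativeVariable(
    "user_id", "operation_event_id", "count", 30, "operation_type=transfer"
)
```

Here `transfers_3d.differing_dimensions(transfers_30d)` is 1, because only the time window differs.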

Derived variables are derived from at least two cumulative variables. For example, an arithmetic operation is applied to two cumulative variables that differ in one dimension (such as the time dimension), for example the number of times a user performs the transfer business within 1 month divided by the number of times the user performs the transfer business within 3 days, to generate a derived variable. Understandably, derived variables also have good discriminative power and business interpretability for risk identification. Of course, the operation is not limited to division and can also be multiplication, addition, subtraction, and so on, depending on actual needs.
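The division example can be sketched as follows. The event layout `(user_id, operation_type, days_ago)`, the function names, and the choice to return `None` when the denominator is zero are all assumptions; the specification does not say how a zero denominator is handled.

```python
from typing import List, Optional, Tuple

# Hypothetical event record: (user_id, operation_type, days_ago).
Event = Tuple[str, str, int]


def count_ops(events: List[Event], user: str, op_type: str, window_days: int) -> int:
    """Cumulative variable: number of op_type operations by user within the window."""
    return sum(
        1 for u, op, age in events
        if u == user and op == op_type and age <= window_days
    )


def ratio_variable(
    events: List[Event], user: str, op_type: str, num_window: int, den_window: int
) -> Optional[float]:
    """Derived variable: count over num_window divided by count over den_window."""
    denominator = count_ops(events, user, op_type, den_window)
    if denominator == 0:
        return None  # Assumed handling: undefined when there is no recent activity.
    return count_ops(events, user, op_type, num_window) / denominator


events = [
    ("u1", "transfer", 1),
    ("u1", "transfer", 2),
    ("u1", "transfer", 10),
    ("u1", "transfer", 20),
    ("u2", "transfer", 1),
]
monthly_over_3d = ratio_variable(events, "u1", "transfer", num_window=30, den_window=3)
```

For user u1 there are four transfers within 30 days and two within 3 days, so the derived variable evaluates to 2.0.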

S13: According to the first parent cumulative variable set and the second parent cumulative variable set, determine the target derived variable set in the variation direction in which the quality of the derived variables is best.

Here, the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model. That is, the updated seed pool is divided into the first parent cumulative variable set and the second parent cumulative variable set; assuming that the derivation strategy is division, each cumulative variable in the first parent cumulative variable set is taken as the denominator, and each cumulative variable in the second parent cumulative variable set is taken as the numerator. Specifically, as shown in FIG. 3, S13 includes:

S31: According to the first parent cumulative variable set, select from the second parent cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first parent cumulative variable set, so as to generate a candidate derived variable set.

Among them, M is a positive integer.

S33: Select N derivative variables with the best quality from the candidate derivative variable set as the target derivative variable set.

For example, if the first parent cumulative variable set includes 10 cumulative variables A1 to A10, and the second parent cumulative variable set includes 10 cumulative variables B1 to B10, then A1 to A10 are traversed. For each traversed first parent, one of the second parents B1 to B10 is matched through the derived variable parent matching model, until all the first parents have been matched with second parents.

S15: Determine whether the target derived variable set satisfies the quality convergence condition of the derived variables. If so, execute S17; optionally, if not, return to S11.
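S31 and S33 can be sketched together as below. Note the substitution: the derived variable parent matching model is replaced here by a plain greedy ranking on a caller-supplied `quality_fn`, which is an assumption made only to keep the example self-contained. All names are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def build_target_set(
    first_parents: Sequence[str],
    second_parents: Sequence[str],
    quality_fn: Callable[[str, str], float],
    m: int,
    n: int,
) -> List[Pair]:
    """S31: match m second parents per first parent; S33: keep the n best pairs."""
    candidates: List[Pair] = []
    for a in first_parents:
        # Greedy stand-in for the matching model: the m best partners of a.
        matched = heapq.nlargest(m, second_parents, key=lambda b: quality_fn(a, b))
        candidates.extend((a, b) for b in matched)
    # S33: the n best-quality candidates form the target derived variable set.
    return heapq.nlargest(n, candidates, key=lambda pair: quality_fn(*pair))
```

Called with `m=1`, each first parent contributes exactly one candidate, which corresponds to the A1 to A10 traversal described above.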

S17: Output the derivative variables in the target derivative variable set as the sample characteristics of the risk identification model.

When the target derived variable set meets the quality convergence condition of the derived variable, it means that the derived variable paternal matching model has stable output, so iterative training is no longer required.

The method for selecting derived variables for a risk identification model determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update. Then, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set is determined in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model. If the target derived variable set meets the quality convergence condition of the derived variables, the derived variables in the target derived variable set are output as sample features of the risk identification model. As a result, the sample features of the risk identification model can be generated directly by the model, saving substantial computing resources and time. Furthermore, because the seed pool is continuously optimized and the variation direction with the best derived variable quality is continuously refined, the final risk feature set is of high quality.

Optionally, the derived variable paternal matching model is a reinforcement learning model, as shown in Figure 4, S12 specifically includes:

S41: Use a first parent in the first parent cumulative variable set as the state of the reinforcement learning model, use the probability distribution over the selection of the second parent matched to the first parent as the policy to be optimized by the reinforcement learning model, use the selection of the second parent as the action of the reinforcement learning model, and use the quality of the derived variable determined by the first parent and the second parent as the reward of the reinforcement learning model. The reinforcement learning model is trained to obtain the second parent corresponding to each first parent in the first parent cumulative variable set.

S43: Determine the candidate derived variable set based on each first parent in the first parent cumulative variable set and its corresponding second parent.
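The state, action, policy, and reward roles in S41 can be illustrated with a softmax gradient-bandit learner. This specific algorithm is a substitution chosen for brevity; the specification only states that the matching model is a reinforcement learning model. The function name, `quality_fn`, learning rate, and episode count are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def train_matcher(
    first_parents: Sequence[str],
    second_parents: Sequence[str],
    quality_fn: Callable[[str, str], float],
    episodes: int = 2000,
    lr: float = 0.5,
    seed: int = 0,
) -> Dict[str, str]:
    """Learn, per first parent (state), which second parent (action) to pick."""
    rng = random.Random(seed)
    k = len(second_parents)
    prefs: Dict[str, List[float]] = {a: [0.0] * k for a in first_parents}
    baseline = {a: 0.0 for a in first_parents}
    visits = {a: 0 for a in first_parents}
    for _ in range(episodes):
        state = rng.choice(list(first_parents))
        # Policy: softmax over the preferences for this state.
        top = max(prefs[state])
        exps = [math.exp(p - top) for p in prefs[state]]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Action: sample a second parent from the policy.
        action = rng.choices(range(k), weights=probs)[0]
        # Reward: quality of the derived variable built from the pair.
        reward = quality_fn(state, second_parents[action])
        visits[state] += 1
        baseline[state] += (reward - baseline[state]) / visits[state]
        advantage = reward - baseline[state]
        for j in range(k):
            grad = (1.0 if j == action else 0.0) - probs[j]
            prefs[state][j] += lr * advantage * grad
    # S43 input: the preferred second parent for each first parent.
    return {
        a: second_parents[max(range(k), key=prefs[a].__getitem__)]
        for a in first_parents
    }
```

The returned mapping pairs each first parent with one second parent; the candidate derived variable set of S43 is then built from those pairs.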

Optionally, S13 may specifically include: if the target derived variable set obtained based on the updated seed pool satisfies the quality convergence condition of the derived variables relative to the target derived variable set obtained based on the seed pool before the update, outputting the derived variables in the target derived variable set as sample features of the risk identification model.

For example, it is determined whether the difference between the quality of the target derived variable set obtained from the updated seed pool and the quality of the target derived variable set obtained from the seed pool before the update is within a preset threshold range.
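A minimal sketch of such a convergence check follows. Summarizing the quality of a set by the mean of its scores, and the names `has_converged` and `tol`, are assumptions; the specification does not fix how set-level quality is aggregated.

```python
from statistics import fmean
from typing import Sequence


def has_converged(
    previous_scores: Sequence[float], current_scores: Sequence[float], tol: float
) -> bool:
    """True when set-level quality changed by at most tol between two iterations."""
    # Assumed aggregate: the mean quality score of each target set.
    return abs(fmean(current_scores) - fmean(previous_scores)) <= tol
```

If the check returns False, the flow would return to S11 and update the seed pool again.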

Optionally, the target genetic algorithm model uses a cumulative variable set randomly selected from the cumulative variable set of the target business as the initial seed pool.

Optionally, the updated seed pool does not include cumulative variables of the seed pool before the update other than the parent set.

Specifically, the updated seed pool may include the parent set of the generated N best-quality derived variables and cumulative variables randomly selected from the cumulative variable set of the target business. The proportion of the parent set of the generated N best-quality derived variables in the updated seed pool is greater than or equal to the proportion of the cumulative variables randomly selected from the cumulative variable set of the target business.
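The composition rule above can be sketched as follows, assuming `elite_parents` is already ordered best first. The function name, `pool_size`, and `elite_share` are hypothetical; the only constraint taken from the text is that the parent set is at least as large a share of the pool as the randomly selected variables.

```python
import math
import random
from typing import List, Optional, Sequence


def refresh_seed_pool(
    elite_parents: Sequence[str],
    all_variables: Sequence[str],
    pool_size: int,
    elite_share: float = 0.5,
    seed: Optional[int] = None,
) -> List[str]:
    """Mix the best parents with randomly drawn cumulative variables."""
    if not 0.5 <= elite_share <= 1.0:
        raise ValueError("elite_share must be in [0.5, 1.0]")
    rng = random.Random(seed)
    n_elite = min(len(elite_parents), math.ceil(pool_size * elite_share))
    elite = list(elite_parents[:n_elite])
    chosen = set(elite)
    remaining = [v for v in all_variables if v not in chosen]
    # Never draw more random variables than elite ones, so the parent set
    # keeps a share greater than or equal to the random share.
    n_random = min(pool_size - n_elite, n_elite, len(remaining))
    return elite + rng.sample(remaining, n_random)
```

Setting `elite_share=1.0` yields a pool made only of the parent set, which corresponds to the optional case in which no other cumulative variables are carried over.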

Referring to FIG. 5, an embodiment of this specification also provides a derived variable selection device 500 for a risk identification model, which is applied to an electronic device 100. The electronic device 100 may be, but is not limited to, a server. As shown in FIG. 2, the electronic device 100 is in communication connection with the service terminal 200 for data interaction. The service terminal 200 is installed with risk-sensitive applications related to financial management, electronic payment, and the like. When a user conducts a transaction at the service terminal 200, the specific operation content of the transaction can be sent to the electronic device 100 and added to the seed pool. It should be noted that the basic principles and technical effects of the device 500 provided in the embodiments of this specification are the same as those of the above embodiments. For brevity, for anything not mentioned in this embodiment, refer to the corresponding content in the above embodiments. The device 500 includes a seed pool determining module 501, a derived variable determining module 502, and an information output module 503.

The seed pool determining module 501 determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.

Optionally, the target genetic algorithm model uses a cumulative variable set randomly selected from the cumulative variable set of the target business as the initial seed pool. In addition, the updated seed pool does not include cumulative variables of the seed pool before the update other than the parent set.

The derived variable determining module 502 determines, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set in the variation direction in which the quality of the derived variables is best, where the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model.

Optionally, both the first parent and the second parent include multiple dimensions, and the dimension value of one dimension is different between the first parent and the second parent.

The information output module 503, if the target derivative variable set satisfies the quality convergence condition of the derivative variable, output the derivative variable in the target derivative variable set as a sample feature of the risk identification model. in,

The derivative variable determination module 502, specifically based on the first paternal cumulative variable set, selects the M matched by the first paternal parent in the first paternal cumulative variable set in the second paternal cumulative variable set through the derived variable paternal matching model A second parent to generate a set of candidate derived variables; N derived variables with the best quality are selected from the set of candidate derived variables as the target derived variable set.

The device 500 for selecting derived variables for risk identification models can realize the following functions when executed: by determining the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, and updating The latter seed pool includes the paternal set of N best-quality derived variables generated by the seed pool before the update; then the cumulative variable set of the first paternal parent and the cumulative variable set of the second paternal parent are used to derive the variation with the best quality of the variable The direction determines the target derived variable set, where the first paternal cumulative variable set and the second paternal cumulative variable set are the derived variable parents selected based on the updated seed pool of the target genetic algorithm model; if the target derived variable set satisfies the derived variable The quality convergence condition of the output target derivative variable set is used as the sample feature of the risk identification model. In the end, the sample characteristics of the risk identification model can be generated directly through the model, saving a lot of computing resources and time costs. Furthermore, through continuous optimization of the seed pool and continuous optimization of the mutation direction to determine the best quality of the derived variables, The final risk feature set is of high quality.

Optionally, the derived variable parent matching model is a reinforcement learning model. The derived variable determination module takes a first parent in the first parent cumulative variable set as the state of the reinforcement learning model, takes the probability distribution over the choice of the second parent matched with the first parent as the optimal policy of the reinforcement learning model, takes the choice of the second parent as the action of the reinforcement learning model, and takes the quality of the derived variable determined by the first parent and the second parent as the feedback reward of the reinforcement learning model; it trains the reinforcement learning model to obtain the second parent corresponding to each first parent in the third cumulative variable set, and determines the candidate derived variable set based on each first parent in the first cumulative variable set and the corresponding second parent.
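This specification does not fix a particular reinforcement learning algorithm. Purely as an assumed illustration, the state, action, policy and reward roles described above can be sketched with a softmax-preference (gradient bandit) update, in which each first parent keeps its own probability distribution over second parents. All names (`train_matcher`, `top_m`, `quality`) and the hyperparameters are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List


def softmax(prefs: List[float]) -> List[float]:
    top = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_matcher(
    first_parents: List[str],
    second_parents: List[str],
    quality: Callable[[str, str], float],
    episodes: int = 3000,
    lr: float = 0.5,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Learn, per first parent (state), a distribution over second parents (actions)."""
    rng = random.Random(seed)
    prefs = {s: [0.0] * len(second_parents) for s in first_parents}
    baseline = {s: 0.0 for s in first_parents}  # running mean reward per state
    visits = {s: 0 for s in first_parents}
    for _ in range(episodes):
        state = rng.choice(first_parents)
        probs = softmax(prefs[state])
        action = rng.choices(range(len(second_parents)), weights=probs)[0]
        reward = quality(state, second_parents[action])  # feedback reward
        visits[state] += 1
        baseline[state] += (reward - baseline[state]) / visits[state]
        advantage = reward - baseline[state]
        for i, p in enumerate(probs):
            grad = (1.0 - p) if i == action else -p
            prefs[state][i] += lr * advantage * grad
    return {s: softmax(prefs[s]) for s in first_parents}


def top_m(policy: Dict[str, List[float]], second_parents: List[str],
          state: str, m: int) -> List[str]:
    """Return the M second parents with the highest learned probability for a state."""
    ranked = sorted(zip(policy[state], second_parents), reverse=True)
    return [name for _, name in ranked[:m]]
```

After training, `top_m` yields the M second parents matched with a given first parent; pairing each first parent with these matches gives the candidate derived variable set.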

Optionally, the information output module 503 is configured to: if the target derived variable set obtained based on the updated seed pool satisfies the quality convergence condition of the derived variables relative to the target derived variable set obtained based on the seed pool before the update, output the derived variables in the target derived variable set as the sample features of the risk identification model.

Optionally, as shown in FIG. 6, the device 500 further includes a process returning module 504 configured to: if the target derived variable set does not meet the convergence condition, return to the step of determining the updated seed pool of the target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool.
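The interplay between the output condition and the return step can be sketched, as an assumption-laden example, by the loop below. The convergence test used here (the change in a set-level quality score between successive rounds falls within a tolerance `tol`) is only one possible reading of the quality convergence condition; `step`, `set_quality`, `tol` and `max_rounds` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Pool = TypeVar("Pool")
Target = TypeVar("Target")


def run_until_converged(
    seed_pool: Pool,
    step: Callable[[Pool], Tuple[Pool, Target]],
    set_quality: Callable[[Target], float],
    tol: float = 1e-3,
    max_rounds: int = 50,
) -> Tuple[Optional[Target], int]:
    """Repeat 'update seed pool, derive target set' until the set quality stops changing."""
    previous: Optional[float] = None
    target: Optional[Target] = None
    for round_no in range(1, max_rounds + 1):
        # One round: update the seed pool and derive the target set from it.
        seed_pool, target = step(seed_pool)
        current = set_quality(target)
        # Compare with the target set obtained from the seed pool before the update.
        if previous is not None and abs(current - previous) <= tol:
            return target, round_no  # converged: output the target set
        previous = current  # not converged: return to the seed pool update step
    return target, max_rounds
```

The `max_rounds` cap is an added safeguard for the sketch and is not described in this specification.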

It should be noted that the execution subject of each step of the method provided in Embodiment 1 may be the same device, or the method may be executed by different devices. For example, the execution subject of step 21 and step 22 can be device 1, and the execution subject of step 23 can be device 2. For another example, the execution subject of step 21 can be device 1, and the execution subject of step 22 and step 23 can be device 2; and so on.

The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than in the embodiments and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order or sequential order shown in order to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of this specification. Referring to FIG. 7, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. Of course, the electronic device may also include hardware required by other services.

The processor, network interface, and memory can be connected to each other through an internal bus. The internal bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used in FIG. 7, but this does not mean that there is only one bus or one type of bus.

The memory is used to store programs. Specifically, a program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming, at the logical level, the derived variable selection device for the risk identification model. The processor executes the program stored in the memory, and is specifically configured to perform the following operations: determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model. Determining the target derived variable set in the variation direction in which the quality of the derived variables is best, according to the first parent cumulative variable set and the second parent cumulative variable set, includes: based on the first parent cumulative variable set, selecting from the second parent cumulative variable set, through the derived variable parent matching model, the M second parents matched with the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

The method performed by the derived variable selection device for the risk identification model disclosed in the embodiment shown in FIG. 1 of this specification may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in the embodiments of this specification can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium that is mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

The electronic device can also execute the method in FIG. 1 and realize the functions of the derived variable selection device for the risk identification model in the embodiment shown in FIG. 1, which are not repeated here.

Of course, in addition to the software implementation, the electronic device in the embodiments of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. That is to say, the execution subject of the processing flow is not limited to individual logic units; it may also be hardware or a logic device.

The embodiments of this specification also propose a computer-readable storage medium that stores one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to execute the method of the embodiment shown in FIG. 1, and specifically to perform the following operations: determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model. Determining the target derived variable set in the variation direction in which the quality of the derived variables is best, according to the first parent cumulative variable set and the second parent cumulative variable set, includes: based on the first parent cumulative variable set, selecting from the second parent cumulative variable set, through the derived variable parent matching model, the M second parents matched with the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.

In short, the above descriptions are only preferred embodiments of this specification and are not intended to limit the protection scope of the embodiments of this specification. Any modification, equivalent replacement, or improvement made within the spirit and principle of the embodiments of this specification shall fall within the protection scope of the embodiments of this specification.

The systems, devices, modules, or units illustrated in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements, but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.

The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, as for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

Claims

A method for selecting derived variables for a risk identification model, comprising:

determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update;

determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and

if the target derived variable set satisfies a quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model;

wherein determining, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set in the variation direction in which the quality of the derived variables is best comprises:

selecting, based on the first parent cumulative variable set and through a derived variable parent matching model, from the second parent cumulative variable set the M second parents matched with the first parent in the first parent cumulative variable set, to generate a candidate derived variable set; and

selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.
The method according to claim 1, wherein the derived variable parent matching model is a reinforcement learning model, and selecting, based on the first parent cumulative variable set and through the derived variable parent matching model, from the second parent cumulative variable set the M second parents matched with the first parent in the first parent cumulative variable set, to generate a candidate derived variable set, comprises:

taking a first parent in the first parent cumulative variable set as the state of the reinforcement learning model, taking the probability distribution over the choice of the second parent matched with the first parent as the optimal policy of the reinforcement learning model, taking the choice of the second parent as the action of the reinforcement learning model, and taking the quality of the derived variable determined by the first parent and the second parent as the feedback reward of the reinforcement learning model, and training the reinforcement learning model to obtain the second parent corresponding to each first parent in the third cumulative variable set; and

determining the candidate derived variable set based on each first parent in the first cumulative variable set and the corresponding second parent.
The method according to claim 1 or 2, wherein, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model comprises:

if the target derived variable set obtained based on the updated seed pool satisfies the quality convergence condition of the derived variables relative to the target derived variable set obtained based on the seed pool before the update, outputting the derived variables in the target derived variable set as sample features of the risk identification model.
The method according to claim 1 or 2, wherein the target genetic algorithm model uses a cumulative variable set randomly selected from the cumulative variable set of the target business as the initial seed pool.
The method according to claim 1 or 2, wherein the updated seed pool does not include cumulative variables of the seed pool before the update other than those in the parent set.
The method according to claim 1 or 2, wherein each of the first parent and the second parent includes multiple dimensions, and the value of one dimension differs between the first parent and the second parent.
The method according to claim 1 or 2, further comprising: if the target derived variable set does not meet the convergence condition, returning to the step of determining the updated seed pool of the target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool.
The method according to claim 1 or 2, wherein:

the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is equal to the number of cumulative variables in the seed pool before the target genetic algorithm model is updated; or

the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is smaller than the number of cumulative variables in the seed pool before the target genetic algorithm model is updated.
A device for selecting derived variables for a risk identification model, comprising:

a seed pool determining module, configured to determine an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update;

a derived variable determining module, configured to determine, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and

an information output module, configured to output, if the target derived variable set satisfies a quality convergence condition of the derived variables, the derived variables in the target derived variable set as sample features of the risk identification model;

wherein the derived variable determining module is specifically configured to: based on the first parent cumulative variable set, select from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched with the first parent in the first parent cumulative variable set, to generate a candidate derived variable set; and select the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.
An electronic device, comprising:

a memory on which a computer program is stored; and

a processor configured to execute the computer program in the memory to implement:

determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update;

determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and

if the target derived variable set satisfies a quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model;

wherein determining, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set in the variation direction in which the quality of the derived variables is best comprises:

selecting, based on the first parent cumulative variable set and through a derived variable parent matching model, from the second parent cumulative variable set the M second parents matched with the first parent in the first parent cumulative variable set, to generate a candidate derived variable set; and

selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.
A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements:

determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of the derived variables generated by its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update;

determining, according to a first parent cumulative variable set and a second parent cumulative variable set, a target derived variable set in the variation direction in which the quality of the derived variables is best, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model; and

if the target derived variable set satisfies a quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model;

wherein determining, according to the first parent cumulative variable set and the second parent cumulative variable set, the target derived variable set in the variation direction in which the quality of the derived variables is best comprises:

selecting, based on the first parent cumulative variable set and through a derived variable parent matching model, from the second parent cumulative variable set the M second parents matched with the first parent in the first parent cumulative variable set, to generate a candidate derived variable set; and

selecting the N derived variables with the best quality from the candidate derived variable set as the target derived variable set.