WO2021196843A1 - Derived variable selection method and apparatus for risk identification model

Publication number
WO2021196843A1
WO2021196843A1 PCT/CN2021/073963 CN2021073963W WO2021196843A1 WO 2021196843 A1 WO2021196843 A1 WO 2021196843A1 CN 2021073963 W CN2021073963 W CN 2021073963W WO 2021196843 A1 WO2021196843 A1 WO 2021196843A1
Authority
WO
WIPO (PCT)
Prior art keywords
paternal
variable
variable set
derived
cumulative
Application number
PCT/CN2021/073963
Other languages
French (fr)
Chinese (zh)
Inventor
付大鹏
赵闻飙
Original Assignee
支付宝(杭州)信息技术有限公司
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021196843A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing

Definitions

  • The embodiments of this specification relate to the technical field of risk identification, and in particular to a method and device for selecting derived variables for a risk identification model.
  • Identifying risk features is a necessary function for protecting users' interests in many current wealth management applications, electronic payment applications, and other highly risk-sensitive scenarios.
  • Risk control for user transactions and accounts is highly adversarial. It faces many types of risk, such as theft, fraud, cash-out, cheating, and money laundering, from groups and individuals such as black-market gangs and "wool parties".
  • Such actors try to bypass the various risk identification measures of the risk control system in order to steal funds or carry out illegal transactions. The reason they can do so is that the number and diversity of the risk features of the training samples in the sample database used to train the risk identification model are insufficient.
  • The existing way to increase the number and diversity of risk features is to derive risk features by brute-force exhaustive enumeration and then screen them against preset conditions (for example, feature importance greater than a preset threshold). This consumes substantial computing resources and time, and the quality of the resulting risk feature set is low.
  • The purpose of the embodiments of this specification is to provide a method and device for selecting derived variables for a risk identification model, so as to improve both the efficiency of selection and the quality of the risk feature set.
  • The embodiments of this specification provide a method for selecting derived variables for a risk identification model, including:
  • determining the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update;
  • determining the target derived variable set, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where the first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model;
  • if the target derived variable set meets the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model.
  • Determining the target derived variable set in the variation direction that gives the best derived variable quality includes: according to the first paternal cumulative variable set, selecting from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • The embodiments of this specification also provide a derived variable selection device for a risk identification model, including a seed pool determining module, a derived variable determining module, and an information output module.
  • The seed pool determining module determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • The derived variable determining module determines the target derived variable set, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where the first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model.
  • The information output module outputs the derived variables in the target derived variable set as sample features of the risk identification model if the target derived variable set meets the quality convergence condition of the derived variables.
  • Specifically, the derived variable determining module, according to the first paternal cumulative variable set, selects from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set, and selects the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • The embodiments of this specification also provide an electronic device, including: a memory on which a computer program is stored; and a processor that executes the computer program in the memory to implement the following operations.
  • Determine the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • Determine the target derived variable set, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where the first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model.
  • If the target derived variable set meets the quality convergence condition of the derived variables, output the derived variables in the target derived variable set as sample features of the risk identification model.
  • Determining the target derived variable set in the variation direction that gives the best derived variable quality includes: according to the first paternal cumulative variable set, selecting from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • The embodiments of this specification also provide a storage medium on which a computer program is stored. When executed by a processor, the program implements the following operations.
  • Determine the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • Determine the target derived variable set, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where the first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model.
  • If the target derived variable set meets the quality convergence condition of the derived variables, output the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set includes: according to the first paternal cumulative variable set, selecting from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • At least one of the technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects. The seed pool of the target genetic algorithm model is updated according to the quality of the derived variables generated by the model and its seed pool, so that the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • The target derived variable set is then determined, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where both sets are derived variable parents selected from the updated seed pool of the target genetic algorithm model. If the target derived variable set meets the quality convergence condition of the derived variables, the derived variables in it are output as sample features of the risk identification model.
  • In this way, the sample features of the risk identification model can be generated directly by the models, saving substantial computing resources and time. Furthermore, because the seed pool and the variation direction are optimized continuously to find the best derived variable quality, the final risk feature set is of high quality.
  • FIG. 1 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;
  • FIG. 2 is a schematic diagram of interaction between a service terminal and an electronic device according to an embodiment of this specification;
  • FIG. 3 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;
  • FIG. 4 is a flowchart of a method for selecting derived variables for a risk identification model provided by an embodiment of this specification;
  • FIG. 5 is a block diagram of functional modules of a device for selecting derived variables for a risk identification model provided by an embodiment of this specification;
  • FIG. 6 is a block diagram of functional modules of a device for selecting derived variables for a risk identification model provided by an embodiment of this specification;
  • FIG. 7 is a circuit connection block diagram of an electronic device provided by an embodiment of this specification.
  • an embodiment of this specification provides a method for selecting a derivative variable for a risk identification model, which is applied to an electronic device 100.
  • the electronic device 100 can be, but is not limited to, a server.
  • The electronic device 100 is communicatively connected to the service terminal 200 for data interaction.
  • Risk-sensitive applications, such as financial management and electronic payment applications, are installed on the service terminal 200.
  • The specific operation content of a generated transaction can be sent to the electronic device 100 and added to the seed pool.
  • the method includes S11 to S17.
  • S11 Determine the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool.
  • the quality of the derivative variable is used to evaluate the contribution of the derivative variable as the sample feature of the risk identification model of the target business.
  • the updated seed pool includes the set of parents of N best-quality derivative variables generated by the seed pool before the update.
  • N is a positive integer.
  • the target business can be a business that is highly sensitive to risks, such as payment business and money transfer business.
  • Optionally, the number of cumulative variables in the updated seed pool of the target genetic algorithm model is equal to the number of cumulative variables in the seed pool before the update. For example, if the seed pool before the update has 1000 cumulative variables, the updated seed pool still has 1000 cumulative variables.
  • Alternatively, the number of cumulative variables in the updated seed pool of the target genetic algorithm model is smaller than the number of cumulative variables in the seed pool before the update. For example, if the seed pool before the update has 1000 cumulative variables, the updated seed pool may have 500 cumulative variables, 200 cumulative variables, or any other integer number less than 1000.
  • The process of generating derived variables according to the target genetic algorithm model and its seed pool may include: taking the cumulative variable set of the target business as the initial seed pool of the target genetic algorithm model; taking a preset derivation strategy as the crossover operation of the target genetic algorithm model; taking the derived variables as the offspring of the target genetic algorithm model; taking the quality of a derived variable as the fitness of the offspring; as the mutation operation, selecting from the generated derived variable set the parent cumulative variables of the derived variables whose quality is greater than a preset threshold, and constructing the updated seed pool from them; taking the generation of a new derived variable set from the updated seed pool as the iterative operation of the target genetic algorithm model; and taking, as the convergence condition of the target genetic algorithm model, that the quality difference between the derived variable sets obtained by two adjacent iterations is less than a preset threshold.
  • In this way, the overall quality of the seed pool can be improved continuously. For example, seeds with a quality score greater than a preset threshold may account for 20% of the initial seed pool, 40% of the seed pool after the next update, and 55% after the update following that, so that the quality of the seed pool improves step by step.
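  • The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `evolve_seed_pool`, the use of every ordered pair of seeds as the crossover, the top-N selection, and the mean-quality convergence test are all hypothetical choices made for the example.

```python
import itertools


def evolve_seed_pool(pool, quality, top_n, tol=1e-3, max_iter=20):
    """Iterate: derive children from the pool, keep the parents of the best ones.

    `quality` is an assumed callable that scores a child (a pair of seeds).
    """
    prev_score = None
    best = []
    for _ in range(max_iter):
        # Crossover: every ordered pair of distinct seeds yields one derived variable.
        children = list(itertools.permutations(pool, 2))
        if not children:
            break
        # Fitness: rank the children by quality and keep the top N.
        best = sorted(children, key=quality, reverse=True)[:top_n]
        score = sum(quality(c) for c in best) / len(best)
        # Convergence: quality change between adjacent iterations is below tol.
        if prev_score is not None and abs(score - prev_score) < tol:
            break
        prev_score = score
        # Seed pool update: keep only the parents of the best children.
        pool = sorted({p for child in best for p in child})
    return pool, best


# Toy usage: seeds are plain numbers and quality is the ratio second / first.
pool, best = evolve_seed_pool(list(range(1, 9)), lambda c: c[1] / c[0], top_n=3)
```

  • In this toy run the pool shrinks to the parents of the three best ratios, which corresponds to the variant in which the updated seed pool is smaller than the pool before the update.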
  • The structure of a cumulative variable can include, but is not limited to, five dimensions: subject + object + function + time window + condition.
  • For example, for the cumulative variable "the number of times the user performs operation X within T days", the subject is the user ID, the object is the operation event ID, the function is count, and the time window is T days.
  • The cumulative variable of the target business may be the number of times a user performs the target business within a set time, for example the number of times the user performs the transfer business within 3 days or within 1 month. Cumulative variables have good discriminative power and business interpretability for risk identification.
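  • As an illustration of the five-dimension structure, a cumulative variable can be modelled as a small record that is evaluated against an event log. The class name `CumulativeVariable`, its field names, and the event dictionary layout (`user_id`, `event`, `time`) are assumptions made for this sketch; only the `count` function is handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CumulativeVariable:
    subject: str          # key identifying the subject, e.g. "user_id"
    obj: str              # operation event ID, e.g. "transfer"
    function: str         # aggregation name; only "count" is sketched here
    window_days: int      # time window T, in days
    condition: str = ""   # optional filter; unused in this sketch

    def evaluate(self, events, subject_value, now):
        """Count matching events for one subject inside the time window."""
        if self.function != "count":
            raise ValueError("only 'count' is sketched here")
        start = now - timedelta(days=self.window_days)
        return sum(
            1
            for e in events
            if e[self.subject] == subject_value
            and e["event"] == self.obj
            and start <= e["time"] <= now
        )


now = datetime(2021, 1, 31)
events = [
    {"user_id": "u1", "event": "transfer", "time": datetime(2021, 1, 30)},
    {"user_id": "u1", "event": "transfer", "time": datetime(2021, 1, 10)},
    {"user_id": "u1", "event": "login", "time": datetime(2021, 1, 30)},
    {"user_id": "u2", "event": "transfer", "time": datetime(2021, 1, 30)},
]
transfers_3d = CumulativeVariable("user_id", "transfer", "count", 3)
transfers_30d = CumulativeVariable("user_id", "transfer", "count", 30)
```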
  • Derived variables are derived from at least two cumulative variables. For example, applying an arithmetic operation to two cumulative variables that differ in one dimension (such as the time dimension), for instance dividing the number of transfers a user makes within 1 month by the number of transfers the user makes within 3 days, generates a derived variable. Derived variables likewise have good discriminative power and business interpretability for risk identification.
  • The operation is not limited to division; it can also be multiplication, addition, subtraction, and so on, depending on actual needs.
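  • A minimal sketch of this derivation step, assuming that the two parents must differ in exactly one dimension as in the example above. The names `Cumulative`, `OPS`, and `derive` are hypothetical, and division by zero is deliberately left unhandled.

```python
import operator
from typing import NamedTuple


class Cumulative(NamedTuple):
    subject: str
    obj: str
    function: str
    window_days: int


# Derivation strategies named in the text: division, multiplication, addition, subtraction.
OPS = {
    "div": operator.truediv,
    "mul": operator.mul,
    "add": operator.add,
    "sub": operator.sub,
}


def differs_in_one_dimension(a, b):
    """True when exactly one field differs between the two cumulative variables."""
    return sum(x != y for x, y in zip(a, b)) == 1


def derive(values, first, second, op="div"):
    """Combine the values of two cumulative variables into one derived value."""
    if not differs_in_one_dimension(first, second):
        raise ValueError("parents must differ in exactly one dimension")
    return OPS[op](values[first], values[second])


month = Cumulative("user", "transfer", "count", 30)
three_days = Cumulative("user", "transfer", "count", 3)
values = {month: 12, three_days: 3}
ratio = derive(values, month, three_days)            # 12 / 3
difference = derive(values, month, three_days, "sub")  # 12 - 3
```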
  • S13: According to the first paternal cumulative variable set and the second paternal cumulative variable set, determine the target derived variable set in the variation direction that gives the best derived variable quality.
  • The first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model. That is, the updated seed pool is divided into the first paternal cumulative variable set and the second paternal cumulative variable set. Assuming the derivation strategy is division, each cumulative variable in the first paternal cumulative variable set is taken as a denominator and each cumulative variable in the second paternal cumulative variable set is taken as a numerator.
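  • S13 includes: according to the first paternal cumulative variable set, selecting from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • M is a positive integer.
  • For example, suppose the first paternal cumulative variable set includes 10 cumulative variables A1 to A10 and the second paternal cumulative variable set includes 10 cumulative variables B1 to B10. A1 to A10 are traversed, and for each first parent traversed, the derived variable parent matching model matches one second parent from B1 to B10, until every first parent has been matched with a second parent.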
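  • The two sub-steps of S13 (match M second parents per first parent, then keep the N best candidates) can be sketched as below. The function `select_target_set` and the `match_score` and `quality` callables are hypothetical; `match_score` is only a stand-in for the derived variable parent matching model, which the text describes elsewhere as a reinforcement learning model.

```python
def select_target_set(first_parents, second_parents, match_score, quality, m, n):
    """Build candidate (first, second) pairs, then keep the N best by quality."""
    candidates = []
    for a in first_parents:
        # Stand-in for the matching model: keep the M best-scoring second parents.
        matched = sorted(
            second_parents, key=lambda b: match_score(a, b), reverse=True
        )[:m]
        candidates.extend((a, b) for b in matched)
    # Target set: the N candidates with the highest derived-variable quality.
    return sorted(candidates, key=lambda pair: quality(*pair), reverse=True)[:n]


# Toy usage with numbers standing in for cumulative variables.
target = select_target_set(
    first_parents=[1, 2, 3],
    second_parents=[10, 20, 30],
    match_score=lambda a, b: -abs(b - 10 * a),  # prefers b == 10 * a
    quality=lambda a, b: b - a,
    m=1,
    n=2,
)
```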
  • S15: Determine whether the target derived variable set satisfies the quality convergence condition of the derived variables. If so, execute S17; optionally, if not, return to S11.
  • S17: Output the derived variables in the target derived variable set as sample features of the risk identification model.
  • If the target derived variable set meets the quality convergence condition of the derived variables, the output of the derived variable parent matching model is stable, so further iterative training is unnecessary.
  • The method for selecting derived variables for a risk identification model determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • The target derived variable set is then determined, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where both sets are derived variable parents selected from the updated seed pool of the target genetic algorithm model. If the target derived variable set meets the quality convergence condition of the derived variables, the derived variables in it are output as sample features of the risk identification model.
  • In this way, the sample features of the risk identification model can be generated directly by the models, saving substantial computing resources and time. Furthermore, because the seed pool and the variation direction are optimized continuously to find the best derived variable quality, the final risk feature set is of high quality.
  • Optionally, the derived variable parent matching model is a reinforcement learning model. As shown in Figure 4, S13 specifically includes:
  • S41: Train the reinforcement learning model, taking each first parent in the first paternal cumulative variable set as the state of the reinforcement learning model, the probability distribution over the selection of the second parent matched to the first parent as the optimal policy of the reinforcement learning model, the selection of a second parent as the action of the reinforcement learning model, and the quality of the derived variable determined by the first parent and the second parent as the reward of the reinforcement learning model, to obtain from the second paternal cumulative variable set the second parent corresponding to each first parent.
  • S43: Determine the candidate derived variable set based on each first parent in the first paternal cumulative variable set and its corresponding second parent.
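  • The text does not name a specific reinforcement learning algorithm. As one possible reading, the sketch below uses a tabular gradient-bandit (softmax policy) learner: the state is a first parent, the action is the choice of a second parent, the policy is a probability distribution over second parents, and the reward is the quality of the resulting derived variable. The function `train_matcher`, its hyperparameters, and the toy reward are assumptions for illustration only.

```python
import math
import random


def softmax(prefs):
    """Turn a list of preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_matcher(first_parents, second_parents, quality,
                  episodes=2000, lr=0.1, seed=0):
    """Learn, per first parent, a distribution over second parents."""
    rng = random.Random(seed)
    prefs = {a: [0.0] * len(second_parents) for a in first_parents}
    baseline = {a: 0.0 for a in first_parents}
    visits = {a: 0 for a in first_parents}
    for _ in range(episodes):
        a = rng.choice(first_parents)                 # state: a first parent
        probs = softmax(prefs[a])                     # current policy
        k = rng.choices(range(len(second_parents)), weights=probs)[0]  # action
        reward = quality(a, second_parents[k])        # derived-variable quality
        visits[a] += 1
        baseline[a] += (reward - baseline[a]) / visits[a]  # running mean reward
        for j, p in enumerate(probs):
            grad = (1.0 - p) if j == k else -p
            prefs[a][j] += lr * (reward - baseline[a]) * grad
    return {a: softmax(prefs[a]) for a in first_parents}


# Toy usage: the best partner of first parent a is a + 10.
first, second = [0, 1, 2], [10, 11, 12]
policy = train_matcher(first, second, lambda a, b: 1.0 if b == a + 10 else 0.0)
```

  • The second parent matched to each first parent can then be read off as the most probable action under the learned policy, or the M most probable actions when M is greater than 1.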
  • Specifically, it may be determined whether the target derived variable set obtained from the updated seed pool meets the quality convergence condition of the derived variables relative to the target derived variable set obtained from the seed pool before the update; if so, the derived variables in the target derived variable set are output as sample features of the risk identification model.
  • The quality convergence condition may be that the quality of the target derived variable set obtained from the updated seed pool is within a preset threshold range of the quality of the target derived variable set obtained from the seed pool before the update.
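  • A minimal sketch of such a convergence test, assuming that the quality of a set is summarized by the mean quality of its members (the text does not say how set-level quality is aggregated). The names `mean_quality` and `has_converged` are hypothetical.

```python
def mean_quality(derived_set, quality):
    """Assumed set-level quality: the average quality of the members."""
    return sum(quality(v) for v in derived_set) / len(derived_set)


def has_converged(previous_set, current_set, quality, threshold):
    """True when the set-level quality moved by no more than the threshold."""
    delta = mean_quality(current_set, quality) - mean_quality(previous_set, quality)
    return abs(delta) <= threshold


# Toy usage: each derived variable is represented directly by its quality score.
previous, current = [0.50, 0.60], [0.56, 0.58]
stable = has_converged(previous, current, lambda v: v, threshold=0.05)
still_moving = not has_converged(previous, current, lambda v: v, threshold=0.01)
```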
  • the target genetic algorithm model uses a cumulative variable set randomly selected from the cumulative variable set of the target business as the initial seed pool.
  • In one implementation, the updated seed pool does not include any cumulative variables from the seed pool before the update other than the parent set.
  • Alternatively, the updated seed pool may include both the parent set of the N best-quality derived variables generated and cumulative variables randomly selected from the cumulative variable set of the target business.
  • In that case, the proportion of the updated seed pool taken up by the parent set of the N best-quality derived variables is greater than or equal to the proportion taken up by the cumulative variables randomly selected from the cumulative variable set of the target business.
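  • The mixed variant can be sketched as follows, assuming a fixed target pool size and capping the random share so that it never exceeds the share of the parent set. The function `update_seed_pool` and its parameters are hypothetical.

```python
import random


def update_seed_pool(best_children, business_variables, pool_size, rng):
    """Parents of the best derived variables first, then random fill-ins."""
    parents = []
    for child in best_children:
        for p in child:
            if p not in parents:          # de-duplicate while keeping order
                parents.append(p)
    parents = parents[:pool_size]
    # Cap the random share so the parents' share stays >= the random share.
    fill = min(pool_size - len(parents), len(parents))
    others = [v for v in business_variables if v not in parents]
    extras = rng.sample(others, min(fill, len(others)))
    return parents + extras


new_pool = update_seed_pool(
    best_children=[("a", "b"), ("a", "c")],
    business_variables=list("abcdefgh"),
    pool_size=5,
    rng=random.Random(0),
)
```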
  • an embodiment of this specification also provides a derivative variable selection device 500 for a risk identification model, which is applied to an electronic device 100.
  • the electronic device 100 may be, but is not limited to, a server.
  • The electronic device 100 is communicatively connected to the service terminal 200 for data interaction.
  • Risk-sensitive applications, such as financial management and electronic payment applications, are installed on the service terminal 200.
  • The specific operation content of a transaction can be sent to the electronic device 100 and added to the seed pool.
  • The device 500 includes a seed pool determining module 501, a derived variable determining module 502, and an information output module 503.
  • The seed pool determining module 501 determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • the target genetic algorithm model uses a cumulative variable set randomly selected from the cumulative variable set of the target business as the initial seed pool.
  • the updated seed pool does not include cumulative variables other than the set of paternal parents in the seed pool before the update.
  • The derived variable determining module 502 determines the target derived variable set, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where the first paternal cumulative variable set and the second paternal cumulative variable set are derived variable parents selected from the updated seed pool of the target genetic algorithm model.
  • both the first parent and the second parent include multiple dimensions, and the dimension value of one dimension is different between the first parent and the second parent.
  • The information output module 503 outputs the derived variables in the target derived variable set as sample features of the risk identification model if the target derived variable set satisfies the quality convergence condition of the derived variables.
  • Specifically, the derived variable determining module 502, according to the first paternal cumulative variable set, selects from the second paternal cumulative variable set, through the derived variable parent matching model, the M second parents matched to the first parents in the first paternal cumulative variable set, to generate a candidate derived variable set, and selects the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • When executed, the device 500 for selecting derived variables for a risk identification model determines the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update.
  • The target derived variable set is then determined, according to the first paternal cumulative variable set and the second paternal cumulative variable set, in the variation direction that gives the best derived variable quality, where both sets are derived variable parents selected from the updated seed pool of the target genetic algorithm model. If the target derived variable set satisfies the quality convergence condition of the derived variables, the derived variables in it are output as sample features of the risk identification model.
  • In this way, the sample features of the risk identification model can be generated directly by the models, saving substantial computing resources and time. Furthermore, because the seed pool and the variation direction are optimized continuously to find the best derived variable quality, the final risk feature set is of high quality.
  • Optionally, the derived variable parent matching model is a reinforcement learning model. The derived variable determining module 502 trains the reinforcement learning model, taking each first parent in the first paternal cumulative variable set as the state of the reinforcement learning model, the probability distribution over the selection of the second parent matched to the first parent as the optimal policy of the reinforcement learning model, the selection of a second parent as the action of the reinforcement learning model, and the quality of the derived variable determined by the first parent and the second parent as the reward of the reinforcement learning model, to obtain from the second paternal cumulative variable set the second parent corresponding to each first parent.
  • The derived variable determining module 502 then determines the candidate derived variable set based on each first parent in the first paternal cumulative variable set and its corresponding second parent.
  • The information output module 503 outputs the derived variables in the target derived variable set as sample features of the risk identification model if the target derived variable set obtained from the updated seed pool satisfies the quality convergence condition of the derived variables relative to the target derived variable set obtained from the seed pool before the update.
  • The device 500 further includes a process returning module 504 which, if the target derived variable set does not meet the convergence condition, returns to the step of determining the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool.
  • The steps of the method provided in Embodiment 1 may all be executed by the same device, or the method may be executed by different devices.
  • For example, steps 21 and 22 may be executed by device 1 and step 23 by device 2; or step 21 may be executed by device 1 and steps 22 and 23 by device 2; and so on.
  • Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Please refer to FIG. 7.
  • the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory.
  • the memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • the electronic device may also include hardware required by other services.
  • the processor, network interface, and memory can be connected to each other through an internal bus.
  • the internal bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used in FIG. 7, but it does not mean that there is only one bus or one type of bus.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
  • the processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the derived variable selection device for the risk identification model at the logical level.
  • the processor executes the program stored in the memory, and is specifically configured to perform the following operations: determining an updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model.
  • determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that optimizes derived variable quality, includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • the method performed by the derived variable selection device for the risk identification model, as disclosed in the embodiment shown in FIG. 1 of this specification, may be applied to a processor or implemented by a processor.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the steps of the method disclosed in the embodiments of this specification may be executed and completed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the electronic device can also execute the method in FIG. 1 and realize the functions of the derived variable selection device for the risk identification model in the embodiment shown in FIG. 1, which are not repeated here.
  • the electronic device in the embodiments of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. That is to say, the execution body of the processing flow is not limited to individual logic units; it can also be hardware or logic devices.
  • the embodiment of the present specification also proposes a computer-readable storage medium that stores one or more programs, the one or more programs including instructions that, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to execute the method of the embodiment described above, and specifically to perform the following operations: determining an updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model.
  • determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  • a typical implementation device is a computer.
  • the computer can be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • Computer-readable media includes permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by computing devices. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A derived variable selection method and apparatus for a risk identification model, an electronic device, and a storage medium, relating to the field of risk identification. The method comprises: determining an updated seed pool of a target genetic algorithm model according to the target genetic algorithm model and the quality of a derived variable generated by a seed pool thereof (S11), wherein the updated seed pool comprises a parent sample set of N derived variables having optimal quality generated by the seed pool before updating; then according to a first parent sample accumulation variable set and a second parent sample accumulation variable set, determining a target derived variable set in the variation direction of the derived variable having the optimal quality (S13); and outputting the derived variables in the target derived variable set as the sample features of the risk identification model (S17).

Description

Derived variable selection method and device for risk identification model
Technical field
The embodiments of this specification relate to the technical field of risk identification, and in particular to a derived variable selection method and device for a risk identification model.
Background
Identifying risk features is an essential function for protecting users' interests in risk-sensitive scenarios such as wealth management applications and electronic payment applications. In these scenarios, risk control for user transactions and accounts is highly adversarial and must cover many risk types, including account theft, fraud, cash-out, cheating, and money laundering. Groups and individuals such as black-market gangs and the "wool party" (users who systematically exploit promotions) target the existing risk control system and bypass its risk identification in order to steal funds or carry out illegal transactions. The underlying cause is that the risk features used as training samples in the sample database for training the risk identification model are insufficient in number and diversity.
The existing way to increase the number and diversity of risk features is to derive risk features by brute force using exhaustive enumeration, and then to filter the features with a preset screening condition (feature importance greater than a preset threshold). This consumes a large amount of computing resources and time, and the resulting risk feature set is of low quality.
Summary of the invention
The purpose of the embodiments of this specification is to provide a derived variable selection method and device for a risk identification model, so as to improve the efficiency and quality of selecting the risk feature set.
In a first aspect, the embodiments of this specification provide a derived variable selection method for a risk identification model, including: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that optimizes derived variable quality, includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
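The two-stage selection in the first aspect (match M second parents to each first parent, then keep the N best candidates) can be illustrated with a minimal Python sketch. The names `select_target_set`, `match_score`, and `quality`, and the toy scores, are assumptions made for illustration; the specification does not fix how the parent matching model scores a pair or how quality is computed.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (first parent, second parent) defining one derived variable


def select_target_set(
    first_parents: Sequence[str],
    second_parents: Sequence[str],
    match_score: Callable[[str, str], float],  # hypothetical stand-in for the matching model
    quality: Callable[[Pair], float],          # hypothetical quality measure of a derived variable
    m: int,
    n: int,
) -> List[Pair]:
    """Pick M second parents per first parent, then keep the N best candidates."""
    candidates: List[Pair] = []
    for p1 in first_parents:
        # Rank the second parents for this first parent by the matching score.
        ranked = sorted(second_parents, key=lambda p2: match_score(p1, p2), reverse=True)
        candidates.extend((p1, p2) for p2 in ranked[:m])
    # Keep the N candidates with the highest quality as the target set.
    return sorted(candidates, key=quality, reverse=True)[:n]


# Toy data (illustrative only).
match = {("a", "x"): 0.9, ("a", "y"): 0.1, ("a", "z"): 0.5,
         ("b", "x"): 0.2, ("b", "y"): 0.8, ("b", "z"): 0.7}
qual = {("a", "x"): 0.6, ("a", "z"): 0.9, ("b", "y"): 0.3, ("b", "z"): 0.8}

target = select_target_set(
    ["a", "b"], ["x", "y", "z"],
    match_score=lambda p1, p2: match[(p1, p2)],
    quality=lambda pair: qual.get(pair, 0.0),
    m=2, n=2,
)
print(target)  # [('a', 'z'), ('b', 'z')]
```

Only M candidates per first parent are ever scored for quality, which is what lets this step avoid the exhaustive pairing criticised in the background section.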
In a second aspect, the embodiments of this specification also provide a derived variable selection device for a risk identification model, including: a seed pool determining module, which determines an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; a derived variable determining module, which determines a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and an information output module, which, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputs the derived variables in the target derived variable set as the sample features of the risk identification model. Specifically, the derived variable determining module, according to the first parent cumulative variable set, selects from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set, and selects the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
In a third aspect, the embodiments of this specification also provide an electronic device, including: a memory on which a computer program is stored; and a processor configured to execute the computer program in the memory so as to implement: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that optimizes derived variable quality, includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
In a fourth aspect, the embodiments of this specification also provide a storage medium on which a computer program is stored, the program, when executed by a processor, implementing: determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of a target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update; determining a target derived variable set, in the mutation direction that optimizes derived variable quality, according to a first parent cumulative variable set and a second parent cumulative variable set, where the first parent cumulative variable set and the second parent cumulative variable set are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition of the derived variables, outputting the derived variables in the target derived variable set as the sample features of the risk identification model. Determining the target derived variable set according to the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that optimizes derived variable quality, includes: according to the first parent cumulative variable set, selecting from the second parent cumulative variable set, through a derived variable parent matching model, the M second parents matched to the first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects. The updated seed pool of the target genetic algorithm model is determined according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update. The target derived variable set is then determined, in the mutation direction that optimizes derived variable quality, according to the first parent cumulative variable set and the second parent cumulative variable set, where both sets are derived-variable parents selected based on the updated seed pool of the target genetic algorithm model. If the target derived variable set satisfies the quality convergence condition of the derived variables, the derived variables in the target derived variable set are output as the sample features of the risk identification model. As a result, the sample features of the risk identification model can be generated directly by the model, which saves a large amount of computing resources and time. Furthermore, because both the seed pool and the mutation direction that optimizes derived variable quality are continuously refined, the resulting risk feature set is of high quality.
Description of the drawings
The drawings described here are used to provide a further understanding of the embodiments of this specification and constitute a part of the embodiments of this specification. The exemplary embodiments of this specification and their descriptions are used to explain the embodiments of this specification and do not constitute an improper limitation on them. In the drawings:
FIG. 1 is a flowchart of a derived variable selection method for a risk identification model provided by an embodiment of this specification;
FIG. 2 is a schematic diagram of the interaction between a service terminal and an electronic device provided by an embodiment of this specification;
FIG. 3 is a flowchart of a derived variable selection method for a risk identification model provided by an embodiment of this specification;
FIG. 4 is a flowchart of a derived variable selection method for a risk identification model provided by an embodiment of this specification;
FIG. 5 is a functional block diagram of a derived variable selection device for a risk identification model provided by an embodiment of this specification;
FIG. 6 is a functional block diagram of a derived variable selection device for a risk identification model provided by an embodiment of this specification;
FIG. 7 is a circuit connection block diagram of an electronic device provided by an embodiment of this specification.
Detailed description
To make the purposes, technical solutions, and advantages of the embodiments of this specification clearer, the technical solutions of the embodiments of this specification are described clearly and completely below in conjunction with specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this specification, without creative work, fall within the protection scope of the embodiments of this specification.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the drawings.
Referring to FIG. 1, an embodiment of this specification provides a derived variable selection method for a risk identification model, applied to an electronic device 100. The electronic device 100 may be, but is not limited to, a server. As shown in FIG. 2, the electronic device 100 is communicatively connected to a service terminal 200 for data exchange. The service terminal 200 has risk-sensitive applications installed, such as applications related to wealth management and electronic payment. When a user conducts a transaction on the service terminal 200, the specific operations that produced the transaction can be sent to the electronic device 100 and added to the seed pool. The method includes S11 to S17.
S11: Determine the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool.
Here, the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent set of the N best-quality derived variables generated by the seed pool before the update. It should be understood that N is a positive integer. For example, if 10,000 derived variables are generated in total, the parent set of the top N = 1,000 derived variables may be selected to construct the updated seed pool. As another example, if 10,000 derived variables are generated in total and the maximum quality score is 100, the parent set of the N derived variables whose quality score is greater than 70 may be selected to construct the updated seed pool. In addition, the target business can be a risk-sensitive business such as a payment business or a money transfer business.
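The two selection rules in the examples above (keep the parents of the top N derived variables, or of those scoring above a threshold) can be sketched as follows. The data layout, the function names, and the variable names such as `amt_7d` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One derived variable: ((parent_1, parent_2), quality score). Assumed layout.
Derived = Tuple[Tuple[str, str], float]


def parents_of_top_n(derived: List[Derived], n: int) -> List[str]:
    """Updated seed pool = parents of the N highest-quality derived variables."""
    best = sorted(derived, key=lambda item: item[1], reverse=True)[:n]
    return sorted({p for parents, _ in best for p in parents})


def parents_above_threshold(derived: List[Derived], threshold: float) -> List[str]:
    """Updated seed pool = parents of derived variables scoring above the threshold."""
    return sorted({p for parents, q in derived if q > threshold for p in parents})


# Toy scores on a 0-100 scale (illustrative only).
scored = [
    (("amt_7d", "cnt_7d"), 82.0),
    (("amt_7d", "amt_30d"), 65.0),
    (("cnt_1d", "cnt_30d"), 91.0),
]

print(parents_of_top_n(scored, 2))          # ['amt_7d', 'cnt_1d', 'cnt_30d', 'cnt_7d']
print(parents_above_threshold(scored, 90))  # ['cnt_1d', 'cnt_30d']
```

With the top-N rule the number of retained derived variables is fixed in advance, while with the threshold rule N is whatever number of derived variables clears the score.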
Optionally, the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is equal to the number of cumulative variables in the seed pool before the update. For example, if the seed pool has 1,000 cumulative variables before the update, it still has 1,000 cumulative variables after the update.
Alternatively, the number of cumulative variables in the seed pool after the target genetic algorithm model is updated is smaller than the number of cumulative variables in the seed pool before the update. For example, if the seed pool has 1,000 cumulative variables before the update, the updated seed pool may have 500 cumulative variables, 200 cumulative variables, or any other integer number smaller than 1,000.
Specifically, the process of generating derived variables from the target genetic algorithm model and its seed pool may be set up as follows: the cumulative variable set of the target business serves as the initial seed pool of the target genetic algorithm model; the preset derivation strategy serves as the crossover operation; the derived variables serve as the offspring; the quality of a derived variable serves as the fitness of the offspring; building the updated seed pool from the parent cumulative variables of those derived variables in the generated set whose quality exceeds a preset threshold serves as the mutation operation; generating a new derived variable set from the updated seed pool serves as the iteration operation; and the condition that the quality difference between the derived variable sets obtained in two adjacent iterations is less than a preset threshold serves as the convergence condition.
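The mapping above can be read as a simple loop. The sketch below is one hedged interpretation, not the claimed implementation: it assumes that every pair of seeds is crossed, that the quality of a set is the mean quality of its members, and that `derive`, `quality`, `threshold`, and `eps` are supplied by the caller. All names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def evolve(seed_pool, derive, quality, threshold, eps, max_iters=20):
    pool = set(seed_pool)
    prev_score = None
    scored = []
    for _ in range(max_iters):
        # Crossover: apply the derivation strategy to each pair of seeds,
        # then score each offspring (fitness = quality of the derived variable).
        scored = [(a, b, quality(derive(a, b)))
                  for a, b in combinations(sorted(pool), 2)]
        if not scored:
            break
        score = mean(q for _, _, q in scored)
        # Convergence: set quality changed by less than eps between iterations.
        if prev_score is not None and abs(score - prev_score) < eps:
            break
        # Seed pool update: keep only parents of offspring above the threshold.
        kept = {p for a, b, q in scored if q > threshold for p in (a, b)}
        if not kept:
            break
        pool, prev_score = kept, score
    return pool, scored

# Toy run: seeds are plain numbers, the derivation strategy is a ratio,
# and the ratio itself stands in for the quality score.
pool, scored = evolve({1, 2, 4, 8},
                      derive=lambda a, b: b / a,
                      quality=lambda v: v,
                      threshold=5.0, eps=0.01)
```

In the toy run only the pair (1, 8) exceeds the threshold, so the pool shrinks to those two seeds and the loop stops once the mean quality no longer changes.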
Iteratively updating the seed pool steadily improves its overall quality. For example, seeds whose quality score exceeds the preset threshold may account for 20% of the initial seed pool, 40% after the next update, and 55% after the update that follows, so the quality of the seed pool rises step by step.
A cumulative variable may be composed of, but is not limited to, five dimensions: subject, object, function, time window, and condition. For example, for the cumulative variable "number of times a user performs operation X within T days", the subject is the user ID, the object is the operation event ID, the function is count, the time window is T days, and the condition is operation type = X. Specifically, a cumulative variable of the target business may be the number of times a user performs the target business within a set period, for example the number of transfers a user makes within 3 days or within 1 month. Cumulative variables identify risk well and are interpretable in business terms.
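As an illustration of the five dimensions, the sketch below models a cumulative variable as a small record and evaluates the count function over a toy event log. The class name, field names, event layout, and the reading of "within T days" as the last T days including the current day are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CumulativeVariable:
    subject: str       # e.g. "user_id"
    obj: str           # e.g. "event_id"
    function: str      # e.g. "count"
    window_days: int   # time window T, in days
    condition: str     # e.g. operation type "transfer"

    def evaluate(self, events, subject_value, today):
        """Count matching events for one subject inside the time window."""
        if self.function != "count":
            raise NotImplementedError("only the count function is sketched")
        return sum(
            1 for e in events
            if e[self.subject] == subject_value
            and e["type"] == self.condition
            and 0 <= today - e["day"] < self.window_days
        )

# Toy event log; "day" is an integer day index.
events = [
    {"user_id": "u1", "event_id": 1, "type": "transfer", "day": 10},
    {"user_id": "u1", "event_id": 2, "type": "transfer", "day": 9},
    {"user_id": "u1", "event_id": 3, "type": "transfer", "day": 5},
    {"user_id": "u1", "event_id": 4, "type": "payment", "day": 10},
    {"user_id": "u2", "event_id": 5, "type": "transfer", "day": 10},
]
transfers_3d = CumulativeVariable("user_id", "event_id", "count", 3, "transfer")
transfers_30d = CumulativeVariable("user_id", "event_id", "count", 30, "transfer")
count_3d = transfers_3d.evaluate(events, "u1", today=10)
count_30d = transfers_30d.evaluate(events, "u1", today=10)
```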
A derived variable is derived from at least two cumulative variables. For example, an arithmetic operation is applied to two cumulative variables that differ in one dimension (such as the time dimension): dividing the number of transfers a user makes within 1 month by the number of transfers the user makes within 3 days yields one derived variable. Derived variables likewise identify risk well and are interpretable in business terms. The operation is not limited to division; it may also be multiplication, addition, subtraction, and so on, depending on actual needs.
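A minimal sketch of this derivation step follows, assuming a five-field record for cumulative variables and a ratio as the derivation strategy. The names `CumVar`, `differing_dimensions`, and `derive_ratio`, and the choice to return a default value when the denominator is zero, are illustrative assumptions rather than part of the specification.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CumVar:
    subject: str
    obj: str
    function: str
    window_days: int
    condition: str

def differing_dimensions(a, b):
    """Names of the dimensions in which two cumulative variables differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]

def derive_ratio(numerator_value, denominator_value, default=0.0):
    """Division strategy; the default guards against a zero denominator."""
    if denominator_value == 0:
        return default
    return numerator_value / denominator_value

monthly = CumVar("user_id", "event_id", "count", 30, "transfer")
three_day = CumVar("user_id", "event_id", "count", 3, "transfer")

# The pair qualifies as parents because only the time window differs.
diff = differing_dimensions(monthly, three_day)
# 12 transfers in one month divided by 3 transfers in three days.
ratio = derive_ratio(12, 3)
```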
S13: Determine the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality.
Here, the first parent cumulative variable set and the second parent cumulative variable set are parents of derived variables selected from the updated seed pool of the target genetic algorithm model. That is, the updated seed pool is divided into the first parent cumulative variable set and the second parent cumulative variable set; if the derivation strategy is division, each cumulative variable in the first set is used as a denominator and each cumulative variable in the second set is used as a numerator. Specifically, as shown in FIG. 3, S13 includes:
S31: Based on the first parent cumulative variable set, use the derived variable parent matching model to select, from the second parent cumulative variable set, the M second parents matched to a first parent in the first parent cumulative variable set, so as to generate the candidate derived variable set.
Here, M is a positive integer.
S33: Select the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
For example, if the first parent cumulative variable set contains the 10 cumulative variables A1 to A10 and the second parent cumulative variable set contains the 10 cumulative variables B1 to B10, then A1 to A10 are traversed and, for the first parent currently being visited, the derived variable parent matching model selects one of the second parents B1 to B10 as its match (M = 1 in this example), until every first parent has been matched with a second parent.
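Steps S31 and S33 can be sketched as two small functions. Here `match_score` is a hypothetical stand-in for the derived variable parent matching model, and the score table is made-up data; the specification does not prescribe how the model ranks second parents.

```python
def build_candidates(first_set, second_set, match_score, m):
    """S31 sketch: pair each first parent with its m best-matching second parents."""
    candidates = []
    for a in first_set:
        best = sorted(second_set, key=lambda b: match_score(a, b), reverse=True)[:m]
        candidates.extend((a, b) for b in best)
    return candidates

def select_top_n(candidates, quality, n):
    """S33 sketch: keep the n candidate pairs with the highest quality."""
    return sorted(candidates, key=lambda pair: quality(*pair), reverse=True)[:n]

# Toy score table, used both as the match score and as the quality.
scores = {
    ("A1", "B1"): 0.2, ("A1", "B2"): 0.9, ("A1", "B3"): 0.5,
    ("A2", "B1"): 0.7, ("A2", "B2"): 0.1, ("A2", "B3"): 0.6,
}
score = lambda a, b: scores[(a, b)]

candidates = build_candidates(["A1", "A2"], ["B1", "B2", "B3"], score, m=2)
target = select_top_n(candidates, score, n=2)
```

With M = 2 each first parent contributes two candidate pairs, and the N = 2 best pairs form the target set.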
S15: Determine whether the target derived variable set satisfies the quality convergence condition for derived variables. If so, perform S17; optionally, if not, return to S11.
S17: Output the derived variables in the target derived variable set as sample features of the risk identification model.
When the target derived variable set satisfies the quality convergence condition, the derived variable parent matching model already produces stable output, so iterative training stops.
In this derived variable selection method for a risk identification model, the updated seed pool of the target genetic algorithm model is determined from the quality of the derived variables generated by the target genetic algorithm model and its seed pool, and the updated seed pool includes the parent sets of the N best-quality derived variables generated from the seed pool before the update. The target derived variable set is then determined from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, where both sets are parents of derived variables selected from the updated seed pool. If the target derived variable set satisfies the quality convergence condition, its derived variables are output as sample features of the risk identification model. The sample features of the risk identification model can thus be generated directly by the model, which saves substantial computing resources and time. In addition, because both the seed pool and the mutation direction that gives the best derived variable quality are optimized continuously, the resulting risk feature set is of high quality.
Optionally, the derived variable parent matching model is a reinforcement learning model. As shown in FIG. 4, S12 specifically includes:
S41: Train the reinforcement learning model with a first parent in the first parent cumulative variable set as the state, the probability distribution over the choice of the second parent matched to the first parent as the optimal policy, the choice of the second parent as the action, and the quality of the derived variable determined by the first parent and the second parent as the feedback reward, so as to obtain the second parent corresponding to each first parent in the first parent cumulative variable set.
S43: Determine the candidate derived variable set from each first parent in the first parent cumulative variable set and its corresponding second parent.
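The state, action, policy, and reward roles in S41 can be illustrated with a very small policy-gradient learner. The specification does not name a particular reinforcement learning algorithm; the gradient-bandit update, the softmax policy, the running-mean baseline, and every name and parameter below are assumptions chosen only to make the roles concrete.

```python
import math
import random

def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_matcher(first_set, second_set, quality, steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    prefs = {a: [0.0] * len(second_set) for a in first_set}
    baseline = {a: 0.0 for a in first_set}
    counts = {a: 0 for a in first_set}
    for _ in range(steps):
        for a in first_set:                       # state: a first parent
            probs = softmax(prefs[a])             # policy: distribution over second parents
            i = rng.choices(range(len(second_set)), weights=probs)[0]  # action
            reward = quality(a, second_set[i])    # feedback: derived variable quality
            advantage = reward - baseline[a]
            for j in range(len(second_set)):
                grad = (1.0 - probs[j]) if j == i else -probs[j]
                prefs[a][j] += lr * advantage * grad
            counts[a] += 1
            baseline[a] += (reward - baseline[a]) / counts[a]  # running mean reward
    return {a: softmax(prefs[a]) for a in first_set}

# Toy reward: exactly one good second parent per first parent.
good = {("A1", "B2"), ("A2", "B1")}
quality = lambda a, b: 1.0 if (a, b) in good else 0.0
second = ["B1", "B2", "B3"]
policy = train_matcher(["A1", "A2"], second, quality)
matches = {a: second[max(range(len(second)), key=p.__getitem__)]
           for a, p in policy.items()}
```

After training, taking the most probable second parent for each first parent gives the pairs from which the candidate derived variable set in S43 would be built.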
Optionally, S13 may specifically be: determine whether the target derived variable set obtained from the updated seed pool satisfies the quality convergence condition for derived variables relative to the target derived variable set obtained from the seed pool before the update, and if so, output the derived variables in the target derived variable set as sample features of the risk identification model.
For example, it is determined whether the difference between the quality of the target derived variable set obtained from the updated seed pool and the quality of the target derived variable set obtained from the seed pool before the update falls within a preset threshold range.
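A hedged sketch of such a convergence check is given below. Summarizing the quality of a set by the mean of its members' scores is an assumption; the specification only requires that the change in quality between the two sets fall within a preset threshold range. The function name and `eps` are hypothetical.

```python
from statistics import mean

def has_converged(prev_qualities, curr_qualities, eps):
    """True when the mean quality changed by less than eps between two iterations."""
    return abs(mean(curr_qualities) - mean(prev_qualities)) < eps

# Mean moves from 65.0 to 65.25: within a threshold of 1.0.
stable = has_converged([60.0, 70.0], [64.5, 66.0], eps=1.0)
# Mean moves from 55.0 to 75.0: keep iterating.
still_improving = has_converged([50.0, 60.0], [70.0, 80.0], eps=1.0)
```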
Optionally, the target genetic algorithm model uses, as its initial seed pool, a set of cumulative variables randomly selected from the cumulative variable set of the target business.
Optionally, the updated seed pool does not include cumulative variables from the pre-update seed pool that fall outside the parent sets.
Specifically, the updated seed pool may include the parent sets of the N best-quality derived variables generated, together with cumulative variables randomly selected from the cumulative variable set of the target business, where the share of the parent sets of the N best-quality derived variables is greater than or equal to the share of the randomly selected cumulative variables.
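One way to build such a mixed seed pool is sketched below. The function name, the fixed `pool_size`, and the seeded random generator are assumptions; the only constraint taken from the text is that the randomly selected cumulative variables never outnumber the retained parents.

```python
import random

def update_seed_pool(elite_parents, all_cumulative, pool_size, seed=0):
    """Retained parents first, then a random top-up no larger than the retained part."""
    rng = random.Random(seed)
    elite = sorted(elite_parents)[:pool_size]
    others = sorted(set(all_cumulative) - set(elite))
    n_random = min(pool_size - len(elite), len(elite), len(others))
    return elite + rng.sample(others, n_random)

# Three retained parents, pool size five: two slots are filled at random.
new_pool = update_seed_pool({"a", "b", "c"}, list("abcdefgh"), pool_size=5)
```

The random top-up keeps some diversity in the pool while the retained parents still make up at least half of it.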
Referring to FIG. 5, an embodiment of this specification further provides a derived variable selection device 500 for a risk identification model, applied to an electronic device 100. The electronic device 100 may be, but is not limited to, a server. As shown in FIG. 2, the electronic device 100 is communicatively connected to a service terminal 200 for data exchange. The service terminal 200 has risk-sensitive applications installed, such as applications related to wealth management and electronic payment. When a user conducts a transaction on the service terminal 200, the specific operations that make up the transaction may be sent to the electronic device 100 and added to the seed pool. It should be noted that the basic principle and technical effects of the derived variable selection device 500 provided in this embodiment are the same as those of the foregoing embodiments; for brevity, matters not mentioned in this part can be found in the corresponding content of the foregoing embodiments. The device 500 includes a seed pool determination module 501, a derived variable determination module 502, and an information output module 503, wherein:
The seed pool determination module 501 determines the updated seed pool of the target genetic algorithm model from the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable measures how much the derived variable contributes as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent sets of the N best-quality derived variables generated from the seed pool before the update.
Optionally, the target genetic algorithm model uses, as its initial seed pool, a set of cumulative variables randomly selected from the cumulative variable set of the target business. In addition, the updated seed pool does not include cumulative variables from the pre-update seed pool that fall outside the parent sets.
The derived variable determination module 502 determines the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, where the first parent cumulative variable set and the second parent cumulative variable set are parents of derived variables selected from the updated seed pool of the target genetic algorithm model.
Optionally, the first parent and the second parent each include multiple dimensions, and the first parent and the second parent differ in the value of one dimension.
The information output module 503 outputs the derived variables in the target derived variable set as sample features of the risk identification model if the target derived variable set satisfies the quality convergence condition for derived variables. Specifically:
The derived variable determination module 502, based on the first parent cumulative variable set, uses the derived variable parent matching model to select, from the second parent cumulative variable set, the M second parents matched to a first parent in the first parent cumulative variable set, so as to generate the candidate derived variable set, and selects the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
When executed, the derived variable selection device 500 for a risk identification model can implement the following functions. The updated seed pool of the target genetic algorithm model is determined from the quality of the derived variables generated by the target genetic algorithm model and its seed pool, and the updated seed pool includes the parent sets of the N best-quality derived variables generated from the seed pool before the update. The target derived variable set is then determined from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, where both sets are parents of derived variables selected from the updated seed pool. If the target derived variable set satisfies the quality convergence condition, its derived variables are output as sample features of the risk identification model. The sample features of the risk identification model can thus be generated directly by the model, which saves substantial computing resources and time. In addition, because both the seed pool and the mutation direction that gives the best derived variable quality are optimized continuously, the resulting risk feature set is of high quality.
Optionally, the derived variable parent matching model is a reinforcement learning model, and the derived variable determination module trains the reinforcement learning model with a first parent in the first parent cumulative variable set as the state, the probability distribution over the choice of the second parent matched to the first parent as the optimal policy, the choice of the second parent as the action, and the quality of the derived variable determined by the first parent and the second parent as the feedback reward, so as to obtain the second parent corresponding to each first parent in the first parent cumulative variable set; and determines the candidate derived variable set from each first parent in the first parent cumulative variable set and its corresponding second parent.
Optionally, the information output module 503 outputs the derived variables in the target derived variable set as sample features of the risk identification model if the target derived variable set obtained from the updated seed pool satisfies the quality convergence condition for derived variables relative to the target derived variable set obtained from the seed pool before the update.
Optionally, as shown in FIG. 6, the device 500 further includes a process return module 504 which, if the target derived variable set does not satisfy the convergence condition, returns to the step of determining the updated seed pool of the target genetic algorithm model from the quality of the derived variables generated by the target genetic algorithm model and its seed pool.
It should be noted that all steps of the method provided in Embodiment 1 may be executed by the same device, or different steps may be executed by different devices. For example, steps 21 and 22 may be executed by device 1 and step 23 by device 2; or step 21 may be executed by device 1 and steps 22 and 23 by device 2; and so on.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of this specification. Referring to FIG. 7, at the hardware level the electronic device includes a processor and optionally further includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage device. The electronic device may of course also include hardware required by other services.
The processor, the network interface, and the memory may be connected to one another through the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, FIG. 7 uses a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and runs it, forming, at the logical level, the derived variable selection device for a risk identification model. The processor executes the program stored in the memory and is specifically configured to perform the following operations: determining the updated seed pool of the target genetic algorithm model from the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable measures how much the derived variable contributes as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent sets of the N best-quality derived variables generated from the seed pool before the update; determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, where the first parent cumulative variable set and the second parent cumulative variable set are parents of derived variables selected from the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, includes: based on the first parent cumulative variable set, using the derived variable parent matching model to select, from the second parent cumulative variable set, the M second parents matched to a first parent in the first parent cumulative variable set, so as to generate the candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
The method performed by the derived variable selection device for a risk identification model, as disclosed in the embodiment shown in FIG. 1 of this specification, may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this specification may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device can also execute the method of FIG. 1 and implement the functions of the derived variable selection device for a risk identification model in the embodiment shown in FIG. 1; details are not repeated here.
Of course, in addition to a software implementation, the electronic device of the embodiments of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. In other words, the execution body of the following processing flow is not limited to individual logic units and may also be hardware or a logic device.
An embodiment of this specification further provides a computer-readable storage medium that stores one or more programs. The one or more programs include instructions that, when executed by a portable electronic device that includes multiple applications, cause the portable electronic device to execute the method of the embodiment shown in FIG. 1 and specifically to perform the following operations: determining the updated seed pool of the target genetic algorithm model from the quality of the derived variables generated by the target genetic algorithm model and its seed pool, where the quality of a derived variable measures how much the derived variable contributes as a sample feature of the risk identification model of the target business, and the updated seed pool includes the parent sets of the N best-quality derived variables generated from the seed pool before the update; determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, where the first parent cumulative variable set and the second parent cumulative variable set are parents of derived variables selected from the updated seed pool of the target genetic algorithm model; and, if the target derived variable set satisfies the quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model. Determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, following the mutation direction that gives the best derived variable quality, includes: based on the first parent cumulative variable set, using the derived variable parent matching model to select, from the second parent cumulative variable set, the M second parents matched to a first parent in the first parent cumulative variable set, so as to generate the candidate derived variable set; and selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
In short, the above are merely preferred embodiments of this specification and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of this specification shall fall within their scope of protection.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be read with reference to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the corresponding parts of the method embodiment.

Claims (11)

  1. A derived variable selection method for a risk identification model, comprising:
    determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of a risk identification model of a target business, and the updated seed pool comprises the set of parents of the N best-quality derived variables generated from the seed pool before the update;
    determining a target derived variable set from a first parent cumulative variable set and a second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model;
    if the target derived variable set satisfies a quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model; wherein
    determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, comprises:
    selecting, according to the first parent cumulative variable set and by means of a derived variable parent matching model, M second parents in the second parent cumulative variable set that match a first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set;
    selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  2. The method according to claim 1, wherein the derived variable parent matching model is a reinforcement learning model, and selecting, according to the first parent cumulative variable set and by means of the derived variable parent matching model, M second parents in the second parent cumulative variable set that match a first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set, comprises:
    training the reinforcement learning model with a first parent in the first parent cumulative variable set as the state of the reinforcement learning model, the probability distribution over selections of the second parent matched to the first parent as the optimal policy of the reinforcement learning model, the selection of a second parent as the action of the reinforcement learning model, and the quality of the derived variable determined by the first parent and the second parent as the reward of the reinforcement learning model, so as to obtain the second parent corresponding to each first parent in the third cumulative variable set;
    determining the candidate derived variable set based on each first parent in the first cumulative variable set and its corresponding second parent.
  3. The method according to claim 1 or 2, wherein, if the target derived variable set satisfies the quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model comprises:
    if the target derived variable set obtained from the updated seed pool satisfies the quality convergence condition for derived variables relative to the target derived variable set obtained from the seed pool before the update, outputting the derived variables in the target derived variable set as sample features of the risk identification model.
  4. The method according to claim 1 or 2, wherein the target genetic algorithm model uses, as its initial seed pool, a cumulative variable set randomly selected from the cumulative variable set of the target business.
  5. The method according to claim 1 or 2, wherein the updated seed pool does not include any cumulative variable of the seed pool before the update other than those in the set of parents.
  6. The method according to claim 1 or 2, wherein the first parent and the second parent each comprise multiple dimensions, and the first parent and the second parent differ in the dimension value of one dimension.
  7. The method according to claim 1 or 2, wherein, if the target derived variable set does not satisfy the convergence condition, the method returns to the step of determining the updated seed pool of the target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool.
  8. The method according to claim 1 or 2, wherein
    the number of cumulative variables in the updated seed pool of the target genetic algorithm model is equal to the number of cumulative variables in the seed pool before the update; or
    the number of cumulative variables in the updated seed pool of the target genetic algorithm model is smaller than the number of cumulative variables in the seed pool before the update.
  9. A derived variable selection apparatus for a risk identification model, comprising:
    a seed pool determining module, which determines an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of a risk identification model of a target business, and the updated seed pool comprises the set of parents of the N best-quality derived variables generated from the seed pool before the update;
    a derived variable determining module, which determines a target derived variable set from a first parent cumulative variable set and a second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model;
    an information output module, which, if the target derived variable set satisfies a quality convergence condition for derived variables, outputs the derived variables in the target derived variable set as sample features of the risk identification model; wherein
    the derived variable determining module specifically selects, according to the first parent cumulative variable set and by means of a derived variable parent matching model, M second parents in the second parent cumulative variable set that match a first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set, and selects the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  10. An electronic device, comprising:
    a memory on which a computer program is stored;
    a processor configured to execute the computer program in the memory so as to implement:
    determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of a risk identification model of a target business, and the updated seed pool comprises the set of parents of the N best-quality derived variables generated from the seed pool before the update;
    determining a target derived variable set from a first parent cumulative variable set and a second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model;
    if the target derived variable set satisfies a quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model; wherein
    determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, comprises:
    selecting, according to the first parent cumulative variable set and by means of a derived variable parent matching model, M second parents in the second parent cumulative variable set that match a first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set;
    selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
  11. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements:
    determining an updated seed pool of a target genetic algorithm model according to the quality of the derived variables generated by the target genetic algorithm model and its seed pool, wherein the quality of a derived variable is used to evaluate the contribution of the derived variable as a sample feature of a risk identification model of a target business, and the updated seed pool comprises the set of parents of the N best-quality derived variables generated from the seed pool before the update;
    determining a target derived variable set from a first parent cumulative variable set and a second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, wherein the first parent cumulative variable set and the second parent cumulative variable set are derived variable parents selected based on the updated seed pool of the target genetic algorithm model;
    if the target derived variable set satisfies a quality convergence condition for derived variables, outputting the derived variables in the target derived variable set as sample features of the risk identification model; wherein
    determining the target derived variable set from the first parent cumulative variable set and the second parent cumulative variable set, in the mutation direction that yields the best derived variable quality, comprises:
    selecting, according to the first parent cumulative variable set and by means of a derived variable parent matching model, M second parents in the second parent cumulative variable set that match a first parent in the first parent cumulative variable set, so as to generate a candidate derived variable set;
    selecting the N best-quality derived variables from the candidate derived variable set as the target derived variable set.
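
Claim 2 frames parent matching as reinforcement learning: the first parent is the state, the choice of a second parent is the action, the policy is a probability distribution over second parents, and the quality of the resulting derived variable is the reward. The claims do not name a specific learning algorithm, so the Python sketch below substitutes a simple gradient-bandit update with a softmax policy as a stand-in. All names (`train_parent_matcher`, `top_m_matches`, `episodes`, `lr`, `seed`) are hypothetical, and the quality function is supplied by the caller.

```python
import math
import random
from typing import Callable, Dict, Hashable, List, Sequence


def _softmax(prefs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over action preferences."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_parent_matcher(
    first_parents: Sequence[Hashable],
    second_parents: Sequence[Hashable],
    quality: Callable[[Hashable, Hashable], float],
    episodes: int = 2000,
    lr: float = 0.1,
    seed: int = 0,
) -> Dict[Hashable, List[float]]:
    """Learn, per first parent (state), a distribution over second parents (actions)."""
    rng = random.Random(seed)
    k = len(second_parents)
    prefs = {s: [0.0] * k for s in first_parents}
    baseline = {s: 0.0 for s in first_parents}   # running mean reward per state
    visits = {s: 0 for s in first_parents}
    for _ in range(episodes):
        state = rng.choice(list(first_parents))
        probs = _softmax(prefs[state])
        action = rng.choices(range(k), weights=probs)[0]
        reward = quality(state, second_parents[action])  # derived variable quality
        visits[state] += 1
        baseline[state] += (reward - baseline[state]) / visits[state]
        advantage = reward - baseline[state]
        # Gradient-bandit update: raise the chosen action's preference when the
        # reward beats the baseline, and lower the others proportionally.
        for i in range(k):
            grad = (1.0 - probs[i]) if i == action else -probs[i]
            prefs[state][i] += lr * advantage * grad
    return {s: _softmax(prefs[s]) for s in first_parents}


def top_m_matches(
    policy: Dict[Hashable, List[float]],
    second_parents: Sequence[Hashable],
    m: int,
) -> Dict[Hashable, List[Hashable]]:
    """Pick the M most probable second parents for each first parent."""
    return {
        s: [second_parents[i]
            for i in sorted(range(len(probs)), key=lambda i: -probs[i])[:m]]
        for s, probs in policy.items()
    }
```

Under these assumptions, `top_m_matches` returns the M second parents per first parent from which a candidate derived variable set could be built. A production matcher would likely need a richer state representation than a bare parent identifier.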
PCT/CN2021/073963 2020-03-31 2021-01-27 Derived variable selection method and apparatus for risk identification model WO2021196843A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010244612.0 2020-03-31
CN202010244612.0A CN111461892B (en) 2020-03-31 2020-03-31 Method and device for selecting derived variables of risk identification model

Publications (1)

Publication Number Publication Date
WO2021196843A1 true WO2021196843A1 (en) 2021-10-07

Family

ID=71683398

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073963 WO2021196843A1 (en) 2020-03-31 2021-01-27 Derived variable selection method and apparatus for risk identification model

Country Status (2)

Country Link
CN (1) CN111461892B (en)
WO (1) WO2021196843A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461892B (en) * 2020-03-31 2021-07-06 支付宝(杭州)信息技术有限公司 Method and device for selecting derived variables of risk identification model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080154814A1 (en) * 2006-12-22 2008-06-26 American Express Travel Related Services Company, Inc. Automated Predictive Modeling
CN108346098A (en) * 2018-01-19 2018-07-31 阿里巴巴集团控股有限公司 A kind of method and device of air control rule digging
CN108875815A (en) * 2018-06-04 2018-11-23 深圳市研信小额贷款有限公司 Feature Engineering variable determines method and device
CN110046799A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Decision optimization method and device
CN110472742A (en) * 2019-07-11 2019-11-19 阿里巴巴集团控股有限公司 A kind of model variable determines method, device and equipment
CN110503566A (en) * 2019-07-08 2019-11-26 中国平安人寿保险股份有限公司 Air control method for establishing model, device, computer equipment and storage medium
CN111461892A (en) * 2020-03-31 2020-07-28 支付宝(杭州)信息技术有限公司 Method and device for selecting derived variables of risk identification model

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040215551A1 (en) * 2001-11-28 2004-10-28 Eder Jeff S. Value and risk management system for multi-enterprise organization
CN101782976B (en) * 2010-01-15 2013-04-10 南京邮电大学 Automatic selection method for machine learning in cloud computing environment
US10043591B1 (en) * 2015-02-06 2018-08-07 Brain Trust Innovations I, Llc System, server and method for preventing suicide
CN107679985B (en) * 2017-09-12 2021-01-05 创新先进技术有限公司 Risk feature screening and description message generating method and device and electronic equipment
CN109492844B (en) * 2017-09-12 2022-04-15 杭州蚂蚁聚慧网络技术有限公司 Method and device for generating business strategy
CN107862468A (en) * 2017-11-23 2018-03-30 深圳市智物联网络有限公司 The method and device that equipment Risk identification model is established
CN108460523B (en) * 2018-02-12 2020-08-21 阿里巴巴集团控股有限公司 Wind control rule generation method and device
US20190325528A1 (en) * 2018-04-24 2019-10-24 Brighterion, Inc. Increasing performance in anti-money laundering transaction monitoring using artificial intelligence
CN109191283A (en) * 2018-08-30 2019-01-11 成都数联铭品科技有限公司 Method for prewarning risk and system
CN109523118A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Risk data screening technique, device, computer equipment and storage medium
CN109711435A (en) * 2018-12-03 2019-05-03 三峡大学 A kind of support vector machines on-Line Voltage stability monitoring method based on genetic algorithm
CN109816090A (en) * 2019-02-15 2019-05-28 南京邮电大学 A kind of modified EO-1 hyperion end member extraction method based on discrete variable
CN110008991B (en) * 2019-02-26 2023-05-02 创新先进技术有限公司 Risk event identification method, risk identification model generation method, risk event identification device, risk identification equipment and risk identification medium
CN110442712B (en) * 2019-07-05 2023-08-22 创新先进技术有限公司 Risk determination method, risk determination device, server and text examination system
CN110458572B (en) * 2019-07-08 2023-11-24 创新先进技术有限公司 User risk determining method and target risk recognition model establishing method
CN110503296B (en) * 2019-07-08 2022-05-06 招联消费金融有限公司 Test method, test device, computer equipment and storage medium
CN110852444A (en) * 2019-10-11 2020-02-28 支付宝(杭州)信息技术有限公司 Method and apparatus for determining derived variables of machine learning model

Also Published As

Publication number Publication date
CN111461892A (en) 2020-07-28
CN111461892B (en) 2021-07-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21781273; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21781273; Country of ref document: EP; Kind code of ref document: A1)