WO2021012783A1 - Insurance policy underwriting model training method employing big data, and underwriting risk assessment method

Info

Publication number
WO2021012783A1
WO2021012783A1 PCT/CN2020/093039 CN2020093039W WO2021012783A1 WO 2021012783 A1 WO2021012783 A1 WO 2021012783A1 CN 2020093039 W CN2020093039 W CN 2020093039W WO 2021012783 A1 WO2021012783 A1 WO 2021012783A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
risk
sample data
data sets
model
Prior art date
Application number
PCT/CN2020/093039
Other languages
French (fr)
Chinese (zh)
Inventor
王进
刘行行
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021012783A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • The embodiments of the present application relate to the field of artificial intelligence, and in particular to a big data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium, and to an underwriting risk assessment method.
  • The purpose of the embodiments of the present application is to provide a big data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium that can solve the problem of low underwriting risk assessment accuracy in traditional data mining and data modeling.
  • The embodiment of the present application provides a method for training an insurance policy underwriting model based on big data, including the following steps:
  • Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
  • Acquire multiple sample data sets of multiple customers from a customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;
  • Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • The embodiment of the present application also provides a big data-based insurance policy underwriting model training system, including:
  • A configuration module, used to pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
  • An acquiring module, used to acquire multiple sample data sets of multiple customers from the customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;
  • A filling module, used to fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;
  • An analysis module, used to analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;
  • A screening module, used to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
  • A risk feature combination output module, used to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets;
  • A training module, used to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • An embodiment of the present application further provides a computer device, the computer device including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:
  • Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
  • Acquire multiple sample data sets of multiple customers from a customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;
  • Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • An embodiment of the present application also provides a computer-readable storage medium having computer-readable instructions stored therein, where the computer-readable instructions may be executed by at least one processor so that the at least one processor executes the following steps:
  • Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
  • Acquire multiple sample data sets of multiple customers from a customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;
  • Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • The embodiments of the present application also provide an underwriting risk assessment method, which includes the following steps:
  • Acquire a target data set of a target customer, where the target data set includes multiple risk features corresponding to multiple risk feature items;
  • Predict on the risk feature combination according to the underwriting risk assessment model to obtain the risk coefficient of the target customer, where the underwriting risk assessment model is obtained by training with the above big data-based insurance policy underwriting model training method.
  • The underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model;
  • The step of predicting on the risk feature combination according to the underwriting risk assessment model to obtain the risk coefficient of the target customer includes:
  • The risk coefficient of the target customer is calculated according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.
  • The big data-based insurance policy underwriting model training method, system, computer device, computer-readable storage medium, and underwriting risk assessment method provided by the embodiments of the present application obtain multiple risk feature combinations from the iterative decision tree model, based on the sample data sets and the information value of each risk feature item, and input these risk feature combinations into multiple target models to construct an insurance policy underwriting risk assessment model.
  • The constructed insurance policy underwriting risk assessment model therefore combines the data evaluation advantages of the multiple target models and has high evaluation accuracy for underwriting risk assessment.
  • FIG. 1 is a flowchart of Embodiment 1 of the big data-based insurance policy underwriting model training method of this application.
  • FIG. 2 is a flowchart of step S104 in FIG. 1.
  • FIG. 3 is a schematic diagram of the program modules of Embodiment 2 of the big data-based insurance policy underwriting model training system of this application.
  • FIG. 4 is a schematic diagram of the hardware structure of Embodiment 3 of the computer device of this application.
  • FIG. 5 is a flowchart of Embodiment 5 of the underwriting risk assessment method of this application.
  • FIG. 1 shows a flowchart of the steps of the big data-based insurance policy underwriting model training method in Embodiment 1 of the present application. It can be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. Details are as follows.
  • Step S100: Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items.
  • The set of risk feature items may include multiple subsets, such as a customer-related risk feature subset, an insurance policy risk feature subset, a salesperson risk feature subset, an associated information risk feature subset, an Internet risk feature subset, and so on.
  • The customer-related risk feature subset may include basic customer information (gender, age, occupation, education, etc.), social security information, and the social relationship with the beneficiary.
  • The insurance policy risk feature subset may include the policy insurance amount, insurance type information, etc.
  • The salesperson risk feature subset may include basic information (the salesperson's gender, age, and years of experience), sales habits, commissions, product sales data, team members, attendance information, quality information, and so on.
  • The associated information risk feature subset may include family information, etc.
  • The Internet risk feature subset may include purchase behavior information, product-related information, and so on.
  • The multiple risk feature items in the above set of risk feature items can be customized by the user, or can be obtained by feature classification through analysis with an unsupervised neural network model.
  • Step S102: Acquire multiple sample data sets of multiple customers from the customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items.
  • For example, each sample data set includes N sample original features corresponding to N risk feature items, and the M sample data sets are:
  • $a_1 = (a_{11}, a_{12}, a_{13}, \ldots, a_{1N})$
  • $a_2 = (a_{21}, a_{22}, a_{23}, \ldots, a_{2N})$
  • $\ldots$
  • $a_M = (a_{M1}, a_{M2}, a_{M3}, \ldots, a_{MN})$
  • Step S104: Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items.
  • The sample original features of the multiple sample data sets may constitute N feature columns, for example:
  • ... a feature column is formed; ...; $a_{13}, a_{23}, a_{33}, \ldots, a_{M3}$ are filled into the field corresponding to a field name to form a feature column.
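  • The filling in step S104 can be sketched as a row-to-column rearrangement. This is a minimal illustration, not the application's implementation: the field names, the sample values, and the function name `to_feature_columns` are all hypothetical.

```python
# Hypothetical risk feature item names (field names); not taken from the source.
FIELD_NAMES = ["age", "occupation", "insured_amount"]

def to_feature_columns(sample_data_sets, field_names):
    """Fill each sample's original features into the field of its risk feature item.

    sample_data_sets: M rows, each holding N original features (a_i1 .. a_iN).
    Returns a dict that maps each field name to a column of M values.
    """
    for row in sample_data_sets:
        if len(row) != len(field_names):
            raise ValueError("each sample needs one feature per risk feature item")
    return {
        name: [row[j] for row in sample_data_sets]
        for j, name in enumerate(field_names)
    }

samples = [
    (34, "teacher", 500_000),   # a_1
    (51, "driver", 200_000),    # a_2
    (27, None, 300_000),        # a_3, with one blank (null) feature
]
columns = to_feature_columns(samples, FIELD_NAMES)
```

  • Each value of `columns` is one feature column; a `None` entry marks a sample blank feature that still has to be filled.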
  • Step S104 further includes:
  • S104a: Divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule.
  • For example, the multiple sample data sets are input into an RF (Random Forest) classification model, which classifies the multiple samples corresponding to the multiple sample data sets into first-type samples and second-type samples.
  • The first-type samples are samples of existing customers.
  • The second-type samples are samples of new customers. The multiple sample data sets are thereby divided into the first group of sample data sets, corresponding to the first-type samples, and the second group of sample data sets, corresponding to the second-type samples. The risk features in the sample data sets of existing customers are relatively complete, while those in the sample data sets of new customers may be partly incomplete.
  • S104b: Determine whether the multiple sample data sets in the second group of sample data sets include one or more missing-data samples. The sample data set of a missing-data sample includes one or more sample blank features, where a sample blank feature means that the sample original feature corresponding to a certain risk feature item is null.
  • A KD tree is constructed from the samples in the first group of sample data sets, and the sample original features of the missing-data sample are input into the nearest neighbor search (KD_tree, K-dimensional tree) model. The KD_tree model finds the target sample closest to the missing-data sample, and the target data in the target sample that corresponds to the sample blank feature is filled into the corresponding field position.
  • KD_tree nearest neighbor search
  • The sample data set of the missing-data sample and the multiple sample data sets in the first group of sample data sets are input into the random forest classification model to obtain, for each sample in the first group of sample data sets, the node number of its corresponding leaf node in the decision tree, where each leaf node has a unique node number.
  • This embodiment can solve the problem of blank original features of some samples.
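  • The nearest-neighbor filling described above can be sketched as follows. This is a minimal illustration under stated assumptions: a brute-force search stands in for the KD tree (it returns the same nearest sample, only less efficiently), features are assumed to be already numerically encoded, and the function names and data are hypothetical.

```python
import math

def nearest_complete_sample(incomplete, complete_samples):
    """Return the complete sample closest to `incomplete`.

    Distance is Euclidean over the features that are present (not None)
    in `incomplete`. Brute force is used here in place of a KD tree.
    """
    known = [j for j, value in enumerate(incomplete) if value is not None]
    if not known:
        raise ValueError("sample has no known features to compare on")

    def distance(candidate):
        return math.sqrt(sum((incomplete[j] - candidate[j]) ** 2 for j in known))

    return min(complete_samples, key=distance)

def fill_blank_features(incomplete, complete_samples):
    """Copy the target sample's values into the blank (None) positions."""
    target = nearest_complete_sample(incomplete, complete_samples)
    return [target[j] if value is None else value
            for j, value in enumerate(incomplete)]

# First group: samples with complete features (for example, existing customers).
first_group = [
    [30.0, 1.0, 50.0],
    [55.0, 0.0, 20.0],
    [41.0, 1.0, 80.0],
]
# A missing-data sample from the second group; None marks a blank feature.
new_customer = [32.0, None, 48.0]
filled = fill_blank_features(new_customer, first_group)
```

  • In practice the features would usually be scaled before distances are compared, so that a field with a large numeric range does not dominate the search.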
  • Step S106: Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value (IV) of each risk feature item.
  • The information value indicates the degree of influence of the corresponding risk feature on the prediction accuracy of the risk assessment.
  • $IV_i = WoE_i \times (Py_i - Pn_i)$
  • $WoE_i$: Weight of Evidence.
  • The WoE value expresses the impact on the underwriting risk assessment result when a variable takes a certain value.
  • Taking age ranges as an example: after the feature column is discretized, $Py_i$ represents the ratio of the number of high-risk policies in the i-th age range to the number of high-risk policies in all age ranges, and $Pn_i$ represents the ratio of the number of non-high-risk policies in the i-th age range to the number of non-high-risk policies in all age ranges.
  • $IV_i$ represents the information value of each age range, and $IV$ represents the information value of all age ranges in the feature column.
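  • The information value calculation can be sketched as follows. The text gives $IV_i = WoE_i \times (Py_i - Pn_i)$ but does not spell out $WoE_i$ or how the per-range values are aggregated; this sketch assumes the conventional definitions $WoE_i = \ln(Py_i / Pn_i)$ and $IV = \sum_i IV_i$. The counts and the function name are hypothetical.

```python
import math

def information_value(bins):
    """Compute per-bin IV_i and the total IV of one discretized feature column.

    bins: list of (high_risk_count, non_high_risk_count), one pair per bin.
    Assumes WoE_i = ln(Py_i / Pn_i) and IV = sum of IV_i; both are the
    conventional definitions, not formulas stated in the source text.
    """
    total_high = sum(high for high, _ in bins)
    total_low = sum(low for _, low in bins)
    iv_per_bin = []
    for high, low in bins:
        if high == 0 or low == 0:
            raise ValueError("empty class in a bin; merge bins or smooth counts")
        py = high / total_high              # Py_i
        pn = low / total_low                # Pn_i
        woe = math.log(py / pn)             # WoE_i (assumed definition)
        iv_per_bin.append(woe * (py - pn))  # IV_i = WoE_i * (Py_i - Pn_i)
    return iv_per_bin, sum(iv_per_bin)

# Hypothetical counts for three age ranges: (high-risk, non-high-risk).
age_bins = [(10, 90), (30, 70), (60, 40)]
iv_i, iv_total = information_value(age_bins)
```

  • Under these assumed definitions every $IV_i$ is non-negative, because $WoE_i$ and $(Py_i - Pn_i)$ always share the same sign.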
  • Step S108: Screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item.
  • The univariate analysis in step S106 makes it possible to screen out a subset of risk feature items (that is, the multiple target risk feature items) from the multiple risk feature items; the risk features corresponding to the selected risk feature items are then input into the iterative decision tree model. This step serves as a basis for eliminating ineffective feature items and thereby reduces the training burden.
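  • The screening in step S108 can be sketched as a simple threshold filter. The text does not specify a cut-off, so the threshold of 0.02, the information values, and the function name below are illustrative assumptions.

```python
def screen_feature_items(iv_by_item, threshold=0.02):
    """Keep the risk feature items whose information value reaches the threshold.

    The default threshold of 0.02 is an illustrative assumption; the source
    does not specify a cut-off.
    """
    return [item for item, iv in iv_by_item.items() if iv >= threshold]

# Hypothetical information values for four risk feature items.
iv_by_item = {"age": 0.35, "occupation": 0.12, "gender": 0.01, "education": 0.04}
target_items = screen_feature_items(iv_by_item)
```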
  • Step S110: Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model (Gradient Boosting Decision Tree, GBDT), so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets.
  • The iterative decision tree model can be a GBDT (Gradient Boosting Decision Tree) model, which is based on an iterative decision tree algorithm.
  • The algorithm is composed of multiple decision trees.
  • The specific structure is as follows: each tree is built on the residuals of the preceding trees, so each tree depends on the result of the previous tree. A fixed order among the decision trees therefore needs to be guaranteed.
  • The multiple decision trees in the GBDT model are used to classify the multiple sample data sets, so as to find the association relationships among the risk features in the multiple sample data sets and combine the associated features into risk feature combinations.
  • Each decision tree in the GBDT model includes a root node, intermediate nodes, and leaf nodes.
  • The root node and each intermediate node have a corresponding risk feature item (such as age) and risk feature value (such as age 30). If the customer of a sample is older than 30, the sample is assigned to the right child node of the node; otherwise it is assigned to the left child node. Lower nodes work the same way, until the sample falls into a leaf node. The risk feature combination corresponding to the sample is obtained from the leaf nodes that the sample falls into on each decision tree. When there are multiple samples, multiple corresponding risk feature combinations are obtained.
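  • The tree traversal described in step S110 can be sketched as follows. This is a minimal illustration under stated assumptions: the two trees are hand-written rather than trained, the feature names `insured_amount` and `years_insured` are hypothetical, and the one-hot encoding of the reached leaf is one common way to represent the leaf positions, not something the text prescribes.

```python
# Each tree is a nested dict. Internal nodes hold a risk feature item and a
# split value; leaves hold a leaf index. The trees are hand-written here for
# illustration; in practice they would come from a trained GBDT model.
TREE_1 = {
    "item": "age", "value": 30,
    "left": {"leaf": 0},
    "right": {"item": "insured_amount", "value": 500_000,
              "left": {"leaf": 1}, "right": {"leaf": 2}},
}
TREE_2 = {
    "item": "years_insured", "value": 5,
    "left": {"leaf": 0},
    "right": {"leaf": 1},
}

def count_leaves(node):
    """Count the leaf nodes below (and including) this node."""
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

def leaf_index(node, sample):
    """Walk one tree: go right if the feature exceeds the split value, else left."""
    while "leaf" not in node:
        branch = "right" if sample[node["item"]] > node["value"] else "left"
        node = node[branch]
    return node["leaf"]

def risk_feature_combination(trees, sample):
    """One-hot encode the leaf reached in each tree and concatenate the codes."""
    encoded = []
    for tree in trees:
        one_hot = [0] * count_leaves(tree)
        one_hot[leaf_index(tree, sample)] = 1
        encoded.extend(one_hot)
    return encoded

sample = {"age": 42, "insured_amount": 300_000, "years_insured": 8}
combination = risk_feature_combination([TREE_1, TREE_2], sample)
```

  • Each position in `combination` corresponds to one root-to-leaf path, that is, one conjunction of split conditions, which is why the leaf positions can serve as combined features for the downstream models.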
  • Step S112: Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • The multiple target models include an LR (Logistic Regression) model, an FM (Factorization Machine) model, and a deep neural network model.
  • LR model: It is highly interpretable. Using the multiple risk feature combinations output by the GBDT model as the input of the LR model can also effectively improve the evaluation effect of the LR model.
  • FM model: Using the multiple risk feature combinations output by the GBDT model as the input of the FM model makes it possible to better mine the correlations between risk feature items under highly sparse conditions, especially when a given feature cross does not appear in the training samples.
  • Deep neural network model: Compared with the LR model, it is less interpretable, but it has the advantage of high evaluation accuracy. Using the multiple risk feature combinations output by the GBDT model as the input of the deep neural network model can improve the evaluation accuracy.
  • The deep neural network model may include a DNN (Deep Neural Network) or an ANN (Artificial Neural Network).
  • Because a DNN is suitable for big data and distributed training, training a DNN is taken as an example below.
  • The input layer of the DNN receives the multiple risk feature combinations output by the GBDT model, and the output layer outputs predicted risk coefficients. That is, for each sample data set, after the multiple risk feature combinations corresponding to the sample data set are input into the DNN, the DNN outputs the corresponding predicted risk coefficient. If the probability that each predicted risk coefficient matches the sample label of the corresponding sample reaches a preset threshold, which can be set according to empirical values, an optimized DNN can be considered to have been obtained.
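  • As one illustration of how a target model in step S112 can exploit feature interactions, the scoring function of a second-order factorization machine can be sketched as follows. The text does not give the FM formula; this sketch uses the standard form $w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$ and assumes a sigmoid to map the score to a risk coefficient in (0, 1). The parameters are hypothetical and untrained.

```python
import math

def fm_risk_coefficient(x, w0, w, v):
    """Second-order factorization machine score squashed to (0, 1).

    x: input vector (for example, a one-hot risk feature combination)
    w0: bias; w: one linear weight per input; v: one latent vector per input.
    The sigmoid output mapping is an assumption of this sketch.
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return 1.0 / (1.0 + math.exp(-(w0 + linear + pairwise)))

# Hypothetical, untrained parameters for a 3-dimensional input.
x = [1, 0, 1]
w0 = -0.5
w = [0.2, -0.1, 0.4]
v = [[0.1, 0.3], [0.0, 0.2], [0.5, -0.1]]
score = fm_risk_coefficient(x, w0, w, v)
```

  • Because each pairwise weight is the inner product of two latent vectors rather than a free parameter, the model can assign a weight to a feature pair even if that pair never co-occurs in the training samples.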
  • The big data-based insurance policy underwriting model training system 20 may include or be divided into one or more program modules. The one or more program modules are stored in a storage medium and executed by one or more processors to complete the present application and implement the above big data-based insurance policy underwriting model training method.
  • A program module in the embodiments of the present application refers to a series of computer-readable instruction segments that can complete a specific function, and is better suited than the program itself to describing the execution process of the big data-based insurance policy underwriting model training system 20 in the storage medium.
  • The following description specifically introduces the functions of each program module in this embodiment:
  • The configuration module 200 is configured to pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items.
  • The acquiring module 202 is configured to acquire multiple sample data sets of multiple customers from the customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items.
  • The filling module 204 is configured to fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items.
  • The filling module 204 is further configured to: divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule; determine whether the multiple sample data sets in the second group of sample data sets include one or more missing-data samples, where the sample data set of a missing-data sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a certain risk feature item is null; and, if the multiple sample data sets in the second group of sample data sets include one or more missing-data samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets and fill them into the field positions corresponding to the sample blank features.
  • Dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule specifically includes: inputting the multiple sample data sets into a random forest classification model, and classifying the multiple samples corresponding to the multiple sample data sets into first-type samples and second-type samples, where the first-type samples correspond to the first group of sample data sets and the second-type samples correspond to the second group of sample data sets.
  • Selecting one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill into the field positions corresponding to the sample blank features includes: inputting the sample original features of the missing-data sample into the nearest neighbor search model; finding, through the nearest neighbor search model, the target sample closest to the missing-data sample; and filling the target data in the target sample that corresponds to the sample blank feature into the corresponding field position, where the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
  • The analysis module 206 is configured to analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item.
  • The screening module 208 is configured to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item.
  • The risk feature combination output module 210 is configured to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets.
  • The training module 212 is configured to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • The multiple target models may include a logistic regression model, a factorization machine model, and a deep neural network model.
  • The computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions.
  • The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers).
  • The computer device 2 at least includes, but is not limited to, a memory 21, a processor 22, a network interface 23, and a big data-based insurance policy underwriting model training system 20, which can communicate with each other through a system bus. Specifically:
  • The memory 21 includes at least one type of computer-readable storage medium.
  • The readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • The memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or memory of the computer device 2.
  • The memory 21 may also be an external storage device of the computer device 2, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • The memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • The memory 21 is generally used to store the operating system and various application software installed in the computer device 2, for example, the program code of the big data-based insurance policy underwriting model training system 20 in the fifth embodiment.
  • The memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 22 is generally used to control the overall operation of the computer device 2.
  • The processor 22 is used to run the program code or process the data stored in the memory 21, for example, to run the big data-based insurance policy underwriting model training system 20 so as to implement the big data-based insurance policy underwriting model training method of the first embodiment.
  • The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices.
  • The network interface 23 is used to connect the computer device 2 with an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • The network may be a wireless or wired network such as an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • FIG. 4 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • The big data-based insurance policy underwriting model training system 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present application.
  • FIG. 3 shows a schematic diagram of the program modules of the second embodiment of the big data-based insurance policy underwriting model training system 20.
  • The big data-based insurance policy underwriting model training system 20 can be divided into a configuration module 200, an acquiring module 202, a filling module 204, an analysis module 206, a screening module 208, a risk feature combination output module 210, and a training module 212.
  • A program module in this application refers to a series of computer-readable instruction segments that can complete a specific function, and is better suited than a program to describing the execution process of the big data-based insurance policy underwriting model training system 20 in the computer device 2.
  • The specific functions of the program modules 200-212 have been described in detail in the second embodiment and are not repeated here.
  • The computer-readable storage medium may be non-volatile or volatile, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, app stores, etc., on which computer-readable instructions are stored; the corresponding functions are realized when the instructions are executed by the processor.
  • The computer-readable storage medium of this embodiment is used to store the big data-based insurance policy underwriting model training system 20, and when it is executed by a processor, the following steps are performed:
  • Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
  • Acquire multiple sample data sets of multiple customers from a customer information database based on the set of risk feature items, where each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;
  • Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  • FIG. 5 shows a flowchart of the steps of the underwriting risk assessment method of the fifth embodiment of the present application. It can be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. Details are as follows.
  • Step S200 Obtain a target data set of a target customer, where the target data set includes multiple risk features corresponding to multiple risk feature items.
  • the multiple risk characteristics in the target data set may come from the fill-in content of the target customer's form, or from the company's internal historical data for the target customer, or from a third-party database.
  • Step S202 Determine whether there are blank risk characteristics in the target data set of the target customer. If yes, go to step S204; otherwise, go to step S206.
  • Step S204 Find the target sample closest to the target customer through the nearest neighbor search model, so as to fill the blank risk characteristics of the target data set with the risk characteristics in the target sample.
  • Step S206 Input the filled target data set into the iterative decision tree model. Proceed to step S210.
  • Step S208 Input the target data set obtained in step S200 into the iterative decision tree model. Proceed to step S210.
  • Step S210 output the corresponding risk feature combination through the iterative decision tree model.
  • Step S212 Predict the risk feature combination according to the underwriting risk assessment model to obtain the risk coefficient of the target customer.
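The branching in steps S202 to S212 can be sketched as follows. This is only an illustrative outline, not the applicant's implementation: the function `assess` and its three callable parameters (`fill_blanks`, `gbdt_combine`, `risk_model`) are hypothetical names, and a blank risk feature is assumed to be represented by `None`.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical representation: risk feature item name -> value, None when blank.
Features = Dict[str, Optional[float]]


def assess(target: Features,
           fill_blanks: Callable[[Features], Features],
           gbdt_combine: Callable[[Features], Sequence[int]],
           risk_model: Callable[[Sequence[int]], float]) -> float:
    """Run steps S202 to S212 for one target customer."""
    # S202: check whether any risk feature is blank.
    if any(value is None for value in target.values()):
        # S204 then S206: fill the blanks from the nearest sample first.
        model_input = fill_blanks(target)
    else:
        # S208: use the data set obtained in S200 unchanged.
        model_input = target
    # S210: the iterative decision tree model outputs the feature combination.
    combination = gbdt_combine(model_input)
    # S212: the underwriting risk assessment model yields the risk coefficient.
    return risk_model(combination)
```

Passing the three models in as callables keeps the control flow separate from any particular nearest neighbor, decision tree, or assessment model.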
  • The underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model.
  • Step S212 may further include: calculating the risk coefficient of the target customer according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.
  • The calculation method can be customized. For example, the average of the first, second, and third risk coefficients can be calculated and used as the risk coefficient of the target customer.
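As an illustration of the averaging mentioned above, the following minimal sketch combines the three risk coefficients. The function name `combine_risk_coefficients` and the optional `weights` parameter are assumptions added for illustration; the text itself only names the plain average as one possible customization.

```python
from typing import Optional, Sequence


def combine_risk_coefficients(coeffs: Sequence[float],
                              weights: Optional[Sequence[float]] = None) -> float:
    """Combine the per-model risk coefficients into one value.

    With weights=None this is the plain average described in the text;
    the weighted variant is a hypothetical customization.
    """
    if weights is None:
        weights = [1.0] * len(coeffs)
    if len(weights) != len(coeffs) or sum(weights) <= 0:
        raise ValueError("weights must match coeffs and have a positive sum")
    return sum(w * c for w, c in zip(weights, coeffs)) / sum(weights)


# First (LR), second (FM), and third (DNN) risk coefficients of one customer.
risk = combine_risk_coefficients([0.2, 0.4, 0.6])
```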

Abstract

An insurance policy underwriting model training method employing big data comprises: acquiring multiple sample data sets of multiple customers on the basis of a pre-configured risk feature item set, wherein each sample data set comprises multiple sample original features corresponding to a customer and multiple risk feature items; adding the multiple sample original features in each sample data set to a field of a corresponding risk feature item (S104); acquiring multiple target risk feature items from the multiple risk feature items by means of filtering and on the basis of an information value of each risk feature item (S108); inputting the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model so as to output, by means of the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets (S110); and training multiple target models according to the multiple risk feature combinations so as to construct an insurance policy underwriting risk assessment model (S112). The method has high assessment accuracy for underwriting risk assessment.

Description

Big Data-Based Insurance Policy Underwriting Model Training Method and Underwriting Risk Assessment Method
This application claims priority to Chinese patent application No. 201910665008.2, filed with the Chinese Patent Office on July 23, 2019 and entitled "Big data-based insurance policy underwriting model training method and underwriting risk assessment method", the entire content of which is incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a big-data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium, and to an underwriting risk assessment method.
Background
As people's awareness of insurance gradually increases, commercial insurance has become an important part of the current social security system. According to available data, the number of policies held by some insurance institutions is in the tens of millions. After these policies are generated in the insurance system, they need to be underwritten to determine whether the information in each policy meets the requirements for coverage. At present, policies are generally underwritten manually: for example, policies are reviewed by hand on the basis of risk control rules, product pricing information for different customer groups, and auxiliary information (physical examination information, health survey information, financial survey information).
However, with the rapid development of big data mining, more and more reference data is available for underwriting. The inventor realized that manual underwriting not only consumes a great deal of manpower but is also inefficient, and that multi-dimensional data is difficult to use effectively, which results in low accuracy of underwriting risk assessment. Therefore, how to model data on the basis of big data and underwrite policies through the data model is one of the current research directions.
Summary of the Invention
In view of this, the purpose of the embodiments of the present application is to provide a big-data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium that can solve the problem of low underwriting risk accuracy in traditional data mining and data modeling.
To achieve the above objective, an embodiment of the present application provides a big-data-based insurance policy underwriting model training method, including the following steps:
Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
Based on the set of risk feature items, acquire multiple sample data sets of multiple customers from a customer information database, where each sample data set includes multiple sample original features of the corresponding customer that correspond to the multiple risk feature items;
Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;
Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;
Select multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
To achieve the above objective, an embodiment of the present application further provides a big-data-based insurance policy underwriting model training system, including:
A configuration module, configured to pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
An acquisition module, configured to acquire, based on the set of risk feature items, multiple sample data sets of multiple customers from a customer information database, where each sample data set includes multiple sample original features of the corresponding customer that correspond to the multiple risk feature items;
A filling module, configured to fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;
An analysis module, configured to analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;
A screening module, configured to select multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
A risk feature combination output module, configured to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
A training module, configured to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
To achieve the above objective, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:
Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
Based on the set of risk feature items, acquire multiple sample data sets of multiple customers from a customer information database, where each sample data set includes multiple sample original features of the corresponding customer that correspond to the multiple risk feature items;
Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;
Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;
Select multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
To achieve the above objective, an embodiment of the present application further provides a computer-readable storage medium in which computer-readable instructions are stored. The computer-readable instructions can be executed by at least one processor, so that the at least one processor performs the following steps:
Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items;
Based on the set of risk feature items, acquire multiple sample data sets of multiple customers from a customer information database, where each sample data set includes multiple sample original features of the corresponding customer that correspond to the multiple risk feature items;
Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;
Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;
Select multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
To achieve the above objective, an embodiment of the present application further provides an underwriting risk assessment method, including the following steps:
Obtain a target data set of a target customer, where the target data set includes multiple risk features corresponding to multiple risk feature items;
Determine whether there are blank risk features in the target data set of the target customer;
If there are blank risk features, find the target sample closest to the target customer through a nearest neighbor search model, and fill the blank risk features of the target data set with the corresponding risk features in the target sample;
Input the filled target data set into an iterative decision tree model;
Output the corresponding risk feature combination through the iterative decision tree model; and
Apply an underwriting risk assessment model to the risk feature combination to obtain the risk coefficient of the target customer, where the underwriting risk assessment model is obtained by training with the above big-data-based insurance policy underwriting model training method.
Preferably, the underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model;
The step of applying the underwriting risk assessment model to the risk feature combination to obtain the risk coefficient of the target customer includes:
Calculating the risk coefficient of the target customer according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.
The big-data-based insurance policy underwriting model training method, system, computer device, computer-readable storage medium, and underwriting risk assessment method provided by the embodiments of the present application obtain the multiple risk feature combinations output by the iterative decision tree model on the basis of the sample data sets and the information value of each risk feature item, and input the multiple risk feature combinations into multiple target models to construct the insurance policy underwriting risk assessment model. The constructed model combines the data evaluation advantages of the multiple target models and achieves high accuracy in underwriting risk assessment.
Brief Description of the Drawings
FIG. 1 is a flowchart of Embodiment 1 of the big-data-based insurance policy underwriting model training method of this application.
FIG. 2 is a flowchart of step S104 in FIG. 1.
FIG. 3 is a schematic diagram of the program modules of Embodiment 2 of the big-data-based insurance policy underwriting model training system of this application.
FIG. 4 is a schematic diagram of the hardware structure of Embodiment 3 of the computer device of this application.
FIG. 5 is a flowchart of Embodiment 5 of the underwriting risk assessment method of this application.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in this application without creative work fall within the protection scope of this application.
It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only where a person of ordinary skill in the art can implement the combination. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and falls outside the protection scope claimed by this application.
The following embodiments are described by way of example with the computer device 2 as the executing entity.
Embodiment 1
FIG. 1 shows a flowchart of the steps of the big-data-based insurance policy underwriting model training method of Embodiment 1 of the present application. It can be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.
Step S100: Pre-configure a set of risk feature items, where the set of risk feature items includes multiple risk feature items.
Exemplarily, the set of risk feature items may include multiple subsets, such as a customer-related risk feature subset, a policy risk feature subset, a salesperson risk feature subset, an associated-information risk feature subset, and an Internet risk feature subset. The customer-related risk feature subset may include basic customer information (gender, age, occupation, education, etc.), social security information, and the social relationship with the beneficiary. The policy risk feature subset may include the insured amount, insurance type information, etc. The salesperson risk feature subset may include basic information (the salesperson's gender, age, and years of experience), sales habits, commissions, product sales data, team affiliation, attendance information, quality information, and so on. The associated-information risk feature subset may include family information, etc. The Internet risk feature subset may include purchase behavior information, product association information, and so on.
It should be noted that the multiple risk feature items in the above set can be defined by the user or obtained through analysis by an unsupervised neural network model used for feature classification.
Step S102: Based on the set of risk feature items, acquire multiple sample data sets of multiple customers from a customer information database, where each sample data set includes multiple sample original features of the corresponding customer that correspond to the multiple risk feature items.
For example, M sample data sets corresponding to M customers are acquired, and each sample data set includes N sample original features corresponding to N risk feature items. The M sample data sets are:
$A_1 = (a_{11}, a_{12}, a_{13}, \ldots, a_{1N})$
$A_2 = (a_{21}, a_{22}, a_{23}, \ldots, a_{2N})$
$\ldots$
$A_M = (a_{M1}, a_{M2}, a_{M3}, \ldots, a_{MN})$
Step S104: Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items.
The sample original features in the multiple sample data sets can form N feature columns. For example:
$a_{11}, a_{21}, a_{31}, \ldots, a_{M1}$ are filled into the field corresponding to one field name to form a feature column; $a_{12}, a_{22}, a_{32}, \ldots, a_{M2}$ are filled into the field corresponding to another field name to form a feature column; and so on, until $a_{1N}, a_{2N}, a_{3N}, \ldots, a_{MN}$ are filled into the field corresponding to the last field name to form a feature column.
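The column construction described above amounts to transposing the M sample rows into N named columns. The following is a minimal sketch under that reading; the function name `build_feature_columns` and the example field names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def build_feature_columns(field_names: Sequence[str],
                          samples: Sequence[Sequence[Optional[float]]]
                          ) -> Dict[str, List[Optional[float]]]:
    """Turn M sample rows of N features into N feature columns of length M."""
    for row in samples:
        if len(row) != len(field_names):
            raise ValueError("every sample needs one value per risk feature item")
    # Column j collects the j-th original feature of every sample.
    return {name: [row[j] for row in samples]
            for j, name in enumerate(field_names)}


# Two customers (M = 2) and two risk feature items (N = 2); None marks a blank.
columns = build_feature_columns(["age", "insured_amount"],
                                [[30, 100000], [45, None]])
```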
In an exemplary embodiment, as shown in FIG. 2, step S104 further includes:
S104a: Divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule.
Exemplarily, the multiple sample data sets are input into an RF (Random Forest) classification model, which classifies the corresponding samples into first-type samples and second-type samples. The first-type samples are existing-customer samples and the second-type samples are new-customer samples. The multiple sample data sets are thus divided into the first group of sample data sets, corresponding to the first-type samples, and the second group of sample data sets, corresponding to the second-type samples. It is easy to see that the risk features in the sample data set of an existing-customer sample are relatively complete, whereas those of a new-customer sample may be incomplete.
S104b: Determine whether the second group of sample data sets includes one or more missing-data samples. The sample data set of a missing-data sample includes one or more sample blank features, where a sample blank feature means that the sample original feature corresponding to a certain risk feature item is null.
S104c: If so, select one or more sample original features from the sample data sets in the first group and fill them into the field positions corresponding to the sample blank features.
Exemplarily, a KD tree is constructed from the samples in the first group of sample data sets, and the sample original features of the missing-data sample are input into a nearest neighbor search (KD_tree, K-dimension tree) model. The KD_tree model finds the target sample closest to the missing-data sample, and the target data in that target sample corresponding to the sample blank feature is filled into the corresponding field position.
The sample data set of the missing-data sample and the sample data sets in the first group are input into the random forest classification model to obtain the node number of the leaf node in the decision tree corresponding to each sample in the first group, where each leaf node has a unique node number.
A KD tree is constructed from the node numbers of the leaf nodes corresponding to the samples in the first group, and the node number of the leaf node corresponding to the missing-data sample is input into the KD_tree model, which finds the target sample closest to the missing-data sample.
This embodiment can solve the problem that the sample original features of some samples are blank.
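The fill step can be illustrated with the short sketch below. It deliberately swaps the KD tree for a brute-force nearest neighbor search, which returns the same neighbor on small data and keeps the example self-contained. The function name `fill_from_nearest`, the use of `None` for a blank feature, and the Euclidean distance over the non-blank features are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def fill_from_nearest(incomplete: Sequence[Optional[float]],
                      complete_rows: Sequence[Sequence[float]]) -> List[float]:
    """Fill blank (None) features from the closest complete sample.

    Brute-force search stands in for the KD tree; complete_rows plays the
    role of the first group of sample data sets and must not be empty.
    """
    # Only the features that are present can take part in the distance.
    known = [j for j, value in enumerate(incomplete) if value is not None]

    def distance(row: Sequence[float]) -> float:
        return math.sqrt(sum((row[j] - incomplete[j]) ** 2 for j in known))

    nearest = min(complete_rows, key=distance)
    return [nearest[j] if value is None else value
            for j, value in enumerate(incomplete)]


# Existing-customer samples (complete) and one new-customer sample with a blank.
existing = [[30.0, 1.0, 100.0], [60.0, 0.0, 500.0]]
filled = fill_from_nearest([58.0, 0.0, None], existing)
```

For large first groups a real KD tree avoids scanning every sample, which is presumably why the text names that structure.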
Step S106: Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value (IV) of each risk feature item.
The information value indicates how strongly the corresponding risk feature affects prediction accuracy in risk assessment.
Take as an example the calculation of the information value of the risk feature (customer age) corresponding to the feature column $(a_{11}, a_{21}, a_{31}, \ldots, a_{M1})$:
$WoE_i = \ln\left(\dfrac{Py_i}{Pn_i}\right)$
$IV_i = WoE_i \cdot (Py_i - Pn_i)$
$IV = \sum_i IV_i$
$WoE_i$ (Weight of Evidence) is a way of discretizing a numerical value; the WoE value expresses the influence on the underwriting risk assessment result when the variable takes a certain value. After the feature column has been discretized, $Py_i$ is the ratio of the number of high-risk policies in the $i$-th age interval to the number of high-risk policies in all age intervals, and $Pn_i$ is the ratio of the number of non-high-risk policies in the $i$-th age interval to the number of non-high-risk policies in all age intervals. $IV_i$ is the information value of the $i$-th age interval, and $IV$ is the information value of the feature column over all age intervals.
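The quantities defined above can be computed as in the following sketch. It assumes the usual definition of weight of evidence as the natural logarithm of Py_i / Pn_i, takes the total IV as the sum of the per-interval values, and assumes that every interval contains at least one high-risk and one non-high-risk policy so that the logarithm is defined. The function name `information_value` and the example counts are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def information_value(bins: Sequence[Tuple[int, int]]) -> Tuple[List[float], float]:
    """Return (IV_i per interval, total IV) for one discretized feature column.

    bins[i] = (high-risk policy count, non-high-risk policy count) in interval i.
    """
    total_high = sum(high for high, _ in bins)
    total_low = sum(low for _, low in bins)
    iv_per_bin = []
    for high, low in bins:
        py = high / total_high   # share of all high-risk policies in this interval
        pn = low / total_low     # share of all non-high-risk policies in this interval
        woe = math.log(py / pn)  # assumed definition of the weight of evidence
        iv_per_bin.append(woe * (py - pn))
    return iv_per_bin, sum(iv_per_bin)


# Two hypothetical age intervals: (10 high-risk, 40 other) and (30 high-risk, 20 other).
per_interval, total_iv = information_value([(10, 40), (30, 20)])
```

Because the logarithm and the difference Py_i - Pn_i always share the same sign, each interval contributes a non-negative amount, and a feature whose intervals do not separate the two classes has an IV of zero.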
Step S108: According to the information value of each risk feature item, select multiple target risk feature items from the multiple risk feature items.
The univariate analysis in step S106 makes it possible to select a subset of the risk feature items (that is, the multiple target risk feature items); the risk features corresponding to the selected items are then input into the iterative decision tree model. This step provides the basis for eliminating ineffective feature items and thereby reduces the training burden.
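One simple way to carry out this selection is to keep the items whose information value reaches a cutoff, as sketched below. The text does not state a selection rule, so the cutoff approach, the function name `select_target_items`, the default `min_iv` of 0.02, and the example values are all assumptions.

```python
from typing import Dict, List


def select_target_items(iv_by_item: Dict[str, float],
                        min_iv: float = 0.02) -> List[str]:
    """Keep the risk feature items whose information value reaches min_iv.

    min_iv is a hypothetical cutoff; the source does not specify one.
    """
    return [name for name, iv in iv_by_item.items() if iv >= min_iv]


# Hypothetical information values for three risk feature items.
targets = select_target_items({"age": 0.31, "gender": 0.01, "insured_amount": 0.12})
```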
Step S110: Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model (GBDT, Gradient Boosting Decision Tree), so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets.
The iterative decision tree model may be a GBDT (Gradient Boosting Decision Tree) model, which is based on an iterative decision tree algorithm composed of multiple decision trees. Its structure is as follows: each tree fits the residual of the preceding K trees, that is, each tree depends on the results of the trees before it, so a certain order must be maintained among the decision trees. The multiple decision trees in the GBDT model thus make classification decisions on the multiple sample data sets, which makes it possible to find the associations between the risk features in the sample data sets and to combine the associated features into risk feature combinations.
Specifically, each decision tree in the GBDT model includes a root node, intermediate nodes, and leaf nodes. The root node and each intermediate node correspond to a risk feature item (such as age) and a risk feature value (such as 30 years old). If the customer age of a sample is greater than 30, the sample is assigned to the right child of the node; otherwise it is assigned to the left child. Lower-level nodes work in the same way until the sample reaches a leaf node. The risk feature combination corresponding to the sample is obtained from the leaf nodes it reaches in the individual decision trees. When there are multiple samples, multiple corresponding risk feature combinations are obtained.
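The routing of a sample to one leaf per tree can be sketched as follows. The trees here are written by hand rather than trained by gradient boosting, and the dictionary layout of a node (`item`, `value`, `left`, `right`, `leaf`), the function names, and the example thresholds are assumptions for illustration only.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical node layout: a leaf is {"leaf": id}; a split node is
# {"item": feature name, "value": threshold, "left": node, "right": node}.
Node = Dict[str, Any]


def leaf_id(tree: Node, sample: Dict[str, float]) -> int:
    """Route one sample from the root down to a leaf and return the leaf id."""
    node = tree
    while "leaf" not in node:
        # Greater than the node's value goes right, otherwise left.
        branch = "right" if sample[node["item"]] > node["value"] else "left"
        node = node[branch]
    return node["leaf"]


def feature_combination(trees: Sequence[Node], sample: Dict[str, float]) -> List[int]:
    """The leaf reached in each tree, in tree order, forms the combination."""
    return [leaf_id(tree, sample) for tree in trees]


tree_1 = {"item": "age", "value": 30,
          "left": {"leaf": 0},
          "right": {"item": "insured_amount", "value": 500000,
                    "left": {"leaf": 1}, "right": {"leaf": 2}}}
tree_2 = {"item": "insured_amount", "value": 100000,
          "left": {"leaf": 0}, "right": {"leaf": 1}}

combo = feature_combination([tree_1, tree_2], {"age": 45, "insured_amount": 200000})
```

In practice each leaf index is often one-hot encoded before it is passed to a downstream model; the sketch stops at the raw leaf indices.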
步骤S112,根据所述多个风险特征组合对多个目标模型中进行训练,以构建保单核保风险评估模型,所述多个目标模型包括LR(loss function,逻辑回归)模型、FM(Factorization Machine,因子分解机)模型和深度网络神经模型。Step S112: Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model. The multiple target models include LR (loss function, logistic regression) model, FM (Factorization Machine, factorization machine) model and deep network neural model.
LR model: highly interpretable. Using the multiple risk feature combinations output by the GBDT model as the input of the LR model can also effectively improve its assessment performance.
FM model: using the multiple risk feature combinations output by the GBDT model as the input of the FM model makes it possible to better mine the correlations between risk feature items under highly sparse conditions, especially for feature crosses that do not appear in the training samples.
Deep neural network model: less interpretable than the LR model, but with the advantage of high assessment accuracy. Using the multiple risk feature combinations output by the GBDT model as the input of the deep neural network model can further improve the assessment accuracy.
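To make the FM model's handling of sparse feature crosses concrete, the following sketch scores one risk feature combination with a second-order factorization machine. It assumes the standard FM formulation (a bias, linear weights, and pairwise interactions modelled as dot products of latent vectors); the weights below are hypothetical placeholders rather than trained values, and the sigmoid mapping to a risk coefficient is likewise an assumption.

```python
import math


def fm_score(x, w0, w, v):
    """Second-order FM score: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j.

    The pairwise term uses the usual O(k * n) identity
    0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_risk(x, w0, w, v):
    """Squash the FM score into (0, 1) so it can be read as a risk coefficient."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, v)))


# Hypothetical parameters for a 5-dimensional one-hot input and 2 latent factors.
x = [0, 0, 1, 1, 0]
w0 = 0.1
w = [0.3, -0.2, 0.5, -0.1, 0.4]
v = [[0.1, 0.2], [0.0, -0.3], [0.5, -0.2], [0.4, 0.3], [-0.1, 0.1]]

score = fm_score(x, w0, w, v)
risk = fm_risk(x, w0, w, v)
```

Because every feature has its own latent vector, the model can still assign a weight to a pair of features that never occurred together in training, which is the property the text highlights.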
The deep neural network model may include a DNN, an ANN, or the like. Because a DNN is well suited to big data and distributed training, training a DNN is taken as the example below.
DNN training process: the input layer of the DNN receives the multiple risk feature combinations output by the GBDT model, and the output layer outputs the predicted risk coefficient. For each sample data set, once the risk feature combinations corresponding to that data set have been input into the DNN, the DNN outputs a corresponding predicted risk coefficient. If the probability that each predicted risk coefficient matches the sample label of the corresponding sample reaches a preset threshold, which may be set from empirical values, the DNN can be considered optimized.
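The acceptance test for the trained network can be sketched as follows. The text does not define how a predicted risk coefficient "matches" a label, so this sketch assumes binary labels, a hypothetical decision cutoff of 0.5, and reads the matching probability as the fraction of samples whose thresholded prediction equals the label.

```python
def label_agreement_rate(predicted, labels, cutoff=0.5):
    """Fraction of samples whose thresholded prediction equals the binary label."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(predicted, labels))
    return hits / len(labels)


def is_optimized(predicted, labels, preset_threshold=0.9, cutoff=0.5):
    """Accept the network once the agreement rate reaches the preset threshold."""
    return label_agreement_rate(predicted, labels, cutoff) >= preset_threshold


# Hypothetical predicted risk coefficients and sample labels.
predicted = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 1]
rate = label_agreement_rate(predicted, labels)  # 3 of 4 samples agree
```

In practice such a check would be run on held-out samples after each training round, with `preset_threshold` chosen empirically as the text suggests.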
Embodiment 2
Referring to FIG. 3, a schematic diagram of the program modules of Embodiment 2, the big data-based insurance policy underwriting model training system of the present application, is shown. In this embodiment, the big data-based insurance policy underwriting model training system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to carry out the present application and implement the big data-based insurance policy underwriting model training method described above. A program module, as used in the embodiments of the present application, is a series of computer-readable instruction segments capable of performing a specific function, and is better suited than the program itself to describing how the training system 20 executes in the storage medium. The functions of the program modules of this embodiment are described below:
The configuration module 200 is configured to preconfigure a set of risk feature items, the set including multiple risk feature items.
The acquisition module 202 is configured to obtain, based on the set of risk feature items, multiple sample data sets of multiple customers from a customer information database, each sample data set including multiple original sample features of the corresponding customer that correspond to the multiple risk feature items.
The filling module 204 is configured to fill the multiple original sample features in each sample data set into the fields of the corresponding risk feature items.
In an exemplary embodiment, the filling module 204 is further configured to: divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule; determine whether the sample data sets in the second group include one or more data-missing samples, where the sample data set of a data-missing sample includes one or more blank sample features, a blank sample feature being an original sample feature whose value for the corresponding risk feature item is null; and, if the second group includes one or more data-missing samples, select one or more original sample features from the sample data sets in the first group and fill them into the field positions corresponding to the blank sample features.
Dividing the multiple sample data sets into the first group and the second group according to a preset rule specifically includes: inputting the multiple sample data sets into a random forest classification model, which classifies the multiple samples corresponding to those data sets into first-class samples and second-class samples, where the first-class samples correspond to the first group of sample data sets and the second-class samples correspond to the second group of sample data sets.
Selecting one or more original sample features from the sample data sets in the first group and filling them into the field positions corresponding to the blank sample features specifically includes: constructing a KD tree from the samples in the first group of sample data sets; inputting the original sample features of the data-missing sample into a nearest neighbor search model; finding, through the nearest neighbor search model, the target sample nearest to the data-missing sample; and filling the target data in that target sample that corresponds to the blank sample feature into the corresponding field position. The KD tree of the nearest neighbor search model is the one constructed from the samples in the first group of sample data sets.
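The KD-tree filling step above can be sketched as follows. The text does not say how distance is measured when the query itself has blank features, so this sketch assumes squared Euclidean distance over the observed features only and searches both subtrees whenever the splitting feature is blank in the query. All names and the donor values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KDNode:
    point: List[float]
    axis: int
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kd_tree(points, depth=0):
    """Build a KD tree from complete samples, cycling through the feature axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kd_tree(points[:mid], depth + 1),
                  build_kd_tree(points[mid + 1:], depth + 1))


def sq_dist(query, point):
    """Squared distance over the features that are present (not None) in the query."""
    return sum((q - p) ** 2 for q, p in zip(query, point) if q is not None)


def nearest(node, query, best=None):
    """Return the stored point closest to the query, ignoring blank query features."""
    if node is None:
        return best
    if best is None or sq_dist(query, node.point) < sq_dist(query, best):
        best = node.point
    q = query[node.axis]
    if q is None:
        # The splitting feature is blank, so neither side can be pruned.
        best = nearest(node.left, query, best)
        return nearest(node.right, query, best)
    split = node.point[node.axis]
    near, far = (node.left, node.right) if q < split else (node.right, node.left)
    best = nearest(near, query, best)
    if (q - split) ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)  # the far side may still hold a closer point
    return best


def impute(query, tree):
    """Fill each blank feature of the query with the nearest donor's value."""
    donor = nearest(tree, query)
    return [d if q is None else q for q, d in zip(query, donor)]


# First-group samples (complete): hypothetical [age, bmi, past_claims] rows.
donors = [[25, 20.0, 0], [40, 30.0, 2], [60, 27.0, 5], [35, 24.0, 1]]
tree = build_kd_tree(donors)
filled = impute([41, None, 2], tree)
print(filled)  # [41, 30.0, 2]
```

Features on very different scales would normally be standardized before the tree is built, otherwise the feature with the largest numeric range dominates the distance.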
The analysis module 206 is configured to analyze the multiple original sample features corresponding to each risk feature item to obtain an information value for each risk feature item.
The screening module 208 is configured to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item.
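As an illustration of the analysis and screening modules, the sketch below computes an information value per risk feature item and keeps the items above a cutoff. It assumes the conventional weight-of-evidence definition, IV = Σ (good share − bad share) · ln(good share / bad share) over the bins of a feature. The additive smoothing term `eps`, the 0.02 cutoff, and the bin counts are assumptions made for the example, not values taken from this document.

```python
import math


def information_value(bins, eps=0.5):
    """IV of one risk feature item from per-bin (good_count, bad_count) pairs.

    eps is an assumed smoothing constant that keeps empty bins from producing log(0).
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        good_share = (good + eps) / (total_good + eps * len(bins))
        bad_share = (bad + eps) / (total_bad + eps * len(bins))
        iv += (good_share - bad_share) * math.log(good_share / bad_share)
    return iv


def select_target_items(iv_by_item, min_iv=0.02):
    """Keep the risk feature items whose information value reaches the cutoff."""
    return [item for item, iv in iv_by_item.items() if iv >= min_iv]


# Hypothetical binned counts of (low-risk, high-risk) customers per feature item.
iv_by_item = {
    "age": information_value([(90, 10), (10, 90)]),
    "zodiac_sign": information_value([(50, 50), (50, 50)]),
}
targets = select_target_items(iv_by_item)
print(targets)  # ['age']
```

A feature whose bins have the same mix of low-risk and high-risk customers scores an information value of zero and is dropped, while a feature that separates the two groups scores high and is kept.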
The risk feature combination output module 210 is configured to input the multiple original sample features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets.
The training module 212 is configured to train multiple target models on the multiple risk feature combinations to construct the insurance policy underwriting risk assessment model. The multiple target models may include a logistic regression model, a factorization machine model, and a deep neural network model.
Embodiment 3
FIG. 4 is a schematic diagram of the hardware architecture of the computer device of Embodiment 3 of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers). As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, a network interface 23, and the big data-based insurance policy underwriting model training system 20, which can communicate with one another through a system bus. Specifically:
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and so on. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 2. The memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer device 2, for example the program code of the big data-based insurance policy underwriting model training system 20 of Embodiment 2. The memory 21 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 runs the program code stored in the memory 21 or processes data, for example by running the big data-based insurance policy underwriting model training system 20 to implement the big data-based insurance policy underwriting model training method of Embodiment 1.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 connects the computer device 2 to an external terminal through a network and establishes a data transmission channel and a communication connection between them. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
Note that FIG. 4 shows only the computer device 2 with components 20-23. It should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
In this embodiment, the big data-based insurance policy underwriting model training system 20 stored in the memory 21 may further be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present application.
For example, FIG. 3 shows a schematic diagram of the program modules of Embodiment 2, which implements the big data-based insurance policy underwriting model training system 20. In that embodiment, the training system 20 may be divided into the configuration module 200, the acquisition module 202, the filling module 204, the analysis module 206, the screening module 208, the risk feature combination output module 210, and the training module 212. A program module, as used in the present application, is a series of computer-readable instruction segments capable of performing a specific function, and is better suited than the program to describing how the training system 20 executes in the computer device 2. The specific functions of the program modules 200-212 are described in detail in Embodiment 2 and are not repeated here.
Embodiment 4
This embodiment further provides a computer-readable storage medium, which may be non-volatile or volatile, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, a server, or an app store, on which computer-readable instructions are stored that implement the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment stores the big data-based insurance policy underwriting model training system 20, which, when executed by a processor, performs the following steps:
preconfiguring a set of risk feature items, the set including multiple risk feature items;
obtaining, based on the set of risk feature items, multiple sample data sets of multiple customers from a customer information database, each sample data set including multiple original sample features of the corresponding customer that correspond to the multiple risk feature items;
filling the multiple original sample features in each sample data set into the fields of the corresponding risk feature items;
analyzing the multiple original sample features corresponding to each risk feature item to obtain an information value for each risk feature item;
screening out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
inputting the multiple original sample features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
training multiple target models on the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
Embodiment 5
Referring to FIG. 5, a flowchart of the steps of the underwriting risk assessment method of Embodiment 5 of the present application is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.
Step S200: obtain a target data set of a target customer, the target data set including multiple risk features corresponding to multiple risk feature items.
The multiple risk features in the target data set may come from the forms filled in by the target customer, from the company's internal historical data on the target customer, or from a third-party database.
Step S202: determine whether the target data set of the target customer contains any blank risk features. If so, proceed to step S204; otherwise, proceed to step S208.
Step S204: find the target sample nearest to the target customer through the nearest neighbor search model, and fill the blank risk features of the target data set with the risk features in that target sample.
Step S206: input the filled target data set into the iterative decision tree model. Proceed to step S210.
Step S208: input the target data set obtained in step S200 into the iterative decision tree model. Proceed to step S210.
Step S210: output the corresponding risk feature combination through the iterative decision tree model.
Step S212: predict on the risk feature combination with the underwriting risk assessment model to obtain the risk coefficient of the target customer.
The underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model. Step S212 may further include: calculating the risk coefficient of the target customer from the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.
The calculation method can be customized. For example, the mean of the first, second, and third risk coefficients may be calculated and used as the risk coefficient of the target customer.
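The assessment flow of steps S200 to S212 can be summarized in a short sketch. The three callables passed in (`fill_blanks`, `to_combination`, and the entries of `models`) are hypothetical stand-ins for the nearest neighbor filling, the iterative decision tree model, and the three trained target models; the demo implementations below are placeholders so the example runs, not the models themselves.

```python
from statistics import mean


def assess_risk(target, fill_blanks, to_combination, models):
    """Return the target customer's risk coefficient as the mean of the model outputs."""
    if any(value is None for value in target.values()):   # S202: any blank risk feature?
        target = fill_blanks(target)                      # S204: nearest-neighbor filling
    combination = to_combination(target)                  # S206/S208 and S210: tree output
    return mean(model(combination) for model in models)   # S212: average the coefficients


def demo_fill_blanks(target):
    # Placeholder for S204: a real system would copy values from the nearest sample.
    return {key: 0.0 if value is None else value for key, value in target.items()}


def demo_to_combination(target):
    # Placeholder for the iterative decision tree model: a single threshold test.
    return [1 if target["age"] > 30 else 0]


# Placeholders for the LR, FM, and deep neural network models.
demo_models = [
    lambda c: 0.2 + 0.1 * c[0],
    lambda c: 0.4,
    lambda c: 0.6 - 0.1 * c[0],
]

risk = assess_risk({"age": 42, "bmi": None}, demo_fill_blanks,
                   demo_to_combination, demo_models)
```

Since the text leaves the calculation method open, a weighted mean or another aggregation could replace `mean` without changing the rest of the flow.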
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A big data-based method for training an insurance policy underwriting model, wherein the method comprises:
    preconfiguring a set of risk feature items, the set comprising multiple risk feature items;
    obtaining, based on the set of risk feature items, multiple sample data sets of multiple customers from a customer information database, each sample data set comprising multiple original sample features of the corresponding customer that correspond to the multiple risk feature items;
    filling the multiple original sample features in each sample data set into the fields of the corresponding risk feature items;
    analyzing the multiple original sample features corresponding to each risk feature item to obtain an information value for each risk feature item;
    screening out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
    inputting the multiple original sample features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
    training multiple target models on the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  2. The method for training an insurance policy underwriting model according to claim 1, wherein the step of filling the multiple original sample features in each sample data set into the fields of the corresponding risk feature items comprises:
    dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule;
    determining whether the sample data sets in the second group comprise one or more data-missing samples, wherein the sample data set of a data-missing sample comprises one or more blank sample features, a blank sample feature being an original sample feature whose value for the corresponding risk feature item is null;
    if the sample data sets in the second group comprise one or more data-missing samples, selecting one or more original sample features from the sample data sets in the first group and filling them into the field positions corresponding to the blank sample features.
  3. The method for training an insurance policy underwriting model according to claim 2, wherein the step of dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule comprises:
    inputting the multiple sample data sets into a random forest classification model, and classifying the multiple samples corresponding to the multiple sample data sets into first-class samples and second-class samples;
    wherein the first-class samples correspond to the first group of sample data sets, and the second-class samples correspond to the second group of sample data sets.
  4. The method for training an insurance policy underwriting model according to claim 2, wherein the step of selecting one or more original sample features from the sample data sets in the first group and filling them into the field positions corresponding to the blank sample features comprises:
    constructing a KD tree from the samples in the first group of sample data sets;
    inputting the original sample features corresponding to the data-missing sample into a nearest neighbor search model;
    finding, through the nearest neighbor search model, the target sample nearest to the data-missing sample;
    filling the target data in the target sample that corresponds to the blank sample feature into the corresponding field position;
    wherein the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
  5. The method for training an insurance policy underwriting model according to claim 1, wherein the multiple target models comprise a logistic regression model, a factorization machine model, and a deep neural network model.
  6. A big data-based insurance policy underwriting model training system, comprising:
    a configuration module, configured to preconfigure a set of risk feature items, the set comprising multiple risk feature items;
    an acquisition module, configured to obtain, based on the set of risk feature items, multiple sample data sets of multiple customers from a customer information database, each sample data set comprising multiple original sample features of the corresponding customer that correspond to the multiple risk feature items;
    a filling module, configured to fill the multiple original sample features in each sample data set into the fields of the corresponding risk feature items;
    an analysis module, configured to analyze the multiple original sample features corresponding to each risk feature item to obtain an information value for each risk feature item;
    a screening module, configured to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;
    a risk feature combination output module, configured to input the multiple original sample features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and
    a training module, configured to train multiple target models on the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
  7. The insurance policy underwriting model training system according to claim 6, wherein the filling module is further configured to:
    divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule;
    determine whether the sample data sets in the second group comprise one or more data-missing samples, wherein the sample data set of a data-missing sample comprises one or more blank sample features, a blank sample feature being an original sample feature whose value for the corresponding risk feature item is null;
    if the sample data sets in the second group comprise one or more data-missing samples, select one or more original sample features from the sample data sets in the first group and fill them into the field positions corresponding to the blank sample features.
  8. The insurance policy underwriting model training system according to claim 7, wherein the filling module is further configured to:
    input the multiple sample data sets into a random forest classification model, and classify the multiple samples corresponding to the multiple sample data sets into first-class samples and second-class samples;
    wherein the first-class samples correspond to the first group of sample data sets, and the second-class samples correspond to the second group of sample data sets.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the computer-readable instructions, when executed by the processor, implement the following steps:
    pre-configuring a risk feature item set, the risk feature item set comprising a plurality of risk feature items;
    based on the risk feature item set, obtaining a plurality of sample data sets of a plurality of customers from a customer information database, each sample data set comprising a plurality of sample original features of the corresponding customer that correspond to the plurality of risk feature items;
    filling the sample original features in each sample data set into the fields of the corresponding risk feature items;
    analyzing the sample original features corresponding to each risk feature item to obtain an information value of each risk feature item;
    selecting a plurality of target risk feature items from the plurality of risk feature items according to the information value of each risk feature item;
    inputting the sample original features corresponding to the target risk feature items in each sample data set into an iterative decision tree model, so as to output, through the iterative decision tree model, a plurality of risk feature combinations corresponding to the plurality of sample data sets; and
    training a plurality of target models according to the plurality of risk feature combinations to construct an insurance policy underwriting risk assessment model.
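As an illustration of the information-value screening step recited above, the following minimal Python sketch assumes that "information value" means the conventional weight-of-evidence IV computed against a binary label, and that each feature is already categorical or binned. The function names, the smoothing constant `eps`, and the 0.02 cutoff are illustrative assumptions and do not come from the claims.

```python
import math
from collections import Counter


def information_value(values, labels, eps=0.5):
    """Weight-of-evidence IV of one categorical feature against 0/1 labels."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    bins = set(values)
    iv = 0.0
    for b in bins:
        # Additive smoothing (assumed) avoids log(0) when a bin lacks one class.
        p = (pos[b] + eps) / (pos_total + eps * len(bins))
        n = (neg[b] + eps) / (neg_total + eps * len(bins))
        iv += (p - n) * math.log(p / n)
    return iv


def select_features(samples, labels, threshold=0.02):
    """Keep the feature names whose IV reaches the (assumed) threshold."""
    names = sorted(samples[0])
    return [
        name for name in names
        if information_value([s[name] for s in samples], labels) >= threshold
    ]


# Hypothetical toy data: "region" is constant, so it carries no information.
samples = [
    {"smoker": "y", "region": "a"},
    {"smoker": "y", "region": "a"},
    {"smoker": "n", "region": "a"},
    {"smoker": "n", "region": "a"},
]
labels = [1, 1, 0, 0]
selected = select_features(samples, labels)
```

In this toy data only `smoker` would be retained, because a constant feature has an IV of zero under this definition.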
  10. The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    dividing the plurality of sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule;
    determining whether the sample data sets in the second group of sample data sets include one or more data-missing samples, wherein the sample data set of a data-missing sample includes one or more sample blank features, a sample blank feature being a sample original feature whose value for the corresponding risk feature item is null;
    if the sample data sets in the second group of sample data sets include one or more data-missing samples, selecting one or more sample original features from the sample data sets in the first group of sample data sets and filling them into the field positions corresponding to the sample blank features.
  11. The computer device of claim 10, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    inputting the plurality of sample data sets into a random forest classification model, and classifying the samples corresponding to the plurality of sample data sets into first-type samples and second-type samples;
    wherein the first-type samples correspond to the first group of sample data sets, and the second-type samples correspond to the second group of sample data sets.
  12. The computer device of claim 11, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    constructing a KD tree from the samples in the first group of sample data sets;
    inputting the sample original features of the data-missing sample into a nearest neighbor search model;
    finding, through the nearest neighbor search model, the target sample nearest to the data-missing sample;
    filling the target data in the target sample that corresponds to the sample blank feature into the corresponding field position;
    wherein the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
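The KD-tree imputation steps recited above can be sketched in plain Python as follows. This is a hedged illustration, not the applicant's implementation: it assumes numeric features, squared Euclidean distance, and that the tree is built only over the fields the data-missing sample actually has. All function names are hypothetical.

```python
def build_kdtree(points, dims, depth=0):
    """Build a KD tree over complete samples, splitting only on `dims`."""
    if not points:
        return None
    axis = dims[depth % len(dims)]
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], dims, depth + 1),
        "right": build_kdtree(points[mid + 1:], dims, depth + 1),
    }


def sq_dist(a, b, dims):
    """Squared Euclidean distance restricted to the observed dimensions."""
    return sum((a[d] - b[d]) ** 2 for d in dims)


def nearest(node, query, dims, best=None):
    """Return the stored sample closest to `query` on `dims`."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], query, dims) < sq_dist(best, query, dims):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, dims, best)
    # Cross the splitting plane only if a closer sample could lie beyond it.
    if diff ** 2 < sq_dist(best, query, dims):
        best = nearest(far, query, dims, best)
    return best


def impute(sample, complete_samples):
    """Fill None fields of `sample` from its nearest complete sample."""
    dims = [i for i, v in enumerate(sample) if v is not None]
    if not dims:
        raise ValueError("sample has no observed fields to match on")
    tree = build_kdtree(complete_samples, dims)
    target = nearest(tree, sample, dims)
    return tuple(t if v is None else v for v, t in zip(sample, target))


# Hypothetical first-group samples: (age, smoker flag, insured amount).
complete = [(30.0, 1.0, 100.0), (60.0, 0.0, 300.0), (45.0, 1.0, 200.0)]
filled = impute((58.0, 0.0, None), complete)
```

Rebuilding the tree for each pattern of missing fields keeps the sketch short; a production system would more likely cache one tree per pattern or use a library index.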
  13. The computer device of claim 10, wherein the plurality of target models comprise a logistic regression model, a factorization machine model, and a deep neural network model.
  14. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions are executable by at least one processor to cause the at least one processor to perform the following steps:
    pre-configuring a risk feature item set, the risk feature item set comprising a plurality of risk feature items;
    based on the risk feature item set, obtaining a plurality of sample data sets of a plurality of customers from a customer information database, each sample data set comprising a plurality of sample original features of the corresponding customer that correspond to the plurality of risk feature items;
    filling the sample original features in each sample data set into the fields of the corresponding risk feature items;
    analyzing the sample original features corresponding to each risk feature item to obtain an information value of each risk feature item;
    selecting a plurality of target risk feature items from the plurality of risk feature items according to the information value of each risk feature item;
    inputting the sample original features corresponding to the target risk feature items in each sample data set into an iterative decision tree model, so as to output, through the iterative decision tree model, a plurality of risk feature combinations corresponding to the plurality of sample data sets; and
    training a plurality of target models according to the plurality of risk feature combinations to construct an insurance policy underwriting risk assessment model.
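One common way to obtain feature combinations from a boosted tree ensemble is to record which leaf each sample reaches in every tree and to one-hot encode those leaf indices for the downstream models. The sketch below assumes that this is what the risk feature combinations output by the iterative decision tree model look like; the claims do not spell out the encoding. The two hand-written trees and the feature names `age`, `smoker`, and `bmi` are hypothetical stand-ins for a trained ensemble.

```python
# Hypothetical, hand-written trees standing in for a trained ensemble.
TREES = [
    {
        "feature": "age", "threshold": 50,
        "left": {"leaf": 0},
        "right": {
            "feature": "smoker", "threshold": 0.5,
            "left": {"leaf": 1},
            "right": {"leaf": 2},
        },
    },
    {
        "feature": "bmi", "threshold": 28,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
]


def leaf_index(tree, sample):
    """Walk one tree and return the index of the leaf the sample lands in."""
    node = tree
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def count_leaves(tree):
    """Number of leaves in a tree, used to size its one-hot block."""
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


def encode(sample, trees=TREES):
    """Concatenate one one-hot block per tree; each 1 marks a root-to-leaf path."""
    vector = []
    for tree in trees:
        one_hot = [0] * count_leaves(tree)
        one_hot[leaf_index(tree, sample)] = 1
        vector.extend(one_hot)
    return vector


combo = encode({"age": 62, "smoker": 1, "bmi": 24})
```

Each root-to-leaf path is a conjunction of threshold tests (here, age above 50 and smoker), which is why a leaf index can serve as a combined feature.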
  15. The computer-readable storage medium of claim 14, wherein the computer-readable instructions are further executable by the at least one processor to cause the at least one processor to perform the following steps:
    dividing the plurality of sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule;
    determining whether the sample data sets in the second group of sample data sets include one or more data-missing samples, wherein the sample data set of a data-missing sample includes one or more sample blank features, a sample blank feature being a sample original feature whose value for the corresponding risk feature item is null;
    if the sample data sets in the second group of sample data sets include one or more data-missing samples, selecting one or more sample original features from the sample data sets in the first group of sample data sets and filling them into the field positions corresponding to the sample blank features.
  16. The computer-readable storage medium of claim 15, wherein the computer-readable instructions are further executable by the at least one processor to cause the at least one processor to perform the following steps:
    inputting the plurality of sample data sets into a random forest classification model, and classifying the samples corresponding to the plurality of sample data sets into first-type samples and second-type samples;
    wherein the first-type samples correspond to the first group of sample data sets, and the second-type samples correspond to the second group of sample data sets.
  17. The computer-readable storage medium of claim 15, wherein the computer-readable instructions are further executable by the at least one processor to cause the at least one processor to perform the following steps:
    constructing a KD tree from the samples in the first group of sample data sets;
    inputting the sample original features of the data-missing sample into a nearest neighbor search model;
    finding, through the nearest neighbor search model, the target sample nearest to the data-missing sample;
    filling the target data in the target sample that corresponds to the sample blank feature into the corresponding field position;
    wherein the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
  18. The computer-readable storage medium of claim 14, wherein the plurality of target models comprise a logistic regression model, a factorization machine model, and a deep neural network model.
  19. An underwriting risk assessment method, comprising the following steps:
    obtaining a target data set of a target customer, the target data set comprising a plurality of risk features corresponding to a plurality of risk feature items;
    determining whether the target data set of the target customer contains any blank risk feature;
    if there is a blank risk feature, finding the target sample nearest to the target customer through a nearest neighbor search model, and filling the blank risk feature of the target data set with the corresponding risk feature of the target sample;
    inputting the filled target data set into an iterative decision tree model;
    outputting a corresponding risk feature combination through the iterative decision tree model;
    predicting on the risk feature combination with an underwriting risk assessment model to obtain a risk coefficient of the target customer, the underwriting risk assessment model being trained by the big-data-based insurance policy underwriting model training method of claim 1.
  20. The underwriting risk assessment method of claim 19, wherein the underwriting risk assessment model comprises a logistic regression model, a factorization machine model, and a deep neural network model;
    and wherein the step of predicting on the risk feature combination with the underwriting risk assessment model to obtain the risk coefficient of the target customer comprises:
    calculating the risk coefficient of the target customer from a first risk coefficient output by the logistic regression model, a second risk coefficient output by the factorization machine model, and a third risk coefficient output by the deep neural network model.
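The claim does not state how the three coefficients are combined. As one possibility, a weighted average can be sketched as below; the equal default weights and the function name `combine_risk` are assumptions made purely for illustration.

```python
def combine_risk(lr_score, fm_score, dnn_score, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three model outputs (the combination rule is assumed)."""
    scores = (lr_score, fm_score, dnn_score)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Hypothetical outputs of the logistic regression, factorization machine,
# and deep neural network models for one target customer.
risk = combine_risk(0.2, 0.4, 0.6)
```

Unequal weights could be passed in if one model is trusted more than the others, for example after comparing their validation performance.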
PCT/CN2020/093039 2019-07-23 2020-05-28 Insurance policy underwriting model training method employing big data, and underwriting risk assessment method WO2021012783A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910665008.2 2019-07-23
CN201910665008.2A CN110516910B (en) 2019-07-23 2019-07-23 Insurance policy and insurance model training method and insurance risk assessment method based on big data

Publications (1)

Publication Number Publication Date
WO2021012783A1 (en) 2021-01-28

Family

ID=68623384

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093039 WO2021012783A1 (en) 2019-07-23 2020-05-28 Insurance policy underwriting model training method employing big data, and underwriting risk assessment method

Country Status (2)

Country Link
CN (1) CN110516910B (en)
WO (1) WO2021012783A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362179A (en) * 2021-06-30 2021-09-07 中国农业银行股份有限公司 Prediction method, device, equipment, storage medium and program product of transaction data
CN113807614A (en) * 2021-10-13 2021-12-17 航天信息股份有限公司 Enterprise tax risk prediction method
CN116664319A (en) * 2023-08-01 2023-08-29 北京力码科技有限公司 Financial policy classification system based on big data
CN117150369A (en) * 2023-10-30 2023-12-01 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516910B (en) * 2019-07-23 2023-05-26 平安科技(深圳)有限公司 Insurance policy and insurance model training method and insurance risk assessment method based on big data
CN111178498B (en) * 2019-12-09 2023-08-22 北京邮电大学 Stock fluctuation prediction method and device
CN111126476A (en) * 2019-12-19 2020-05-08 支付宝(杭州)信息技术有限公司 Homogeneous risk unit feature set generation method, device, equipment and medium
CN111553800B (en) * 2020-04-30 2023-08-25 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and storage medium
CN111652504A (en) * 2020-06-01 2020-09-11 泰康保险集团股份有限公司 Data processing apparatus
CN112199374B (en) * 2020-09-29 2023-12-05 中国平安人寿保险股份有限公司 Data feature mining method for data missing and related equipment thereof
CN112419076A (en) * 2020-11-27 2021-02-26 好人生(上海)健康科技有限公司 Health insurance underwriting system and method based on big data and merchant insurance cloud platform
CN112561714B (en) * 2020-12-16 2024-03-08 中国平安人寿保险股份有限公司 Nuclear protection risk prediction method and device based on NLP technology and related equipment
CN113469584B (en) * 2021-09-02 2021-11-16 云账户技术(天津)有限公司 Risk management method and device for business service operation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285075A (en) * 2017-07-19 2019-01-29 腾讯科技(深圳)有限公司 A kind of Claims Resolution methods of risk assessment, device and server
US20190043070A1 (en) * 2017-08-02 2019-02-07 Zestfinance, Inc. Systems and methods for providing machine learning model disparate impact information
CN109544166A (en) * 2018-11-05 2019-03-29 阿里巴巴集团控股有限公司 A kind of Risk Identification Method and device
CN110516910A (en) * 2019-07-23 2019-11-29 平安科技(深圳)有限公司 Declaration form core based on big data protects model training method and core protects methods of risk assessment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489499B2 (en) * 2010-01-13 2013-07-16 Corelogic Solutions, Llc System and method of detecting and assessing multiple types of risks related to mortgage lending
CN106600417A (en) * 2016-11-09 2017-04-26 前海企保科技(深圳)有限公司 Underwriting method and device of property insurance policies
CN108269012A (en) * 2018-01-12 2018-07-10 中国平安人寿保险股份有限公司 Construction method, device, storage medium and the terminal of risk score model
CN109978396A (en) * 2019-03-29 2019-07-05 深圳市人民医院 A kind of early screening system and method for risk case

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362179A (en) * 2021-06-30 2021-09-07 中国农业银行股份有限公司 Prediction method, device, equipment, storage medium and program product of transaction data
CN113362179B (en) * 2021-06-30 2024-01-30 中国农业银行股份有限公司 Method, apparatus, device, storage medium and program product for predicting transaction data
CN113807614A (en) * 2021-10-13 2021-12-17 航天信息股份有限公司 Enterprise tax risk prediction method
CN116664319A (en) * 2023-08-01 2023-08-29 北京力码科技有限公司 Financial policy classification system based on big data
CN117150369A (en) * 2023-10-30 2023-12-01 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment
CN117150369B (en) * 2023-10-30 2024-01-26 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment

Also Published As

Publication number Publication date
CN110516910B (en) 2023-05-26
CN110516910A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
WO2021012783A1 (en) Insurance policy underwriting model training method employing big data, and underwriting risk assessment method
US20240112209A1 (en) Systems and methods for providing machine learning model disparate impact information
US10937089B2 (en) Machine learning classification and prediction system
US10127477B2 (en) Distributed event prediction and machine learning object recognition system
Moffatt Hurdle models of loan default
WO2019047790A1 (en) Method and system for generating combined features of machine learning samples
US9390142B2 (en) Guided predictive analysis with the use of templates
WO2020211357A1 (en) Data association analysis method and apparatus, and computer device and storage medium
US11481603B1 (en) System for deep learning using knowledge graphs
CN109064343B (en) Risk model building method, risk matching device, risk model building equipment and risk matching medium
De Nicola et al. Evaluating Italian public hospital efficiency using bootstrap DEA and CART
CN112989059A (en) Method and device for identifying potential customer, equipment and readable computer storage medium
CN115630221A (en) Terminal application interface display data processing method and device and computer equipment
CN114495137B (en) Bill abnormity detection model generation method and bill abnormity detection method
US20230088044A1 (en) End-to-end prospecting platform utilizing natural language processing to reverse engineer client lists
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
CN114693409A (en) Product matching method, device, computer equipment, storage medium and program product
US20140324524A1 (en) Evolving a capped customer linkage model using genetic models
US20140324523A1 (en) Missing String Compensation In Capped Customer Linkage Model
CN112529319A (en) Grading method and device based on multi-dimensional features, computer equipment and storage medium
Wang et al. Clustered Coefficient Regression Models for Poisson Process with an Application to Seasonal Warranty Claim Data
Shahraz et al. Improving the quality of road injury statistics by using regression models to redistribute ill-defined events
CN115660722B (en) Prediction method and device for silver life customer conversion and electronic equipment
US20220318327A1 (en) Ranking similar users based on values and personal journeys
Ward et al. An improved comorbidity summary score for measuring disease burden and predicting mortality with applications to two national cohorts

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20844231

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20844231

Country of ref document: EP

Kind code of ref document: A1