WO2021012783A1

WO2021012783A1 - Insurance policy underwriting model training method employing big data, and underwriting risk assessment method

Info

Publication number: WO2021012783A1
Application number: PCT/CN2020/093039
Authority: WO
Inventors: 王进; 刘行行
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-07-23
Filing date: 2020-05-28
Publication date: 2021-01-28
Also published as: CN110516910B; CN110516910A

Abstract

An insurance policy underwriting model training method employing big data comprises: acquiring multiple sample data sets of multiple customers on the basis of a pre-configured risk feature item set, wherein each sample data set comprises multiple sample original features corresponding to a customer and multiple risk feature items; adding the multiple sample original features in each sample data set to a field of a corresponding risk feature item (S104); acquiring multiple target risk feature items from the multiple risk feature items by means of filtering and on the basis of an information value of each risk feature item (S108); inputting the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model so as to output, by means of the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets (S110); and training multiple target models according to the multiple risk feature combinations so as to construct an insurance policy underwriting risk assessment model (S112). The method has high assessment accuracy for underwriting risk assessment.

Description

Big data-based insurance policy underwriting model training method and underwriting risk assessment method

This application claims priority to Chinese patent application No. 201910665008.2, filed with the Chinese Patent Office on July 23, 2019 and entitled "Big data-based insurance policy underwriting model training method and underwriting risk assessment method", the entire content of which is incorporated herein by reference.

Technical field

The embodiments of the present application relate to the field of artificial intelligence, and in particular to a big data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium, and to an underwriting risk assessment method.

Background

As insurance awareness grows, commercial insurance has become an important part of the social security system. According to available reference data, some insurance institutions hold tens of millions of policies. After a policy is generated in the insurance system, it must be underwritten to determine whether the information in the policy meets the requirements for coverage. Underwriting today is generally manual: policies are reviewed by hand against risk control rules, product pricing information for different customer groups, and auxiliary information (physical examination, health survey, and financial survey information).

However, with the rapid development of big data mining, more and more reference data is used for underwriting. The inventors recognized that manual underwriting consumes a great deal of manpower, is inefficient, and makes it hard to use multi-dimensional data effectively, which lowers the accuracy of underwriting risk assessment. How to build data models from big data and underwrite insurance policies with those models is therefore one of the current research directions.

Summary of the invention

In view of this, the purpose of the embodiments of the present application is to provide a big data-based insurance policy underwriting model training method, system, computer device, and computer-readable storage medium that address the low underwriting risk accuracy of traditional data mining and data modeling.

In order to achieve the foregoing objectives, the embodiment of the present application provides a method for training an insurance policy underwriting model based on big data, including the following steps:

Pre-configure a risk feature item set that includes multiple risk feature items;

Based on the risk feature item set, acquire multiple sample data sets of multiple customers from the customer information database, where each sample data set includes multiple sample original features corresponding to the customer and to the multiple risk feature items;

Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

Screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and

Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.

In order to achieve the above objectives, the embodiment of the present application also provides a big data-based insurance policy underwriting model training system, including:

The configuration module is used to pre-configure a risk feature item set that includes multiple risk feature items;

The acquiring module is used to acquire, based on the risk feature item set, multiple sample data sets of multiple customers from the customer information database, where each sample data set includes multiple sample original features corresponding to the customer and to the multiple risk feature items;

The filling module is used to fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

The analysis module is used to analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

The screening module is used to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

The risk feature combination output module is used to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets; and

The training module is used to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.

In order to achieve the foregoing objective, an embodiment of the present application further provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:

In order to achieve the foregoing objective, an embodiment of the present application also provides a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the following steps:

In order to achieve the foregoing objectives, the embodiments of the present application also provide an underwriting risk assessment method, which includes the following steps:

Acquire a target data set of the target customer, the target data set including multiple risk features corresponding to multiple risk feature items;

Determine whether the target data set of the target customer contains blank risk features;

If it does, find the target sample closest to the target customer through the nearest neighbor search model, and fill the blank risk features of the target data set with the corresponding risk features of the target sample;

Input the filled target data set into the iterative decision tree model;

Output the corresponding risk feature combination through the iterative decision tree model; and

Predict on the risk feature combination with the underwriting risk assessment model to obtain the risk coefficient of the target customer, where the underwriting risk assessment model is trained by the big data-based insurance policy underwriting model training method described above.

Preferably, the underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model;

The step of predicting the risk feature combination according to the underwriting risk assessment model to obtain the risk coefficient of the target customer includes:

The risk coefficient of the target customer is calculated according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.
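The text does not specify how the three coefficients are combined. As one possible reading, the following sketch takes a weighted average. The function name `combine_risk_coefficients`, the weights, and the example scores are assumptions for illustration and do not come from the application.

```python
def combine_risk_coefficients(lr_score, fm_score, dnn_score, weights=(1/3, 1/3, 1/3)):
    """Weighted average of the three model outputs (the weighting scheme is an assumption)."""
    if len(weights) != 3 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be three numbers summing to 1")
    scores = (lr_score, fm_score, dnn_score)
    return sum(w * s for w, s in zip(weights, scores))

# First, second, and third risk coefficients for one hypothetical customer.
risk = combine_risk_coefficients(0.62, 0.70, 0.66, weights=(0.2, 0.3, 0.5))
print(round(risk, 3))
```

Other combination rules, such as taking the maximum or stacking a further model on the three outputs, would fit the same description equally well.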

In the big data-based insurance policy underwriting model training method, system, computer device, computer-readable storage medium, and underwriting risk assessment method provided by the embodiments of the present application, the iterative decision tree model outputs multiple risk feature combinations based on the sample data sets and the information value of each risk feature item, and these combinations are input into multiple target models to construct the insurance policy underwriting risk assessment model. The resulting model therefore combines the data evaluation advantages of the multiple target models and achieves high accuracy in underwriting risk assessment.

Description of the drawings

FIG. 1 is a flowchart of Embodiment 1 of the big data-based insurance policy underwriting model training method of this application.

FIG. 2 is a flowchart of step S104 in FIG. 1.

FIG. 3 is a schematic diagram of the program modules of Embodiment 2 of the big data-based insurance policy underwriting model training system of this application.

FIG. 4 is a schematic diagram of the hardware structure of Embodiment 3 of the computer device of this application.

FIG. 5 is a flowchart of Embodiment 5 of the underwriting risk assessment method of this application.

Detailed description

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the application, and not used to limit the application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be noted that descriptions such as "first" and "second" in this application are for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Features defined with "first" and "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with each other, provided that the combination can be implemented by a person of ordinary skill in the art. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as nonexistent and outside the scope of protection claimed by this application.

The following embodiments are described, by way of example, with the computer device 2 as the executing entity.

Embodiment 1

Referring to FIG. 1, a flowchart of the steps of the big data-based insurance policy underwriting model training method of Embodiment 1 of the present application is shown. It can be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.

Step S100: Pre-configure a set of risk feature items, and the set of risk feature items includes multiple risk feature items.

Exemplarily, the risk feature item set may include multiple subsets, such as a customer-related risk feature subset, an insurance policy risk feature subset, a salesperson risk feature subset, an associated information risk feature subset, an Internet risk feature subset, and so on. The customer-related risk feature subset may include basic customer information (gender, age, occupation, education, etc.), social security information, and the social relationship with the beneficiary. The insurance policy risk feature subset may include the insured amount, insurance type information, and so on. The salesperson risk feature subset may include basic information (the salesperson's gender, age, and years of experience), sales habits, commissions, product sales data, team members, attendance information, quality information, and so on. The associated information risk feature subset may include family information and so on. The Internet risk feature subset may include purchase behavior information, product-related information, and so on.

It should be noted that the multiple risk feature items in the above risk feature item set can be customized by the user, or can be obtained through an unsupervised neural network model analysis for feature classification.

Step S102: Acquire multiple sample data sets of multiple customers from the customer information database based on the risk feature item set, and each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items.

For example, M sample data sets corresponding to M customers are obtained, and each sample data set includes N sample original features corresponding to N risk feature items. The M sample data sets are:

$$
\begin{aligned}
A_1 &= (a_{11}, a_{12}, a_{13}, \ldots, a_{1N}) \\
A_2 &= (a_{21}, a_{22}, a_{23}, \ldots, a_{2N}) \\
&\;\;\vdots \\
A_M &= (a_{M1}, a_{M2}, a_{M3}, \ldots, a_{MN})
\end{aligned}
$$

Step S104: Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items.

The original features of the multiple samples corresponding to the multiple sample data sets may constitute N feature columns, for example:

Fill $a_{11}, a_{21}, a_{31}, \ldots, a_{M1}$ into the field of the corresponding field name to form one feature column; fill $a_{12}, a_{22}, a_{32}, \ldots, a_{M2}$ into the field of the corresponding field name to form another feature column; and so on, until $a_{1N}, a_{2N}, a_{3N}, \ldots, a_{MN}$ are filled into the field of the corresponding field name to form the N-th feature column.
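The column-filling step above is a transpose of the row-wise sample data sets. A minimal sketch follows, assuming the samples are held as tuples; the field names and values are hypothetical, since the application does not name concrete columns.

```python
# Hypothetical field names; the application does not name concrete columns.
FIELD_NAMES = ["age", "occupation", "insured_amount"]

def to_feature_columns(samples, field_names):
    """Turn M row-wise sample data sets into N feature columns keyed by field name."""
    for row in samples:
        if len(row) != len(field_names):
            raise ValueError("each sample needs one value per risk feature item")
    # zip(*samples) transposes rows (customers) into columns (risk feature items).
    return {name: list(col) for name, col in zip(field_names, zip(*samples))}

samples = [
    (34, "teacher", 200000),  # A_1
    (51, "driver", 500000),   # A_2
    (27, None, 100000),       # A_3, with a blank occupation
]
columns = to_feature_columns(samples, FIELD_NAMES)
print(columns["age"])  # [34, 51, 27]
```

Blank values are kept as `None` here so that a later step can detect and fill them.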

In an exemplary embodiment, as shown in FIG. 2, the step S104 further includes:

S104a: Divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule.

Exemplarily, the multiple sample data sets are input into an RF (Random Forest) classification model, which classifies the corresponding samples into first-type samples and second-type samples. The first-type samples are samples of existing customers, and the second-type samples are samples of new customers. The multiple sample data sets are thereby divided into the first group of sample data sets, corresponding to the first-type samples, and the second group of sample data sets, corresponding to the second-type samples. The risk features in the sample data sets of existing customers are relatively complete, whereas those of new customers may be partly missing.

S104b: Determine whether the multiple sample data sets in the second group of sample data sets include one or more data-missing samples. The sample data set of a data-missing sample includes one or more sample blank features, where a sample blank feature means that the sample original feature corresponding to a given risk feature item is null.

S104c: If so, select one or more sample original features from the multiple sample data sets in the first group of sample data sets and fill them into the field positions corresponding to the sample blank features.

Exemplarily, a KD tree is constructed from the samples in the first group of sample data sets, the sample original features of the data-missing sample are input into the nearest neighbor search model (KD_tree, k-dimensional tree), the KD_tree model finds the target sample closest to the data-missing sample, and the target data in the target sample that corresponds to the sample blank feature is filled into the corresponding field position.

The sample data set of the data-missing sample and the multiple sample data sets in the first group of sample data sets are input into the random forest classification model to obtain, for each sample, the node number of the leaf node that the sample reaches in the decision tree, where each leaf node has a unique node number.

A KD tree is then constructed from the leaf node numbers of the samples in the first group of sample data sets, the leaf node number of the data-missing sample is input into the KD_tree model, and the KD_tree model finds the target sample closest to the data-missing sample.

This embodiment can solve the problem of blank original features of some samples.
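The nearest-neighbor filling of S104c can be sketched as follows. For brevity this uses a brute-force linear scan in place of the KD tree; it returns the same neighbor, only less efficiently on large data. The helper names, the use of `None` for a blank feature, and the numeric pre-scaled feature values are assumptions.

```python
import math

def nearest_complete_sample(target, complete_samples):
    """Return the complete sample closest to `target`, comparing only the
    positions where `target` has a value (None marks a blank feature)."""
    known = [i for i, value in enumerate(target) if value is not None]

    def distance(sample):
        return math.sqrt(sum((sample[i] - target[i]) ** 2 for i in known))

    return min(complete_samples, key=distance)

def fill_blank_features(target, complete_samples):
    """Copy the nearest neighbour's values into the blank positions of `target`."""
    neighbour = nearest_complete_sample(target, complete_samples)
    return [n if t is None else t for t, n in zip(target, neighbour)]

# First group: existing customers with complete, pre-scaled numeric features.
first_group = [
    [0.30, 0.10, 0.80],
    [0.70, 0.90, 0.20],
    [0.35, 0.15, 0.60],
]
# Second group: a new customer whose third feature is blank.
new_customer = [0.33, 0.14, None]
filled = fill_blank_features(new_customer, first_group)
print(filled)  # [0.33, 0.14, 0.6]
```

The distance is computed only over the features the new customer actually has, so the blank position never enters the comparison.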

Step S106: Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value (IV) of each risk feature item.

The information value is used to indicate the degree of influence of the corresponding risk feature on the prediction accuracy of the risk assessment.

Take the risk feature item "customer age", which corresponds to the feature column $(a_{11}, a_{21}, a_{31}, \ldots, a_{M1})$, as an example of calculating the information value:

$$IV_i = WoE_i \times (Py_i - Pn_i)$$

WoE (Weight of Evidence) is a way of encoding discretized values; $WoE_i$ expresses the influence on the underwriting risk assessment result when the variable falls in the $i$-th interval. After the feature column is discretized into age ranges, $Py_i$ is the ratio of the number of high-risk policies in the $i$-th age range to the number of high-risk policies across all age ranges, and $Pn_i$ is the ratio of the number of non-high-risk policies in the $i$-th age range to the number of non-high-risk policies across all age ranges. $IV_i$ is the information value of the $i$-th age range, and IV is the information value of the feature column over all age ranges.
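The calculation can be sketched in a few lines. The text does not spell out the WoE formula or how the per-range values are aggregated, so this sketch assumes the commonly used definition WoE_i = ln(Py_i / Pn_i) and takes the column IV as the sum of the per-range IV_i. The counts are made up.

```python
import math

def information_value(bins):
    """bins: one (high_risk_count, non_high_risk_count) pair per discretized range.
    Returns (list of per-range IV_i, total IV of the feature column).
    Assumes every range contains at least one policy of each kind."""
    total_y = sum(y for y, _ in bins)
    total_n = sum(n for _, n in bins)
    iv_per_bin = []
    for y, n in bins:
        py = y / total_y           # share of all high-risk policies in this range
        pn = n / total_n           # share of all non-high-risk policies in this range
        woe = math.log(py / pn)    # assumed standard WoE definition
        iv_per_bin.append(woe * (py - pn))
    return iv_per_bin, sum(iv_per_bin)

# Hypothetical counts for three age ranges: (high-risk, non-high-risk).
age_bins = [(10, 90), (20, 80), (70, 30)]
iv_bins, iv_total = information_value(age_bins)
print([round(v, 4) for v in iv_bins], round(iv_total, 4))
```

Under this definition WoE_i and (Py_i - Pn_i) always share the same sign, so every IV_i is non-negative and a larger total indicates a more discriminating feature.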

Step S108: According to the information value of each risk feature item, multiple target risk feature items are screened out from the multiple risk feature items.

Step S106 amounts to a univariate analysis. Based on its results, a subset of risk feature items (the multiple target risk feature items) is screened out from the multiple risk feature items, and the risk features corresponding to the selected items are input into the iterative decision tree model. This step eliminates uninformative feature items and thereby reduces the training burden.
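A minimal sketch of the screening step, assuming a simple cutoff on the information value. The text does not state the selection rule; the `min_iv` value of 0.02 and the feature names below are assumptions.

```python
def select_target_features(iv_by_feature, min_iv=0.02):
    """Keep the risk feature items whose IV is at least `min_iv` (assumed cutoff)."""
    return [name for name, iv in iv_by_feature.items() if iv >= min_iv]

# Hypothetical information values for three risk feature items.
iv_by_feature = {"age": 0.31, "gender": 0.004, "insured_amount": 0.12}
targets = select_target_features(iv_by_feature)
print(targets)  # ['age', 'insured_amount']
```

Keeping the top k items by IV instead of applying a fixed cutoff would be an equally plausible reading.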

Step S110: Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into an iterative decision tree model (Gradient Boosting Decision Tree, GBDT), so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets.

The iterative decision tree model may be a GBDT model, which is based on an iterative decision tree algorithm and is composed of multiple decision trees. Each tree is fitted to the residuals left by the preceding trees, so each tree depends on the results of the trees before it, and the order of the decision trees must be preserved. The multiple decision trees in the GBDT model classify the multiple sample data sets, which reveals associations among the risk features in those data sets; the associated features are then combined to obtain risk feature combinations.

Specifically, each decision tree in the GBDT model includes a root node, intermediate nodes, and leaf nodes. The root node and each intermediate node have a corresponding risk feature item (such as age) and risk feature value (such as 30 years). If a customer sample is older than 30, the sample is assigned to the right child of the node; otherwise it is assigned to the left child. The same procedure applies at lower nodes until the sample falls into a leaf node. The risk feature combination corresponding to the sample is obtained from the leaf nodes that the sample falls into on each decision tree. When there are multiple samples, multiple corresponding risk feature combinations are obtained.
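The traversal described above can be illustrated with two hand-written toy trees standing in for trained GBDT trees. The tree layout, thresholds, and the one-hot encoding of the reached leaves are assumptions; the text only says the combination is obtained from the leaf nodes the sample falls into.

```python
# Internal node: (feature_name, threshold, left_subtree, right_subtree).
# Leaf: an integer leaf id, unique within its tree.
# Both trees are hand-written stand-ins for trained GBDT trees.
TREE_1 = ("age", 30, 0, ("insured_amount", 300000, 1, 2))
TREE_2 = ("insured_amount", 150000, 0, 1)
TREES = [(TREE_1, 3), (TREE_2, 2)]  # (tree, number of leaves)

def leaf_id(tree, sample):
    """Walk from the root: go right if the value exceeds the threshold, else left."""
    node = tree
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = right if sample[feature] > threshold else left
    return node

def risk_feature_combination(sample, trees):
    """One-hot encode the leaf reached in each tree and concatenate the encodings."""
    encoded = []
    for tree, n_leaves in trees:
        one_hot = [0] * n_leaves
        one_hot[leaf_id(tree, sample)] = 1
        encoded.extend(one_hot)
    return encoded

customer = {"age": 42, "insured_amount": 200000}
combo = risk_feature_combination(customer, TREES)
print(combo)  # [0, 1, 0, 0, 1]
```

Each position in the resulting vector corresponds to one root-to-leaf path, that is, one conjunction of feature conditions, which is why the leaf pattern can serve as a combined feature for the downstream models.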

Step S112: Train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model. The multiple target models include an LR (Logistic Regression) model, an FM (Factorization Machine) model, and a deep neural network model.

LR model: it is highly interpretable. Using the multiple risk feature combinations output by the GBDT model as the input of the LR model can also effectively improve the evaluation performance of the LR model.

FM model: with the multiple risk feature combinations output by the GBDT model as its input, the FM model can better mine the correlations between risk feature items under highly sparse conditions, especially when a given feature interaction does not occur in the training samples.

Deep neural network model: it is less interpretable than the LR model but has the advantage of high evaluation accuracy. Using the multiple risk feature combinations output by the GBDT model as its input can improve the evaluation accuracy.
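To make the FM model's handling of sparse feature interactions concrete, the sketch below evaluates the standard second-order factorization machine score, in which the weight of each feature pair is the inner product of two latent vectors. The application does not give the FM formula, so the form used here and all weights are assumptions for illustration.

```python
def fm_score(x, w0, w, v):
    """Second-order FM score in O(k * n) time.
    x: features, w0: bias, w: linear weights, v: one latent vector per feature."""
    n, k = len(x), len(v[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_score_naive(x, w0, w, v):
    """Same score via the explicit double loop over feature pairs, for checking."""
    n = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return linear + pairwise

# A sparse one-hot style input, such as a leaf encoding from the GBDT step.
x = [0, 1, 0, 0, 1]
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3, 0.4]
v = [[0.1, 0.1], [0.5, -0.2], [0.0, 0.3], [0.2, 0.2], [0.1, 0.3]]
score = fm_score(x, w0, w, v)
print(round(score, 4))
```

Because every pair weight is derived from per-feature latent vectors, a pair that never co-occurs in training still receives an interaction weight, which matches the sparse-data advantage described above.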

The deep neural network model may be a DNN or an ANN. A DNN is suitable for big data and distributed training, so DNN training is used as the example below.

DNN training process: the input layer of the DNN receives the multiple risk feature combinations output by the GBDT model, and the output layer outputs predicted risk coefficients. For each sample data set, once its risk feature combinations are input to the DNN, the DNN outputs the corresponding predicted risk coefficient. If the probability that the predicted risk coefficients match the sample labels of the corresponding samples reaches a preset threshold, which may be set according to empirical values, the DNN is considered optimized.
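The stopping check can be sketched as follows, reading the "probability" of a match as the share of samples whose thresholded prediction equals the label. That reading, the 0.5 cutoff, the 0/1 labels, and the function names are all assumptions.

```python
def match_rate(predicted, labels, cutoff=0.5):
    """Share of samples whose thresholded prediction equals the 0/1 sample label."""
    if not labels or len(predicted) != len(labels):
        raise ValueError("predicted and labels must be non-empty and equally long")
    hits = sum(int(p >= cutoff) == y for p, y in zip(predicted, labels))
    return hits / len(labels)

def is_optimized(predicted, labels, preset_threshold=0.9):
    """True once the match rate reaches the empirically chosen threshold."""
    return match_rate(predicted, labels) >= preset_threshold

# Hypothetical predicted risk coefficients and sample labels.
predicted = [0.91, 0.12, 0.77, 0.40, 0.05]
labels = [1, 0, 1, 1, 0]
rate = match_rate(predicted, labels)  # 4 of the 5 predictions match
print(rate, is_optimized(predicted, labels))
```

In practice the check would be run on held-out samples after each training round rather than on the training data itself.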

Embodiment 2

Please refer to FIG. 3, which shows a schematic diagram of the program modules of Embodiment 2 of the big data-based insurance policy underwriting model training system of the present application. In this embodiment, the big data-based insurance policy underwriting model training system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present application and implement the above big data-based insurance policy underwriting model training method. A program module, as referred to in the embodiments of the present application, is a series of computer-readable instruction segments capable of completing a specific function, and is better suited than the program itself to describing the execution of the training system 20 in the storage medium. The functions of each program module in this embodiment are described below:

The configuration module 200 is configured to pre-configure a set of risk feature items, and the set of risk feature items includes multiple risk feature items.

The acquiring module 202 is configured to acquire multiple sample data sets of multiple customers from the customer information database based on the risk feature item set, where each sample data set includes multiple sample original features corresponding to the customer and to the multiple risk feature items.

The filling module 204 is used for filling the original features of multiple samples in each sample data set into the fields of the corresponding risk feature items.

In an exemplary embodiment, the filling module 204 is further configured to: divide the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule; determine whether the multiple sample data sets in the second group of sample data sets include one or more data-missing samples, where the sample data set of a data-missing sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a risk feature item is null; and, if the second group of sample data sets includes one or more data-missing samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets and fill them into the field positions corresponding to the sample blank features.

Wherein, dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule specifically includes the following: inputting the multiple sample data sets into a random forest classification model, The multiple samples corresponding to the multiple sample data sets are classified into a first type of sample and a second type of sample; wherein the first type of sample corresponds to the first group of sample data set, and the second type of sample Corresponds to the second set of sample data sets.

Selecting one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill the field positions corresponding to the sample blank features specifically includes: inputting the sample original features of the data-missing sample into the nearest neighbor search model; finding, through the nearest neighbor search model, the target sample closest to the data-missing sample; and filling the target data in the target sample that corresponds to the sample blank feature into the corresponding field position; where the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.

The analysis module 206 is configured to analyze the original characteristics of multiple samples corresponding to each risk feature item to obtain the information value of each risk feature item.

The screening module 208 is configured to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item.

The risk feature combination output module 210 is configured to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so that the iterative decision tree model outputs multiple risk feature combinations corresponding to the multiple sample data sets.

The training module 212 is configured to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model. The multiple target models may include a logistic regression model, a factorization machine model, and a deep neural network model.

Embodiment 3

Refer to FIG. 4, which is a schematic diagram of the hardware architecture of the computer device of Embodiment 3 of the present application. In this embodiment, the computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including a standalone server or a server cluster composed of multiple servers). As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, a network interface 23, and the big data-based insurance policy underwriting model training system 20, which can communicate with each other through a system bus. Specifically:

In this embodiment, the memory 21 includes at least one type of computer-readable storage medium. The readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the computer device 2, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card. The memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer device 2, for example the program code of the big data-based insurance policy underwriting model training system 20 of Embodiment 2. In addition, the memory 21 can be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the big data-based insurance policy underwriting model training system 20 so as to implement the big data-based insurance policy underwriting model training method of Embodiment 1.

The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It should be pointed out that FIG. 4 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.

In this embodiment, the big data-based insurance policy underwriting model training system 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to implement the present application.

For example, FIG. 3 shows a schematic diagram of the program modules of the second embodiment of the big data-based insurance policy underwriting model training system 20. In this embodiment, the big data-based insurance policy underwriting model training system 20 can be divided into a configuration module 200, an acquisition module 202, a filling module 204, an analysis module 206, a screening module 208, a risk feature combination output module 210, and a training module 212. The program module referred to in this application is a series of computer-readable instruction segments that can complete specific functions, and is more suitable than a program for describing the execution process of the big data-based insurance policy underwriting model training system 20 in the computer device 2. The specific functions of the program modules 200-212 have been described in detail in the second embodiment and will not be repeated here.

Example four

This embodiment also provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, app stores, etc., on which computer-readable instructions are stored, and the corresponding functions are realized when the instructions are executed by a processor. The computer-readable storage medium of this embodiment is used to store the big data-based insurance policy underwriting model training system 20, which, when executed by a processor, implements the steps of the big data-based insurance policy underwriting model training method of the first embodiment.

Example five

Referring to FIG. 5, a flow chart of the steps of the underwriting risk assessment method of the fifth embodiment of the present application is shown. It can be understood that the flowchart in this method embodiment is not intended to limit the order in which the steps are executed. The details are as follows.

Step S200: Obtain a target data set of a target customer, where the target data set includes multiple risk features corresponding to multiple risk feature items.

The multiple risk features in the target data set may come from the content the target customer filled in on a form, from the company's internal historical data for the target customer, or from a third-party database.

Step S202: Determine whether the target data set of the target customer contains blank risk features. If yes, go to step S204; otherwise, go to step S208.

Step S204: Find the target sample closest to the target customer through the nearest neighbor search model, and fill the blank risk features of the target data set with the corresponding risk features of the target sample.

Step S206: Input the filled target data set into the iterative decision tree model. Proceed to step S210.

Step S208: Input the target data set obtained in step S200 into the iterative decision tree model. Proceed to step S210.

Step S210: Output the corresponding risk feature combination through the iterative decision tree model.

Step S212: Perform prediction on the risk feature combination using the underwriting risk assessment model to obtain the risk coefficient of the target customer.
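The branch in steps S202 to S208 and the hand-off to steps S210 and S212 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the nearest neighbor search is done here by a brute-force Euclidean scan rather than a KD tree, blank risk features are represented as None, the stored samples are assumed to be complete and numeric, and the names Record, nearest_sample, fill_blanks, and assess, as well as the tree_model and risk_model callables, are hypothetical placeholders for the trained models.

```python
import math
from typing import Callable, Dict, Optional, Sequence

# Hypothetical record type: risk feature item -> value, None marks a blank feature.
Record = Dict[str, Optional[float]]


def nearest_sample(target: Record, samples: Sequence[Record]) -> Record:
    """Return the stored sample closest to the target.

    Assumption: distance is Euclidean over the target's non-blank items only,
    computed by a brute-force scan (a KD tree would serve the same purpose).
    """
    if not samples:
        raise ValueError("at least one stored sample is required")
    known = [item for item, value in target.items() if value is not None]

    def distance(sample: Record) -> float:
        return math.sqrt(sum((target[k] - sample[k]) ** 2 for k in known))

    return min(samples, key=distance)


def fill_blanks(target: Record, samples: Sequence[Record]) -> Record:
    """Steps S202/S204: fill blank features from the nearest sample, if any are blank."""
    if all(value is not None for value in target.values()):
        # No blank features: the data set is used as obtained in step S200 (step S208).
        return dict(target)
    donor = nearest_sample(target, samples)
    return {k: (donor[k] if v is None else v) for k, v in target.items()}


def assess(
    target: Record,
    samples: Sequence[Record],
    tree_model: Callable[[Record], object],
    risk_model: Callable[[object], float],
) -> float:
    """Run steps S202 to S212 with caller-supplied (hypothetical) model callables."""
    filled = fill_blanks(target, samples)   # S202, S204, S206 or S208
    combination = tree_model(filled)        # S210: risk feature combination
    return risk_model(combination)          # S212: risk coefficient
```

For instance, with the hypothetical feature items age, bmi, and claims, a target with age 58, a blank bmi, and claims 2, compared against the samples (30, 22, 0) and (60, 31, 3), would take bmi 31 from the second sample, because that sample is the closer one on the two known items.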

The underwriting risk assessment model includes a logistic regression model, a factorization machine model and a deep neural network model. Step S212 may further include: calculating the risk coefficient of the target customer according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.

The calculation method can be customized. For example, the average value of the first, second, and third risk coefficients can be calculated, and the average value can be used as the risk coefficient of the target customer.
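As one way to read the customizable calculation above, the sketch below averages the three risk coefficients and optionally applies weights. The function name combine_risk_coefficients and the weights parameter are assumptions introduced for illustration; the text above only gives the plain average as an example.

```python
from statistics import fmean
from typing import Optional, Sequence


def combine_risk_coefficients(
    coefficients: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine per-model risk coefficients into one risk coefficient.

    With no weights this is the plain average described in the text.
    The weighted variant is a hypothetical customization, not taken from the source.
    """
    if not coefficients:
        raise ValueError("at least one risk coefficient is required")
    if weights is None:
        return fmean(coefficients)
    if len(weights) != len(coefficients):
        raise ValueError("weights and coefficients must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * c for w, c in zip(weights, coefficients)) / total
```

Here the three inputs would be, in order, the outputs of the logistic regression model, the factorization machine model, and the deep neural network model.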

The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform. Of course, they can also be implemented by hardware, but in many cases the former is the better implementation.

The above are only preferred embodiments of this application and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

1. A method for training an insurance policy underwriting model based on big data, wherein the method includes:

Pre-configure a set of risk feature items, the set of risk feature items includes multiple risk feature items;

Based on the risk feature item set, multiple sample data sets of multiple customers are obtained from the customer information database, each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;

Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

Filter out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets; and

Training is performed on multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
2. The method for training an insurance policy underwriting model according to claim 1, wherein the step of filling the multiple sample original features in each sample data set into the fields of the corresponding risk feature items comprises:

Dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to preset rules;

Determine whether the multiple sample data sets in the second group of sample data sets include one or more missing data samples, where the sample data set of a missing data sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a risk feature item is a null value;

If the multiple sample data sets in the second group of sample data sets include one or more missing data samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill in the field positions corresponding to the sample blank features.
3. The insurance policy underwriting model training method of claim 2, wherein the step of dividing the plurality of sample data sets into a first group of sample data sets and a second group of sample data sets according to a preset rule comprises:

Inputting the multiple sample data sets into a random forest classification model, and classifying multiple samples corresponding to the multiple sample data sets into a first-type sample and a second-type sample;

Wherein, the first type of sample corresponds to the first group of sample data set, and the second type of sample corresponds to the second group of sample data set.
4. The method for training an insurance policy underwriting model according to claim 2, wherein the step of selecting one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill in the field positions corresponding to the sample blank features comprises:

Construct a KD tree from the samples in the first group of sample data sets;

Input the sample original features corresponding to the missing data sample into the nearest neighbor search model;

Find the target sample closest to the missing data sample through the nearest neighbor search model;

Fill the target data in the target sample that corresponds to the sample blank feature into the corresponding field position;

Wherein, the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
5. The method for training an insurance policy underwriting model according to claim 1, wherein the multiple target models include a logistic regression model, a factorization machine model, and a deep neural network model.
6. An insurance policy underwriting model training system based on big data, including:

The configuration module is used to pre-configure a set of risk feature items, and the set of risk feature items includes multiple risk feature items;

The acquisition module is used to acquire multiple sample data sets of multiple customers from the customer information database based on the risk feature item set, each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;

The filling module is used to fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

The analysis module is used to analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

The screening module is used to screen out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

The risk feature combination output module is used to input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets; and

The training module is used to train multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
7. The insurance policy underwriting model training system according to claim 6, wherein the filling module is further used for:

Dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to preset rules;

Determine whether the multiple sample data sets in the second group of sample data sets include one or more missing data samples, where the sample data set of a missing data sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a risk feature item is a null value;

If the multiple sample data sets in the second group of sample data sets include one or more missing data samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill in the field positions corresponding to the sample blank features.
8. The insurance policy underwriting model training system according to claim 7, wherein the filling module is further used for:

Inputting the multiple sample data sets into a random forest classification model, and classifying multiple samples corresponding to the multiple sample data sets into a first-type sample and a second-type sample;

Wherein, the first type of sample corresponds to the first group of sample data set, and the second type of sample corresponds to the second group of sample data set.
9. A computer device, wherein the computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the computer-readable instructions implement the following steps when executed by the processor:

Pre-configure a set of risk feature items, the set of risk feature items includes multiple risk feature items;

Based on the risk feature item set, multiple sample data sets of multiple customers are obtained from the customer information database, each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;

Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

Filter out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets; and

Training is performed on multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
10. The computer device of claim 9, wherein the computer-readable instructions further implement the following steps when executed by the processor:

Dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to preset rules;

Determine whether the multiple sample data sets in the second group of sample data sets include one or more missing data samples, where the sample data set of a missing data sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a risk feature item is a null value;

If the multiple sample data sets in the second group of sample data sets include one or more missing data samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill in the field positions corresponding to the sample blank features.
11. The computer device of claim 10, wherein the computer-readable instructions further implement the following steps when executed by the processor:

Inputting the multiple sample data sets into a random forest classification model, and classifying multiple samples corresponding to the multiple sample data sets into a first-type sample and a second-type sample;

Wherein, the first type of sample corresponds to the first group of sample data set, and the second type of sample corresponds to the second group of sample data set.
12. The computer device of claim 11, wherein the computer-readable instructions further implement the following steps when executed by the processor:

Construct a KD tree from the samples in the first group of sample data sets;

Input the sample original features corresponding to the missing data sample into the nearest neighbor search model;

Find the target sample closest to the missing data sample through the nearest neighbor search model;

Fill the target data in the target sample that corresponds to the sample blank feature into the corresponding field position;

Wherein, the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
13. The computer device of claim 10, wherein the plurality of target models include a logistic regression model, a factorization machine model, and a deep neural network model.
14. A computer-readable storage medium, wherein computer-readable instructions are stored in the computer-readable storage medium, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the following steps:

Pre-configure a set of risk feature items, the set of risk feature items includes multiple risk feature items;

Based on the risk feature item set, multiple sample data sets of multiple customers are obtained from the customer information database, each sample data set includes multiple sample original features corresponding to the customer and the multiple risk feature items;

Fill the multiple sample original features in each sample data set into the fields of the corresponding risk feature items;

Analyze the multiple sample original features corresponding to each risk feature item to obtain the information value of each risk feature item;

Filter out multiple target risk feature items from the multiple risk feature items according to the information value of each risk feature item;

Input the multiple sample original features corresponding to the multiple target risk feature items in each sample data set into the iterative decision tree model, so as to output, through the iterative decision tree model, multiple risk feature combinations corresponding to the multiple sample data sets; and

Training is performed on multiple target models according to the multiple risk feature combinations to construct an insurance policy underwriting risk assessment model.
15. The computer-readable storage medium according to claim 14, wherein the computer-readable instructions are also executable by at least one processor, so that the at least one processor executes the following steps:

Dividing the multiple sample data sets into a first group of sample data sets and a second group of sample data sets according to preset rules;

Determine whether the multiple sample data sets in the second group of sample data sets include one or more missing data samples, where the sample data set of a missing data sample includes one or more sample blank features, and a sample blank feature means that the sample original feature corresponding to a risk feature item is a null value;

If the multiple sample data sets in the second group of sample data sets include one or more missing data samples, select one or more sample original features from the multiple sample data sets in the first group of sample data sets to fill in the field positions corresponding to the sample blank features.
16. The computer-readable storage medium of claim 15, wherein the computer-readable instructions are also executable by at least one processor, so that the at least one processor executes the following steps:

Inputting the multiple sample data sets into a random forest classification model, and classifying multiple samples corresponding to the multiple sample data sets into a first-type sample and a second-type sample;

Wherein, the first type of sample corresponds to the first group of sample data set, and the second type of sample corresponds to the second group of sample data set.
17. The computer-readable storage medium of claim 15, wherein the computer-readable instructions are also executable by at least one processor, so that the at least one processor executes the following steps:

Construct a KD tree from the samples in the first group of sample data sets;

Input the sample original features corresponding to the missing data sample into the nearest neighbor search model;

Find the target sample closest to the missing data sample through the nearest neighbor search model;

Fill the target data in the target sample that corresponds to the sample blank feature into the corresponding field position;

Wherein, the KD tree of the nearest neighbor search model is constructed from the samples in the first group of sample data sets.
18. The computer-readable storage medium of claim 14, wherein the plurality of target models include a logistic regression model, a factorization machine model, and a deep neural network model.
19. An underwriting risk assessment method, including the following steps:

Acquiring a target data set of a target customer, the target data set including multiple risk features corresponding to multiple risk feature items;

Determine whether there are blank risk features in the target data set of the target customer;

If there are blank risk features, find the target sample closest to the target customer through the nearest neighbor search model, so as to fill the blank risk features of the target data set with the risk features in the target sample;

Input the filled target data set into the iterative decision tree model;

Output the corresponding risk feature combination through the iterative decision tree model;

Perform prediction on the risk feature combination using the underwriting risk assessment model to obtain the risk coefficient of the target customer, where the underwriting risk assessment model is trained according to the big data-based insurance policy underwriting model training method of claim 1.
20. The underwriting risk assessment method of claim 19, wherein the underwriting risk assessment model includes a logistic regression model, a factorization machine model, and a deep neural network model;

The step of predicting the risk feature combination according to the underwriting risk assessment model to obtain the risk coefficient of the target customer includes:

The risk coefficient of the target customer is calculated according to the first risk coefficient output by the logistic regression model, the second risk coefficient output by the factorization machine model, and the third risk coefficient output by the deep neural network model.