CN113159175B - Data prediction method, device, equipment and storage medium - Google Patents

Data prediction method, device, equipment and storage medium

Info

Publication number
CN113159175B
CN113159175B (application CN202110432977.0A)
Authority
CN
China
Prior art keywords
data
node
verification
model
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110432977.0A
Other languages
Chinese (zh)
Other versions
CN113159175A (en)
Inventor
叶向荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110432977.0A priority Critical patent/CN113159175B/en
Publication of CN113159175A publication Critical patent/CN113159175A/en
Application granted granted Critical
Publication of CN113159175B publication Critical patent/CN113159175B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention relates to intelligent decision-making technology and discloses a data prediction method, device, equipment and storage medium. The method comprises the following steps: collecting sample data of a verification decision tree, wherein the sample data comprises request data, verification result data and processing result data; extracting a plurality of characteristic data from the request data; calculating an important coefficient of each characteristic data in each verification decision tree; selecting target characteristic data from the plurality of characteristic data according to the important coefficients; inputting the target characteristic data, the verification result data and the processing result data into the verification decision tree for training; fitting the plurality of trained verification decision trees to obtain a weak model sequence; combining the weak model sequence to obtain an aggregate model; obtaining request data to be verified; and predicting the request data to be verified with the aggregate model to obtain verification result data and processing result data corresponding to the request data to be verified. The invention improves the accuracy with which a model predicts verification data.

Description

Data prediction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data prediction method, apparatus, device, and storage medium.
Background
When a user handles a certain service, the user generally needs to submit request data, and the service organization then verifies the request data. Data verification refers to the process of checking the authenticity and/or legal compliance of the request data; after verification, the service organization provides the relevant services to the user according to the verification result. This scenario is widely applied across industries, such as insurance claims in the insurance industry and loan approval in the financial industry.
At present, data verification is performed either manually or with artificial intelligence. Manual verification easily misses information and is time-consuming, labor-intensive, highly subjective and of low accuracy. Artificial intelligence can solve the problems of manual auditing, but if the extracted verification elements are incomplete, or a large number of element features are removed for dimension reduction, the accuracy of the model is low.
Disclosure of Invention
The invention aims to provide a data prediction method, a data prediction device, data prediction equipment and a storage medium that improve the accuracy with which a model predicts verification data.
The invention provides a data prediction method, which comprises the following steps:
collecting sample data of a verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
extracting a plurality of characteristic data in the request data, and calculating an important coefficient of each characteristic data in each verification decision tree according to a preset coefficient calculation method;
selecting target characteristic data from the plurality of characteristic data according to the important coefficient;
inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
combining the weak model sequences according to a preset combination mode to obtain a set model;
and obtaining request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified.
The invention also provides a data prediction device, which comprises:
the acquisition module is used for acquiring sample data of the verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
the computing module is used for extracting a plurality of characteristic data in the request data and computing important coefficients of each characteristic data in each verification decision tree according to a preset coefficient computing method;
the selecting module is used for selecting target characteristic data from the plurality of characteristic data according to the important coefficient;
the training module is used for inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
the combination module is used for combining the weak model sequences according to a preset combination mode to obtain a set model;
the prediction module is used for obtaining the request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified.
The invention also provides a computer device comprising a memory and a processor connected to the memory, wherein the memory stores a computer program that can run on the processor, and the processor implements the above data prediction steps when executing the computer program.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data prediction method described above.
The beneficial effects of the invention are as follows: during training of the decision trees, the method calculates the important coefficient of each characteristic data in each verification decision tree, first selects a predetermined number of characteristic data, and then selects high-quality characteristic data from them as training data, according to the important coefficients, to train the verification decision trees. Because the characteristic data is not pruned, the resulting aggregate model predicts verification data and processing result data with high accuracy and good generalization.
Drawings
FIG. 1 is a flowchart of a data prediction method according to a first embodiment of the present invention;
FIG. 2 is a detailed flowchart illustrating the step of calculating the importance coefficients of each feature data in each verification decision tree according to the predetermined coefficient calculation method in FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of a data prediction apparatus according to the present invention;
fig. 4 is a schematic diagram of a hardware architecture of an embodiment of a computer device according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions of "first", "second", etc. in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when the combined technical solutions contradict each other or cannot be realized, the combination should be considered not to exist and not to fall within the scope of protection claimed by the present invention.
Referring to fig. 1, a flow chart of an embodiment of a data prediction method according to the present invention is shown. The data prediction method comprises the following steps:
step S1, collecting sample data of a verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
wherein collecting sample data of the verification decision tree comprises: randomly sampling, with replacement, from a predetermined data set, a quantity of data equal to the size of the data set, to serve as sample data of the verification decision tree.
When the sample data of each verification decision tree is collected, if the size of the data set is N, then for each verification decision tree N training samples are randomly drawn with replacement from the data set to form the training set of that verification decision tree. Each training set is used to train one verification decision tree model, so k training sets yield k verification decision tree models.
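As an illustration of this bootstrap sampling step (not the patented implementation itself), the following Python sketch draws N rows with replacement for each of k verification decision trees; the function name bootstrap_training_sets and the use of a NumPy array as the data set are assumptions made for the example.

```python
import numpy as np

def bootstrap_training_sets(dataset, k, seed=0):
    """Draw, for each of k verification decision trees, N samples with replacement."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    training_sets = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)      # N row indices, drawn with replacement
        training_sets.append(dataset[idx])
    return training_sets
```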
In a preferred embodiment of the present invention, the request data is, for example, policy information; the verification result data is the underwriting result obtained by verifying the policy information according to a predetermined verification decision; and the processing result data is the data of the corresponding claim settlement performed according to that result. When underwriting is performed, the result is constrained by the various indexes of the policy information filled in at application time and falls into one of three outcomes: pass, fail (an issuance exists), or refuse. A response model for the claim result is generated under these three underwriting results, with the underwriting decision acting as one of the influencing factors. Sample data collection specifically comprises:
1. Policy information for the same product is acquired from the entry database in which the online system records insurance applications, with application dates within a certain range; considering that a policy may have been amended, the latest policy data is used for every policy meeting the conditions. The stored quantitative indexes are collected as characteristic data, including the ages and birthplaces of the applicant and the insured, the type of the insured subject matter, the insurance type, liability, sum insured, premium, rate, insurance term and similar information.
2. The underwriting data of the policy information is then retrieved, including whether the policy passed, failed or was refused. The policy information acquired in step 1 is labeled to form the correspondence between application and underwriting, so that the set of quantitative application indexes of each policy corresponds to one underwriting state. Considering changes to amended data, statistical deviation needs to be reduced, so the issued state of the policy is excluded and collection focuses on the three verification states of the latest policy: pass, fail and refuse.
3. For the policies whose underwriting data has been retrieved, the claim data, such as loss type, payout amount and number of claim occurrences, is then retrieved. The sample data from step 2 is labeled so as to generate data representing the correspondence of the three stages of application, underwriting and claim settlement. For each policy one can thus obtain, in order, the quantitative indexes from the start of the application, the underwriting data given by the underwriter, and the final claim settlement result, i.e. the influence of underwriting on the policy under the determined underwriting conditions.
All the acquired sample data is stored in a newly established correspondence data table.
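A minimal sketch of how such a correspondence table might be assembled is given below; the tables, column names (policy_id, uw_result, payout and so on) and sample values are illustrative assumptions rather than the schema used by the patent, and pandas merely stands in for whatever database tooling the online system actually uses.

```python
import pandas as pd

# Hypothetical, minimal stand-ins for the policy, underwriting and claim tables.
policies = pd.DataFrame({"policy_id": [1, 2, 3], "age": [34, 51, 28],
                         "premium": [1200.0, 800.0, 450.0], "term_years": [10, 5, 1]})
underwriting = pd.DataFrame({"policy_id": [1, 2, 3],
                             "uw_result": ["pass", "fail", "refuse"]})
claims = pd.DataFrame({"policy_id": [1], "loss_type": ["fire"],
                       "payout": [5000.0], "claim_count": [1]})

# Label each policy's quantitative indexes with its underwriting state and claim outcome,
# producing the application -> underwriting -> claim-settlement correspondence table.
samples = (policies
           .merge(underwriting, on="policy_id", how="inner")
           .merge(claims, on="policy_id", how="left"))
print(samples)
```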
S2, extracting a plurality of characteristic data in the request data, and calculating an important coefficient of each characteristic data in each verification decision tree according to a preset coefficient calculation method;
In order to ensure that important features are not lost during feature screening for the verification decision tree, the importance of the feature data needs to be analyzed so that highly important feature data can be selected and some redundant features removed.
In one embodiment, as shown in fig. 2, the step of calculating the important coefficient of each feature data in each verification decision tree according to a predetermined coefficient calculation method specifically includes:
Step S21, calculating, with the Gini coefficient calculation formula, the variation mean of each feature data at node n of the verification decision tree, the variation mean of the node before node n branches, and the variation mean of the node after node n branches:
G_n = Σ_{k=1}^{K} μ_nk (1 − μ_nk)
where the Gini coefficient is denoted G and the sequence of feature data is X_1, X_2, …, X_i. The Gini coefficient formula gives the variation mean G_n of the i-th feature data split at node n of the verification decision tree model, K is the number of classes, and μ_nk is the proportion of class-k sample data in node n. The variation mean of the node before node n branches and the variation mean of the node after node n branches are calculated with the same formula.
Step S22, inputting the variation mean of node n, the variation mean of the node before node n branches and the variation mean of the node after node n branches into a predetermined first formula to obtain the important coefficient of the feature data at node n, where the first formula is:
W_in = G_n − G_p − G_q
where W_in is the important coefficient of feature data X_i at node n of the verification decision tree, i is the sequence number of feature data X_i in the feature sequence, G_n is the variation mean of node n, G_p is the variation mean of node p before node n branches, and G_q is the variation mean of node q after node n branches.
Step S23, inputting the important coefficient of the node n into a preset second formula for calculation to obtain the important coefficient of the characteristic data in the verification decision tree, wherein the important coefficient is used as a basis for selecting the characteristic data subsequently, and the second formula is as follows:
R = W_in / Σ_{j=1}^{c} W_j
where R is the important coefficient of feature data X_i in the verification decision tree, W_in is the important coefficient of feature data X_i at node n of the verification decision tree, W_j is the important coefficient of feature data X_i at node n of the j-th verification decision tree, c is the number of verification decision trees, and j is the sequence number of a verification decision tree.
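The three quantities above can be illustrated with the following sketch. The Gini-style variation mean and the reading of the second formula (normalising the node-n importance by its sum over all c trees) are assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def gini(labels):
    """Variation mean of a node, taken here as the Gini value
    G = sum_k mu_k * (1 - mu_k), where mu_k is the class-k proportion in the node."""
    _, counts = np.unique(labels, return_counts=True)
    mu = counts / counts.sum()
    return float(np.sum(mu * (1.0 - mu)))

def node_importance(g_n, g_p, g_q):
    """First formula: W_in = G_n - G_p - G_q."""
    return g_n - g_p - g_q

def tree_importance(w_in, w_all_trees):
    """Second formula as read here (an assumption): the node-n importance of the feature
    normalised by the same feature's node-n importance summed over all c trees."""
    return w_in / float(np.sum(w_all_trees))
```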
S3, selecting target characteristic data from a plurality of characteristic data according to the important coefficient;
If the dimension of the feature data of each sample is M, a predetermined constant m < M is specified and, at each split of the verification decision tree, m features are randomly selected from the M features; the optimal feature data is then chosen from those m features, i.e., features are selected as target feature data in descending order of the important coefficient R. In this process each tree grows to the greatest possible extent and no feature data is completely excluded, i.e., no pruning is performed.
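A sketch of this per-split feature selection, under the assumption that an important coefficient R is already available for every one of the M features; the function name select_split_features is hypothetical.

```python
import numpy as np

def select_split_features(importance_r, m, rng=None):
    """Pick m of the M features at random (m < M), then order the candidates by
    their important coefficient R, largest first, as the split's target features."""
    rng = rng or np.random.default_rng()
    M = len(importance_r)
    subset = rng.choice(M, size=m, replace=False)          # random feature subset
    order = np.argsort(np.asarray(importance_r)[subset])[::-1]
    return subset[order]                                   # sorted by R, descending
```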
S4, inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
The target feature data, the verification result data and the processing result data are input into the corresponding verification decision trees for training. In one embodiment, a vector sequence S_i (i = 1, 2, …, k) is established from the target feature data, the verification result data and the processing result data, where i is the vector number and k is the number of vectors in the sequence. Each S_i (i = 1, 2, …, k) is used to train an individual verification decision tree model u(X, S_i), i = 1, 2, …, k, where X in the model is the verification decision variable and serves as the independent variable; after k fittings the weak model sequence {u_1(X), u_2(X), …, u_k(X)} is obtained. The weak models are a characteristic of the random forest model: because the random forest uses multiple decision trees to predict individually, the combination of these trees' predictions jointly determines the final result.
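For illustration, the sketch below fits one scikit-learn DecisionTreeClassifier per training vector sequence; this is an assumed stand-in for the patent's verification decision tree, and packing each S_i as a (features, labels) pair is likewise an assumption.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_weak_models(vector_sequences):
    """Fit one verification decision tree u_i(X) per training vector sequence S_i,
    giving the weak model sequence {u_1(X), ..., u_k(X)}."""
    weak_models = []
    for X_i, y_i in vector_sequences:        # S_i assumed to pair features with labels
        tree = DecisionTreeClassifier()      # grown fully: no pruning of feature splits
        weak_models.append(tree.fit(X_i, y_i))
    return weak_models
```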
S5, combining the weak model sequences according to a preset combination mode to obtain a set model;
The weak model sequence {u_1(X), u_2(X), …, u_k(X)} is combined in a predetermined manner, which specifically comprises: inputting the weak model sequence into a maximum-value function for operation and establishing the aggregate model. The aggregate model is a classification model, and the maximum-value function is used for the combination so that the final classification result is selected from the multiple classes:
U(X) = arg max_Z Σ_{i=1}^{k} L(u_i(X) = Z)
where U(X) is the aggregate model, X is the predetermined verification decision, u_i(X) is the i-th weak model in the weak model sequence, i is the sequence number of the weak model, k is the length of the weak model sequence, Z is the response variable, L is the collective indication (indicator) function, and arg max is the maximum-value function.
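A sketch of this maximum-value (majority vote) combination, assuming the weak models expose a scikit-learn-style predict method; the function name aggregate_predict is hypothetical.

```python
import numpy as np

def aggregate_predict(weak_models, X):
    """U(X) = arg max_Z sum_i L(u_i(X) = Z): each weak model votes and the class Z
    with the most votes becomes the aggregate model's prediction for each sample."""
    votes = np.stack([model.predict(X) for model in weak_models])   # shape (k, n_samples)
    predictions = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```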
And S6, obtaining request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified.
In this embodiment the aggregate model is used to predict underwriting and claim settlement for policies, making full use of the large amount of data stored across the whole flow of the online product. A prediction aggregate model is trained and used to quickly give underwriting indicators and obtain decision suggestions, saving the time spent auditing data while avoiding the problems of manual underwriting, such as partial omission of information and the intrusion of subjective factors. The algorithm of this embodiment ensures the comprehensiveness of feature data collection as far as possible; the aggregate model has high accuracy and good generalization, does not require GPU training, and meets the high timeliness requirement of intelligent verification.
As can be seen from the above description, during training of the decision trees this embodiment calculates the important coefficient of each feature data in each verification decision tree, first selects a predetermined number of features and then, in descending order of the important coefficients, selects high-quality feature data from them as training data for the verification decision trees. Because the feature data is not pruned, the resulting aggregate model predicts the verification data and the processing result data with high accuracy and good generalization.
In an embodiment, on the basis of the above embodiment, before the step S6, the method further includes the following steps:
adaptively adjusting the number of the verification decision trees and the maximum tree depth in the set model by adopting a verification curve, and adjusting the number of sample data of each verification decision tree in the set model by adopting a learning curve;
and testing the adjusted aggregate model, and if the accuracy rate of the adjusted aggregate model obtained by testing is greater than or equal to the preset accuracy rate, using the adjusted aggregate model for prediction.
After the aggregate model is obtained, it may be under-fitted or over-fitted. The classification effect of the aggregate model can then be evaluated with a verification curve, which essentially shows the influence of a hyperparameter on the training score and the verification score, so that the optimal parameter is obtained. Specifically, this includes adaptively adjusting the number of verification decision trees and the maximum tree depth in the aggregate model with the verification curve, and adjusting the training set size of each verification decision tree with a learning curve to obtain the optimal training set size and improve the generalization of the aggregate model.
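As an illustration, scikit-learn's validation_curve and learning_curve can play the roles of the verification curve and learning curve described here; the RandomForestClassifier and the synthetic data below are assumed stand-ins for the aggregate model and the policy samples, not the patented implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve, learning_curve

# Synthetic stand-in for the policy feature matrix and underwriting labels.
X_train, y_train = make_classification(n_samples=600, n_features=12, n_classes=3,
                                       n_informative=6, random_state=0)

# Verification (validation) curve over the number of trees; max tree depth can be swept likewise.
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_train, y_train,
    param_name="n_estimators", param_range=[50, 100, 200], cv=3)

# Learning curve over the amount of sample data used for training.
sizes, lc_train, lc_valid = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0), X_train, y_train,
    train_sizes=[0.4, 0.6, 0.8, 1.0], cv=3)
```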
Further, only part of the data is used when training the verification decision trees and the remaining data is not used. Policy data from the remaining data can be supplied as parameters and, under the corresponding verification decision, predicted with the aggregate model to obtain the corresponding verification result and claim settlement result. If the predicted verification and claim settlement results are close to the actual ones and the accuracy of the overall claim settlement results reaches a preset threshold (for example 85%), the aggregate model is valid and can subsequently be applied to the policies to be verified.
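A minimal sketch of this acceptance check; the held-out labels and predictions are fabricated placeholders, and 0.85 simply mirrors the example threshold mentioned above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical held-out underwriting results and the aggregate model's predictions for them.
y_holdout = np.array(["pass", "fail", "pass", "refuse", "pass"])
y_pred    = np.array(["pass", "fail", "pass", "pass",   "pass"])

if accuracy_score(y_holdout, y_pred) >= 0.85:
    print("Aggregate model accepted for predicting policies to be verified")
else:
    print("Aggregate model rejected; retune and retrain")
```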
In one embodiment, the present invention provides a data prediction device, where the data prediction device corresponds to the method in the above embodiment one by one. As shown in fig. 3, the data prediction apparatus includes:
the acquisition module 101 is configured to acquire sample data of a verification decision tree, where the sample data includes request data, verification result data obtained by verifying the request data, and processing result data obtained by performing corresponding post-processing on the verification result data;
a calculating module 102, configured to extract a plurality of feature data in the request data, and calculate an important coefficient of each feature data in each verification decision tree according to a predetermined coefficient calculating method;
a selecting module 103, configured to select target feature data from a plurality of feature data according to the importance coefficient;
the training module 104 is configured to input the target feature data, the verification result data, and the processing result data into a verification decision tree for training, and fit a plurality of trained verification decision trees to obtain a weak model sequence;
a combining module 105, configured to combine the weak model sequences in a predetermined combination manner to obtain a set model;
and the prediction module 106 is configured to obtain the request data to be verified, and predict the request data to be verified by using the set model to obtain verification result data and processing result data corresponding to the request data to be verified.
For specific limitations of the data prediction apparatus, reference may be made to the limitations of the data prediction method above, which are not repeated here. The modules in the above data prediction apparatus may be implemented in whole or in part by software, hardware or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor of the computer device, or stored in software in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance. The computer device may be a PC (Personal Computer), a smart phone, a tablet computer, a single network server, a server group formed by a plurality of network servers, or a cloud based on cloud computing, where cloud computing is a kind of distributed computing in which a super virtual computer is formed by a group of loosely coupled computers.
As shown in fig. 4, the computer device may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus, the memory 11 storing a computer program executable on the processor 12. It should be noted that FIG. 4 only shows a computer device having components 11-13, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
Wherein the memory 11 may be non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others. In this embodiment, the readable storage medium of the memory 11 is typically used for storing an operating system and various application software installed on a computer device, for example, for storing program codes of a computer program in an embodiment of the present invention. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip for executing program code stored in the memory 11 or for processing data, such as executing a computer program or the like.
The network interface 13 may comprise a standard wireless network interface, a wired network interface, which network interface 13 is typically used to establish communication connections between the computer device and other electronic devices.
The computer program is stored in the memory 11 and comprises at least one computer readable instruction stored in the memory 11, the at least one computer readable instruction being executable by the processor 12 to implement the method of embodiments of the present application, comprising:
collecting sample data of a verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
wherein collecting sample data for each of the verification decision trees comprises: randomly sampling, with replacement, from the predetermined data set, a quantity of data equal to the size of the data set, to serve as the sample data of each verification decision tree.
When the sample data of each verification decision tree is collected, if the size of the data set is N, then for each verification decision tree N training samples are randomly drawn with replacement from the data set to form the training set of that verification decision tree. Each training set is used to train one verification decision tree model, so k training sets yield k verification decision tree models.
In a preferred embodiment of the present invention, the request data is, for example, policy information; the verification result data is the underwriting result obtained by verifying the policy information according to a predetermined verification decision; and the processing result data is the data of the corresponding claim settlement performed according to that result. When underwriting is performed, the result is constrained by the various indexes of the policy information filled in at application time and falls into one of three outcomes: pass, fail (an issuance exists), or refuse. A response model for the claim result is generated under these three underwriting results, with the underwriting decision acting as one of the influencing factors. Sample data collection specifically comprises:
1. Policy information for the same product is acquired from the entry database in which the online system records insurance applications, with application dates within a certain range; considering that a policy may have been amended, the latest policy data is used for every policy meeting the conditions. The stored quantitative indexes are collected as characteristic data, including the ages and birthplaces of the applicant and the insured, the type of the insured subject matter, the insurance type, liability, sum insured, premium, rate, insurance term and similar information.
2. The underwriting data of the policy information is then retrieved, including whether the policy passed, failed or was refused. The policy information acquired in step 1 is labeled to form the correspondence between application and underwriting, so that the set of quantitative application indexes of each policy corresponds to one underwriting state. Considering changes to amended data, statistical deviation needs to be reduced, so the issued state of the policy is excluded and collection focuses on the three verification states of the latest policy: pass, fail and refuse.
3. For the policies whose underwriting data has been retrieved, the claim data, such as loss type, payout amount and number of claim occurrences, is then retrieved. The sample data from step 2 is labeled so as to generate data representing the correspondence of the three stages of application, underwriting and claim settlement. For each policy one can thus obtain, in order, the quantitative indexes from the start of the application, the underwriting data given by the underwriter, and the final claim settlement result, i.e. the influence of underwriting on the policy under the determined underwriting conditions.
All the acquired sample data is stored in a newly established correspondence data table.
Extracting a plurality of characteristic data in the request data, and calculating an important coefficient of each characteristic data in each verification decision tree according to a preset coefficient calculation method;
In order to ensure that important features are not lost during feature screening for the verification decision tree, the importance of the feature data needs to be analyzed so that highly important feature data can be selected and some redundant features removed.
In one embodiment, the step of calculating the importance coefficient of each feature data in each verification decision tree according to a predetermined coefficient calculation method specifically includes:
calculating, with the Gini coefficient calculation formula, the variation mean of each characteristic data at node n of the verification decision tree, the variation mean of the node before node n branches, and the variation mean of the node after node n branches:
G_n = Σ_{k=1}^{K} μ_nk (1 − μ_nk)
where the Gini coefficient is denoted G and the sequence of characteristic data is X_1, X_2, …, X_i. The Gini coefficient formula gives the variation mean G_n of the i-th characteristic data split at node n of the verification decision tree model, K is the number of classes, and μ_nk is the proportion of class-k sample data in node n.
Inputting the variation mean of node n, the variation mean of the node before node n branches and the variation mean of the node after node n branches into a predetermined first formula to obtain the important coefficient of the characteristic data at node n, where the first formula is:
W_in = G_n − G_p − G_q
where W_in is the important coefficient of characteristic data X_i at node n of the verification decision tree, i is the sequence number of characteristic data X_i in the feature sequence, G_n is the variation mean of node n, G_p is the variation mean of node p before node n branches, and G_q is the variation mean of node q after node n branches.
Inputting the important coefficient of the node n into a preset second formula to calculate, so as to obtain the important coefficient of the characteristic data in the verification decision tree, wherein the important coefficient is used as a basis for selecting the characteristic data subsequently, and the second formula is as follows:
R = W_in / Σ_{j=1}^{c} W_j
where R is the important coefficient of characteristic data X_i in the verification decision tree, W_in is the important coefficient of characteristic data X_i at node n of the verification decision tree, W_j is the important coefficient of characteristic data X_i at node n of the j-th verification decision tree, c is the number of verification decision trees, and j is the sequence number of a verification decision tree.
Selecting target characteristic data from a plurality of characteristic data according to the important coefficient;
If the dimension of the feature data of each sample is M, a predetermined constant m < M is specified and, at each split of the verification decision tree, m features are randomly selected from the M features; the optimal feature data is then chosen from those m features, i.e., features are selected as target feature data in descending order of the important coefficient R. In this process each tree grows to the greatest possible extent and no feature data is completely excluded, i.e., no pruning is performed.
Inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
Then, the target feature data, the verification result data and the processing result data are input into the corresponding verification decision trees for training. In one embodiment, a vector sequence S_i (i = 1, 2, …, k) is established from the target feature data, the verification result data and the processing result data, where i is the vector number and k is the number of vectors in the sequence. Each S_i (i = 1, 2, …, k) is used to train an individual verification decision tree model u(X, S_i), i = 1, 2, …, k, where X in the model is the verification decision variable and serves as the independent variable; after k fittings the weak model sequence {u_1(X), u_2(X), …, u_k(X)} is obtained. The weak models are a characteristic of the random forest model: because the random forest uses multiple decision trees to predict individually, the combination of these trees' predictions jointly determines the final result.
Combining the weak model sequences according to a preset combination mode to obtain a set model;
The weak model sequence {u_1(X), u_2(X), …, u_k(X)} is combined in a predetermined manner, which specifically comprises: inputting the weak model sequence into a maximum-value function for operation and establishing the aggregate model. The aggregate model is a classification model, and the maximum-value function is used for the combination so that the final classification result is selected from the multiple classes:
U(X) = arg max_Z Σ_{i=1}^{k} L(u_i(X) = Z)
where U(X) is the aggregate model, X is the predetermined verification decision, u_i(X) is the i-th weak model in the weak model sequence, i is the sequence number of the weak model, k is the length of the weak model sequence, Z is the response variable, L is the collective indication (indicator) function, and arg max is the maximum-value function.
And obtaining request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified.
In this embodiment the aggregate model is used to predict underwriting and claim settlement for policies, making full use of the large amount of data stored across the whole flow of the online product. A prediction aggregate model is trained and used to quickly give underwriting indicators and obtain decision suggestions, saving the time spent auditing data while avoiding the problems of manual underwriting, such as partial omission of information and the intrusion of subjective factors. The algorithm of this embodiment ensures the comprehensiveness of feature data collection as far as possible; the aggregate model has high accuracy and good generalization, does not require GPU training, and meets the high timeliness requirement of intelligent verification.
In an embodiment, before the step of predicting the request data to be verified, the method further includes the following steps:
adaptively adjusting the number of the verification decision trees and the maximum tree depth in the set model by adopting a verification curve, and adjusting the number of sample data of each verification decision tree in the set model by adopting a learning curve;
and testing the adjusted aggregate model, and if the accuracy rate of the adjusted aggregate model obtained by testing is greater than or equal to the preset accuracy rate, using the adjusted aggregate model for prediction.
After the aggregate model is obtained, it may be under-fitted or over-fitted. The classification effect of the aggregate model can then be evaluated with a verification curve, which essentially shows the influence of a hyperparameter on the training score and the verification score, so that the optimal parameter is obtained. Specifically, this includes adaptively adjusting the number of verification decision trees and the maximum tree depth in the aggregate model with the verification curve, and adjusting the training set size of each verification decision tree with a learning curve to obtain the optimal training set size and improve the generalization of the aggregate model.
Further, only part of the data is used when training the verification decision trees and the remaining data is not used. Policy data from the remaining data can be supplied as parameters and, under the corresponding verification decision, predicted with the aggregate model to obtain the corresponding verification result and claim settlement result. If the predicted verification and claim settlement results are close to the actual ones and the accuracy of the overall claim settlement results reaches a preset threshold (for example 85%), the aggregate model is valid and can subsequently be applied to the policies to be verified.
In one embodiment, the present invention provides a computer readable storage medium, which may be a nonvolatile and/or volatile memory, having stored thereon a computer program, which when executed by a processor, implements the steps of the data prediction method in the above embodiment, such as steps S1 to S6 shown in fig. 1. Alternatively, the computer program when executed by the processor implements the functions of the respective modules/units of the data prediction apparatus in the above embodiments, such as the functions of the modules 101 to 106 shown in fig. 3. In order to avoid repetition, a description thereof is omitted.
Those skilled in the art will appreciate that all or part of the processes of the above embodiment methods may be implemented by instructing the relevant hardware through a computer program, and the computer program, when executed, may include the flows of the embodiments of the methods described above.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method.
The foregoing description is only of preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the contents of this description, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (6)

1. A method of data prediction, comprising:
collecting sample data of a verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
extracting a plurality of characteristic data in the request data, and calculating an important coefficient of each characteristic data in each verification decision tree according to a preset coefficient calculation method;
selecting target characteristic data from the plurality of characteristic data according to the important coefficient;
inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
combining the weak model sequences according to a preset combination mode to obtain a set model;
acquiring request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified;
the step of calculating the important coefficient of each feature data in each verification decision tree according to a predetermined coefficient calculation method specifically comprises the following steps: calculating the variation average value of each characteristic data at the node n of the verification decision tree, the variation average value of the node before the node n branches and the variation average value of the node after the node n branches by adopting a coefficient calculation formula; inputting a change amount mean value of the node n, a change amount mean value of the node before the node n branches and a change amount mean value of the node after the node n branches into a preset first formula for calculation to obtain an important coefficient of the characteristic data at the node n; inputting the important coefficient of the node n into a preset second formula for calculation to obtain the important coefficient of the characteristic data in the verification decision tree;
the first formula includes: w (W) in =G n -G P -G q ,W in Is characteristic data X i The important coefficient at the node n of the verification decision tree, i is the characteristic data X i Sequence number in the signature sequence, G n G is the variation mean value of the node n P For the change quantity average value of the node p before the node n branches, G q The mean value of the variation of the node q after the node n branches is obtained;
the second formula includes: R = W_in / Σ_{j=1}^{c} W_j, where R is the important coefficient of the characteristic data X_i in the verification decision tree, W_in is the important coefficient of the characteristic data X_i at node n of the verification decision tree, W_j is the important coefficient of the characteristic data X_i at node n of the j-th verification decision tree, c is the number of verification decision trees, and j is the sequence number of a verification decision tree;
the step of combining the weak model sequences according to a predetermined combination mode to obtain a set model specifically comprises the following steps: inputting the weak model sequence into a maximum function for operation to obtain the aggregate model, wherein the aggregate model is
U(X) = arg max_Z Σ_{i=1}^{k} L(u_i(X) = Z)
where U(X) is the aggregate model, X is the predetermined verification decision, u_i(X) is the i-th weak model in the weak model sequence, i is the sequence number of the weak model in the weak model sequence, k is the length of the weak model sequence, Z is a response variable, L is a collective indication function, and arg max is the maximum function.
2. The method for predicting data according to claim 1, wherein the step of obtaining the request data to be verified, predicting the request data to be verified by using the set model, and obtaining verification result data and processing result data corresponding to the request data to be verified further comprises:
adaptively adjusting the number of the verification decision trees and the maximum tree depth in the set model by adopting a verification curve, and adjusting the number of sample data of each verification decision tree in the set model by adopting a learning curve;
and testing the adjusted aggregate model, and if the accuracy rate of the adjusted aggregate model obtained by testing is greater than or equal to the preset accuracy rate, using the adjusted aggregate model for prediction.
3. The method of claim 1, wherein the collecting sample data of a verification decision tree comprises: and randomly and repeatedly extracting data with the same data quantity as the data quantity of the data set from the preset data set each time to serve as sample data of each verification decision tree.
4. A data prediction apparatus, comprising:
the acquisition module is used for acquiring sample data of the verification decision tree, wherein the sample data comprises request data, verification result data obtained by verifying the request data and processing result data obtained by performing corresponding post-processing on the verification result data;
the computing module is used for extracting a plurality of characteristic data in the request data and computing important coefficients of each characteristic data in each verification decision tree according to a preset coefficient computing method;
the selecting module is used for selecting target characteristic data from the plurality of characteristic data according to the important coefficient;
the training module is used for inputting the target characteristic data, the verification result data and the processing result data into a verification decision tree for training, and fitting a plurality of trained verification decision trees to obtain a weak model sequence;
the combination module is used for combining the weak model sequences according to a preset combination mode to obtain a set model;
the prediction module is used for obtaining request data to be verified, and predicting the request data to be verified by utilizing the set model to obtain verification result data and processing result data corresponding to the request data to be verified;
the step of calculating the important coefficient of each feature data in each verification decision tree according to a predetermined coefficient calculation method specifically comprises the following steps: calculating the variation average value of each characteristic data at the node n of the verification decision tree, the variation average value of the node before the node n branches and the variation average value of the node after the node n branches by adopting a coefficient calculation formula; inputting a change amount mean value of the node n, a change amount mean value of the node before the node n branches and a change amount mean value of the node after the node n branches into a preset first formula for calculation to obtain an important coefficient of the characteristic data at the node n; inputting the important coefficient of the node n into a preset second formula for calculation to obtain the important coefficient of the characteristic data in the verification decision tree;
the first formula includes: w (W) in =G n -G P -G q ,W in Is characteristic data X i The important coefficient at the node n of the verification decision tree, i is the characteristic data X i Sequence number in the signature sequence, G n G is the variation mean value of the node n P For the change quantity average value of the node p before the node n branches, G q The mean value of the variation of the node q after the node n branches is obtained;
the second formula includes: R = W_in / Σ_{j=1}^{c} W_j, where R is the important coefficient of the characteristic data X_i in the verification decision tree, W_in is the important coefficient of the characteristic data X_i at node n of the verification decision tree, W_j is the important coefficient of the characteristic data X_i at node n of the j-th verification decision tree, c is the number of verification decision trees, and j is the sequence number of a verification decision tree;
the step of combining the weak model sequences according to a predetermined combination mode to obtain a set model specifically comprises the following steps: inputting the weak model sequence into a maximum function for operation to obtain the aggregate model, wherein the aggregate model is
U(X) = arg max_Z Σ_{i=1}^{k} L(u_i(X) = Z)
where U(X) is the aggregate model, X is the predetermined verification decision, u_i(X) is the i-th weak model in the weak model sequence, i is the sequence number of the weak model in the weak model sequence, k is the length of the weak model sequence, Z is a response variable, L is a collective indication function, and arg max is the maximum function.
5. A computer device comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the steps of the data prediction method according to any one of claims 1 to 3.
6. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the data prediction method according to any of claims 1 to 3.
CN202110432977.0A 2021-04-21 2021-04-21 Data prediction method, device, equipment and storage medium Active CN113159175B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432977.0A CN113159175B (en) 2021-04-21 2021-04-21 Data prediction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110432977.0A CN113159175B (en) 2021-04-21 2021-04-21 Data prediction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113159175A CN113159175A (en) 2021-07-23
CN113159175B true CN113159175B (en) 2023-06-06

Family

ID=76868071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432977.0A Active CN113159175B (en) 2021-04-21 2021-04-21 Data prediction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113159175B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549954A (en) * 2018-03-26 2018-09-18 平安科技(深圳)有限公司 Risk model training method, risk identification method, device, equipment and medium
RU2017140974A3 (en) * 2017-11-24 2019-05-24
RU2017140969A (en) * 2017-11-24 2019-05-27 Общество С Ограниченной Ответственностью "Яндекс" Method and system for creating a forecast quality parameter for a predictive model performed in a machine learning algorithm
CN110264342A (en) * 2019-06-19 2019-09-20 深圳前海微众银行股份有限公司 A kind of business audit method and device based on machine learning
CN111562965A (en) * 2020-04-27 2020-08-21 深圳木成林科技有限公司 Page data verification method and device based on decision tree
WO2020249125A1 (en) * 2019-06-14 2020-12-17 第四范式(北京)技术有限公司 Method and system for automatically training machine learning model
CN112330471A (en) * 2020-11-17 2021-02-05 中国平安财产保险股份有限公司 Service data processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113159175A (en) 2021-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant