CN110517154A

CN110517154A - Data model training method, system and computer equipment

Info

Publication number: CN110517154A
Application number: CN201910665010.XA
Authority: CN
Inventors: 王进; 刘行行
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-07-23
Filing date: 2019-07-23
Publication date: 2019-11-29

Abstract

An embodiment of the invention provides a data model training method, comprising: obtaining a first training dataset, the first training dataset comprising multiple sample data sets of multiple customer samples; recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields; analyzing the risk feature column corresponding to each risk feature field to obtain the WoE (Weight of Evidence) of each preset risk feature item; and training multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model. The data anomaly detection model trained in this embodiment can effectively detect abnormal customers in policy underwriting, thereby solving the problem that some abnormal data affects the prediction accuracy of a data model.

Description

Data model training method, system and computer equipment

Technical field

Embodiments of the present invention relate to the field of computer data processing, and in particular to a data model training method, a system, a computer device, and a computer-readable storage medium.

Background

As people's insurance awareness grows, commercial insurance has become an important part of the social security system. According to available data, the number of policies at some insurers is in the tens of millions. After these policies are generated in the insurance system, they must be underwritten to determine whether the information in each policy meets the insurance requirements. At present, policies are usually underwritten manually. With the rapid development of big data mining, more and more reference data is available for underwriting, and the industry has begun to build data models on big data and to underwrite policies with those models.

However, little reference data is available for a new customer, for example only the limited data on the insurance application. For a data model, the more data there is and the more dimensions are analyzed, the higher the prediction accuracy. It is therefore necessary to obtain as much customer data as possible from every channel. Too many channels, however, easily lead to poor data quality: some of the data may be abnormal, and this abnormal data may affect the prediction accuracy of the data model. How to improve data quality, and thereby improve the prediction accuracy and prediction efficiency of the data model, is therefore a technical problem that urgently needs to be solved.

Summary of the invention

In view of this, embodiments of the present invention aim to provide a data model training method, a system, a computer device, and a computer-readable storage medium that can solve the problem that some abnormal data affects the prediction accuracy of a data model.

To achieve the above object, an embodiment of the present invention provides a data model training method, comprising the following steps:

obtaining a first training dataset, the first training dataset comprising multiple sample data sets of multiple customer samples;

recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;

analyzing the risk feature column corresponding to each risk feature field to obtain the WoE (Weight of Evidence) of each preset risk feature item; and

training multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

Preferably, the step of obtaining the first training dataset comprises:

configuring multiple risk feature fields of a target database according to the multiple preset risk feature items, and establishing mapping relations between each risk feature field and the corresponding fields in multiple data sources;

extracting multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relations; and

cleaning out invalid raw data sets that lack multiple items of risk feature data, to obtain the valid multiple sample data sets from the multiple raw data sets.

Preferably, the step of extracting multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relations comprises:

when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in the multiple data sources, selecting, according to a preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.

Preferably, the WoE of each preset risk feature item is calculated by a formula in which:

WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk assessment result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwritings in each value interval to the number of high-risk underwritings in all intervals; and Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwritings in each value interval to the number of non-high-risk underwritings in all intervals.
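The formula itself does not appear in this text. Assuming the standard weight-of-evidence definition, which is consistent with the symbols defined above, it would take the following form. The counts y_i and z_i (high-risk and non-high-risk underwritings in value interval i) and the totals y_T and z_T over all intervals are notation introduced here for illustration only:

```latex
\mathrm{WoE}_i = \ln\frac{Py_i}{Pz_i}
              = \ln\frac{y_i / y_T}{z_i / z_T}
```

Under this assumed form, a positive WoE_i means that interval i holds a larger share of the high-risk underwritings than of the non-high-risk ones.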

Preferably, the step of training multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain the data anomaly detection model, comprises:

obtaining multiple customer sample subsets from the multiple customer samples by sampling with replacement;

an isolation tree construction step: (1) selecting one customer sample subset from the multiple customer sample subsets; (2) selecting one preset risk feature item and a corresponding value according to the multiple WoE values; (3) partitioning the customer samples in the selected customer sample subset; and (4) repeating steps (2) and (3) until the isolation tree reaches a set height limit; and

repeating the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees being combined into the data anomaly detection model.

Preferably, the method further comprises:

obtaining a second training dataset, the second training dataset comprising multiple target sample data sets of multiple target customer samples;

inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;

screening the second training dataset according to the anomaly coefficient of each target sample data set to obtain a third training dataset, the anomaly coefficient of each target sample data set in the third training dataset being greater than a preset threshold; and

training a policy underwriting risk assessment model according to the third training dataset.

Preferably, the anomaly coefficient of each target sample data set is calculated by the following formulas:

$s(x, n) = 2^{-E(h(x)) / c(n)}$

$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$

$H(k) = \ln(k) + \xi$

where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.

To achieve the above object, an embodiment of the present invention also provides a data model training system, comprising:

an acquisition module, configured to obtain a first training dataset, the first training dataset comprising multiple sample data sets of multiple customer samples;

a recording module, configured to record the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;

an analysis module, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and

an anomaly detection model training module, configured to train multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

To achieve the above object, an embodiment of the present invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program implementing the steps of the data model training method described above when executed by the processor.

To achieve the above object, an embodiment of the present invention also provides a computer-readable storage medium storing a computer program, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the data model training method described above.

The data model training method, system, computer device, and computer-readable storage medium provided by the embodiments of the present invention train multiple isolation trees on the risk feature data corresponding to multiple preset risk feature items in multiple customer samples, to build a data anomaly detection model composed of the multiple isolation trees. The resulting data anomaly detection model can effectively detect abnormal customers in policy underwriting, which solves the problem that some abnormal data affects the prediction accuracy of the data model and improves prediction efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of embodiment one of the data model training method of the present invention.

Fig. 2 is a flowchart of embodiment two of the data model training method of the present invention.

Fig. 3 is a schematic diagram of the program modules of embodiment three of the data model training system of the present invention.

Fig. 4 is a schematic diagram of the hardware structure of embodiment four of the computer device of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be deemed not to exist and falls outside the protection scope claimed by the present invention.

The following embodiments are described by way of example with computer device 2 as the executing entity.

Embodiment one

Referring to Fig. 1, a flowchart of the steps of the data model training method of embodiment one of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.

Step S100: obtain a first training dataset, the first training dataset comprising multiple sample data sets of multiple customer samples.

Specifically, the multiple sample data sets may be obtained from a variety of sources, such as the company's internal database or an external third-party database; the third-party database may be the business database of a third-party research firm or the public database of a credit reporting system. Each sample data set may include, for the corresponding customer sample: basic customer information (gender, age, occupation, education, etc.), social security information, policy information (e.g., insured amount, insurance type information, social relationship with the beneficiary, etc.), family information, credit information (e.g., credit records, overdue loan information, etc.), fund flow information (e.g., monthly fund inflow label, monthly fund outflow label, etc.), and various kinds of behavioral data, including but not limited to internet information (e.g., purchase behavior information).

In an exemplary embodiment, step S100 further comprises: performing ETL (Extract-Transform-Load, that is, extraction, transformation, and loading of data) operations on the raw data of each data source to obtain the multiple sample data sets. The details are as follows.

Step S100A: configure multiple risk feature fields of a target database according to multiple preset risk feature items, and establish mapping relations between each risk feature field and the corresponding fields in multiple data sources.

Each risk feature field may be mapped to multiple fields in multiple data sources.

Step S100B: extract multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relations.

For example, when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in the multiple data sources, the risk feature data from the data source with the highest weight coefficient is selected according to the preset weight coefficient of each data source.

Step S100C: clean out invalid raw data sets that lack multiple items of risk feature data, to obtain the valid multiple sample data sets from the multiple raw data sets.

Step S102: record the multiple sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields.

Each customer sample corresponds to one sample data set, and each sample data set contains multiple sample data. Each sample datum is filled into the corresponding risk feature field, so that the sample data of the multiple sample data sets form N feature columns, which correspond to the N preset risk feature items. For example:

a₁₁, a₂₁, a₃₁, ..., a_M1 are filled into the risk feature field of one field name to form one risk feature column; a₁₂, a₂₂, a₃₂, ..., a_M2 are filled into the risk feature field of another field name to form another risk feature column; ...; and a_1N, a_2N, a_3N, ..., a_MN are filled into the risk feature field of a further field name to form a further risk feature column.

For example, taking the first risk feature field, whose field name is "Age": a₁₁ denotes the age of the first customer sample, a₂₁ the age of the second customer sample, a₃₁ the age of the third customer sample, ..., and a_M1 the age of the M-th customer sample. That is, the M ages of the M customer samples form one risk feature column based on age information.

Each customer sample includes N preset risk feature items, such as "age" and "fund inflow data".
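As a minimal sketch of step S102, the M sample data sets can be treated as rows and regrouped into N risk feature columns. The function name and the field names used in the example are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def build_risk_feature_columns(
    sample_data_sets: Sequence[Dict[str, Any]],
    field_names: Sequence[str],
) -> Dict[str, List[Any]]:
    """Regroup M per-customer records into N columns, one per risk feature field."""
    columns: Dict[str, List[Any]] = {name: [] for name in field_names}
    for sample in sample_data_sets:          # one row per customer sample
        for name in field_names:             # one column per risk feature field
            columns[name].append(sample[name])
    return columns


# Hypothetical data: three customer samples, two risk feature fields.
samples = [
    {"Age": 25, "MonthlyInflow": 100},
    {"Age": 40, "MonthlyInflow": 300},
    {"Age": 33, "MonthlyInflow": 50},
]
columns = build_risk_feature_columns(samples, ["Age", "MonthlyInflow"])
```

Here `columns["Age"]` plays the role of the column a₁₁, a₂₁, ..., a_M1 in the text.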

Step S104: analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item.

For example, the WoE of each preset risk feature item is calculated by a formula in which:

WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk assessment result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwritings in each value interval to the number of high-risk underwritings in all intervals; Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwritings in each value interval to the number of non-high-risk underwritings in all intervals; and 1 ≤ i ≤ N.

Take the calculation of the WoE of the preset risk feature item (customer age) corresponding to the risk feature column (a₁₁, a₂₁, a₃₁, ..., a_M1) as an example. WoE_i (Weight of Evidence) is a way of discretizing numerical values. Py_i denotes, after the feature column has been discretized, the ratio of the number of high-risk underwritings in each age interval to the number of high-risk underwritings in all age intervals. That is, from the age data a₁₁, a₂₁, a₃₁, ..., a_M1, count the customer samples with high-risk underwriting in a specified age interval (e.g., 20 to 25 years old), count the customer samples with high-risk underwriting in all age intervals, and take the ratio of the former to the latter. Pz_i denotes the ratio of the number of non-high-risk underwritings in each age interval to the number of non-high-risk underwritings in all age intervals. That is, from the same age data, count the customer samples with non-high-risk underwriting in the specified age interval (e.g., 20 to 25 years old), count the customer samples with non-high-risk underwriting in all age intervals, and take the ratio of the former to the latter.

The value of WoE_i is computed from the customer distribution in each sub-interval. The WoE_i of each sub-interval reflects the difference between the ratio of high-risk to non-high-risk underwritings in that segment and the ratio of high-risk to non-high-risk underwritings overall.
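The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the formula of the embodiment: it assumes the standard form ln(Py / Pz), half-open value intervals, and an additive smoothing term `eps` (introduced here only to avoid taking the logarithm of zero for empty intervals).

```python
import math
from typing import Dict, Sequence, Tuple

Interval = Tuple[float, float]  # half-open interval [low, high)


def woe_by_interval(
    values: Sequence[float],
    high_risk: Sequence[bool],
    bins: Sequence[Interval],
    eps: float = 0.5,
) -> Dict[Interval, float]:
    """Return ln(Py / Pz) for each value interval of one risk feature column."""
    y_counts = [0] * len(bins)  # high-risk underwritings per interval
    z_counts = [0] * len(bins)  # non-high-risk underwritings per interval
    for value, is_high in zip(values, high_risk):
        for idx, (low, high) in enumerate(bins):
            if low <= value < high:
                if is_high:
                    y_counts[idx] += 1
                else:
                    z_counts[idx] += 1
                break
    y_total, z_total = sum(y_counts), sum(z_counts)
    k = len(bins)
    woe: Dict[Interval, float] = {}
    for idx, interval in enumerate(bins):
        py = (y_counts[idx] + eps) / (y_total + eps * k)  # share of high-risk
        pz = (z_counts[idx] + eps) / (z_total + eps * k)  # share of non-high-risk
        woe[interval] = math.log(py / pz)
    return woe


# Hypothetical age column with underwriting labels.
ages = [22, 23, 24, 30, 32, 35, 21, 38]
labels = [True, True, False, True, False, False, True, False]
age_woe = woe_by_interval(ages, labels, bins=[(20, 26), (26, 40)], eps=0.0)
```

In this example the 20 to 26 interval holds three of the four high-risk samples and one of the four non-high-risk samples, so its WoE is positive, while the 26 to 40 interval has the opposite sign.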

Step S106: train multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

An isolation forest (iForest) consists of multiple trees, each of which is called an isolation tree (iTree). Let T be an isolation tree; T is either an external node with no child nodes or an internal node with two child nodes (Tl, Tr). A preset risk feature item and a corresponding value (that is, a split value of the preset risk feature item) can be specified; the split value lies between the maximum and minimum values of the corresponding preset risk feature item, and it divides the multiple sample data sets into Tl and Tr.

In an exemplary embodiment, step S106 further comprises:

Step S106A: obtain multiple customer sample subsets from the multiple customer samples by sampling with replacement.

Step S106B, the isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoE values; (3) partition the customer samples in the selected customer sample subset; and (4) repeat steps (2) and (3) until the isolation tree reaches the set height limit.

Step S106B is repeated to obtain multiple isolation trees, which are combined into the data anomaly detection model.
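Steps S106A and S106B can be sketched as below. The text does not say exactly how the WoE values guide the choice of risk feature item, so this sketch makes an explicit assumption: a feature is drawn with probability proportional to the absolute value of a single per-feature WoE score, and the split value is drawn uniformly between the feature's minimum and maximum in the current node. All names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

Sample = Dict[str, float]


@dataclass
class Node:
    size: int                          # number of samples that reached this node
    feature: Optional[str] = None      # split feature (None for external nodes)
    split: Optional[float] = None      # split value (None for external nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(samples: List[Sample], woe: Dict[str, float],
               height_limit: int, rng: random.Random, depth: int = 0) -> Node:
    """Recursively partition samples until the height limit is reached."""
    if depth >= height_limit or len(samples) <= 1:
        return Node(size=len(samples))
    # Only features that still take more than one value can split the node.
    candidates = [f for f in woe
                  if min(s[f] for s in samples) < max(s[f] for s in samples)]
    if not candidates:
        return Node(size=len(samples))
    # Assumption: selection probability proportional to |WoE| of the feature.
    weights = [abs(woe[f]) + 1e-9 for f in candidates]
    feature = rng.choices(candidates, weights=weights, k=1)[0]
    low = min(s[feature] for s in samples)
    high = max(s[feature] for s in samples)
    split = rng.uniform(low, high)     # split value between min and max
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return Node(len(samples), feature, split,
                build_tree(left, woe, height_limit, rng, depth + 1),
                build_tree(right, woe, height_limit, rng, depth + 1))


def build_forest(customers: List[Sample], woe: Dict[str, float], n_trees: int,
                 subset_size: int, height_limit: int, seed: int = 0) -> List[Node]:
    """S106A plus repeated S106B: one tree per subset drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        subset = rng.choices(customers, k=subset_size)  # sampling with replacement
        forest.append(build_tree(subset, woe, height_limit, rng))
    return forest
```

Other readings of "according to the multiple WoE values" are possible, for example ranking features by WoE or using per-interval WoE boundaries as candidate split values; the weighted draw above is only one option.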

Embodiment two

Based on the data anomaly detection model trained in embodiment one, embodiment two trains a policy underwriting risk assessment model.

Referring to Fig. 2, a flowchart of the steps of the data model training method of embodiment two of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.

Step S200: obtain a second training dataset, the second training dataset comprising multiple target sample data sets of multiple target customer samples.

Step S202: input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model.

The target sample data set traverses each isolation tree. Starting from the root node of each isolation tree, it moves from the root node to a leaf node according to the preset risk feature items and split values selected when that isolation tree was built: if the value of the corresponding preset risk feature item in the target sample data set is less than the split value at a node, the target sample data set descends into the left subtree; otherwise it descends into the right subtree. This continues until a leaf node is reached, and the number of edges traversed along the way is recorded as the path length in that isolation tree. The anomaly coefficient of the target sample data set is then calculated from its path lengths in all the isolation trees.
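The traversal just described can be sketched as follows. The node class and the hand-built example tree are hypothetical; the function counts edges only, as the text describes, and assumes every internal node has exactly two children.

```python
from typing import Dict, Optional


class ITreeNode:
    """Node of an isolation tree; a leaf (external node) has no children."""

    def __init__(self, feature: Optional[str] = None, split: Optional[float] = None,
                 left: Optional["ITreeNode"] = None,
                 right: Optional["ITreeNode"] = None) -> None:
        self.feature = feature
        self.split = split
        self.left = left
        self.right = right

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def path_length(root: ITreeNode, sample: Dict[str, float]) -> int:
    """Count the edges walked from the root to the leaf that the sample reaches."""
    node, edges = root, 0
    while not node.is_leaf():
        # Below the split value: left subtree; otherwise: right subtree.
        node = node.left if sample[node.feature] < node.split else node.right
        edges += 1
    return edges


# Hypothetical two-level tree: split on age, then on monthly inflow.
tree = ITreeNode("age", 30.0,
                 left=ITreeNode(),
                 right=ITreeNode("monthly_inflow", 1000.0,
                                 left=ITreeNode(), right=ITreeNode()))
young = path_length(tree, {"age": 25.0, "monthly_inflow": 500.0})
older = path_length(tree, {"age": 40.0, "monthly_inflow": 500.0})
```

A sample that is separated near the root, such as the first one here, gets a short path, which is what later drives its anomaly coefficient upward.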

For example, n target customer samples are randomly selected from M given customer samples, the n target customer samples correspond to n target sample data sets, and the anomaly coefficient of each target sample data set is calculated by the following formulas:

$s(x, n) = 2^{-E(h(x)) / c(n)}$

$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$

$H(k) = \ln(k) + \xi$

where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.

The closer s(x, n) is to 1, the more likely the target sample data set is an isolated anomaly; an s(x, n) below 0.5 indicates that the target sample data set is a normal data set.
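A minimal numerical sketch of these quantities is given below. It uses the logarithmic approximation of the harmonic number stated in the text and assumes n > 1; the function names are chosen here for illustration.

```python
import math
from typing import Sequence

EULER_CONSTANT = 0.5772156649015329  # the constant written as xi in the text


def harmonic(k: float) -> float:
    """Approximate harmonic number H(k) = ln(k) + xi."""
    return math.log(k) + EULER_CONSTANT


def average_path_length(n: int) -> float:
    """c(n) = 2 H(n - 1) - 2 (n - 1) / n, defined here for n > 1 only."""
    if n <= 1:
        raise ValueError("c(n) requires n > 1")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_coefficient(path_lengths: Sequence[float], n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), with E(h(x)) the mean over all trees."""
    mean_length = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_length / average_path_length(n))


# Path lengths of one sample in three trees built from n = 256 samples each.
short_paths = anomaly_coefficient([2, 3, 2], n=256)
long_paths = anomaly_coefficient([20, 22, 21], n=256)
```

Short average paths give a coefficient above 0.5 and long ones a coefficient below 0.5; a mean path length exactly equal to c(n) gives 0.5.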

Step S204: screen the second training dataset according to the anomaly coefficient of each target sample data set to obtain a third training dataset.

The anomaly coefficient of each target sample data set in the third training dataset is greater than a preset threshold.
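The screening in step S204 amounts to a threshold filter. The sketch below follows the condition exactly as stated in the text (coefficient greater than the preset threshold); the function name and the example threshold are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def select_third_training_set(
    target_data_sets: Sequence[Dict[str, Any]],
    coefficients: Sequence[float],
    threshold: float,
) -> List[Dict[str, Any]]:
    """Keep the target sample data sets whose anomaly coefficient exceeds threshold."""
    return [data for data, score in zip(target_data_sets, coefficients)
            if score > threshold]


# Hypothetical second training dataset with precomputed anomaly coefficients.
second_set = [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}]
scores = [0.42, 0.71, 0.55]
third_set = select_third_training_set(second_set, scores, threshold=0.5)
```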

Step S206: train a policy underwriting risk assessment model according to the third training dataset.

The policy underwriting risk assessment model may be an LR (Logistic Regression) model, an FM (Factorization Machine) model, a deep neural network model, or a combination of these models.

Embodiment three

Referring further to Fig. 3, a schematic diagram of the program modules of embodiment three of the data model training system of the present invention is shown. In this embodiment, the data model training system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the data model training method described above. A program module in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the data model training system 20 in the storage medium. The functions of the program modules of this embodiment are described below:

The acquisition module 200 is configured to obtain a first training dataset, the first training dataset comprising multiple sample data sets of multiple customer samples.

In an exemplary embodiment, the acquisition module 200 is configured to: configure multiple risk feature fields of a target database according to multiple preset risk feature items, and establish mapping relations between each risk feature field and the corresponding fields in multiple data sources; extract multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relations; and clean out invalid raw data sets that lack multiple items of risk feature data, to obtain the valid multiple sample data sets from the multiple raw data sets.

Extracting multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relations comprises: when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in the multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.

The recording module 202 is configured to record the multiple sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields.

In the exemplary embodiment,

The analysis module 204 is configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item.

In an exemplary embodiment, the WoE of each preset risk feature item is calculated in the same way as in step S104 of embodiment one.

The anomaly detection model training module 206 is configured to train multiple isolation trees with the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

In an exemplary embodiment, the anomaly detection model training module 206 is further configured to: obtain multiple customer sample subsets from the multiple customer samples by sampling with replacement; perform an isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoE values; (3) partition the customer samples in the selected customer sample subset; and (4) repeat steps (2) and (3) until the isolation tree reaches the set height limit; and repeat the isolation tree construction step to obtain multiple isolation trees, which are combined into the data anomaly detection model.

In an exemplary embodiment, the system 20 further includes a risk assessment model training module 208, configured to: obtain a second training dataset, the second training dataset comprising multiple target sample data sets of multiple target customer samples; input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model; screen the second training dataset according to the anomaly coefficient of each target sample data set to obtain a third training dataset, the anomaly coefficient of each target sample data set in the third training dataset being greater than a preset threshold; and train a policy underwriting risk assessment model according to the third training dataset.

The anomaly coefficient of each target sample data set is calculated by the following formulas, with the symbols as defined in embodiment two:

$s(x, n) = 2^{-E(h(x)) / c(n)}$

$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$

$H(k) = \ln(k) + \xi$

Example IV

It is the hardware structure schematic diagram of the computer equipment of the embodiment of the present invention four refering to Fig. 4.It is described in the present embodiment Computer equipment 2 is that one kind can be automatic to carry out numerical value calculating and/or information processing according to the instruction for being previously set or storing Equipment.The computer equipment 2 can be rack-mount server, blade server, tower server or Cabinet-type server (including server cluster composed by independent server or multiple servers) etc..As shown, the computer equipment 2 include at least, but are not limited to, can be in communication with each other by system bus connection memory 21, processor 22, network interface 23, with And data model training system 20.Wherein:

In the present embodiment, memory 21 includes at least a type of computer readable storage medium, the readable storage Medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory etc.), random access storage device (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read only memory (PROM), magnetic storage, disk, CD etc..In some embodiments, memory 21 can be the internal storage unit of computer equipment 2, such as the hard disk or memory of the computer equipment 2.In other implementations In example, memory 21 is also possible to the grafting being equipped on the External memory equipment of computer equipment 2, such as the computer equipment 20 Formula hard disk, intelligent memory card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card, flash card (Flash Card) etc..Certainly, memory 21 can also both including computer equipment 2 internal storage unit and also including outside it Store equipment.In the present embodiment, memory 21 is installed on the operating system and types of applications of computer equipment 2 commonly used in storage Software, for example, embodiment three data model training system 20 program code etc..In addition, memory 21 can be also used for temporarily Ground stores the Various types of data that has exported or will export.

In some embodiments, the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer equipment 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the data model training system 20, so as to implement the data model training method of Embodiment One or Two.

The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer equipment 2 and other electronic devices. For example, the network interface 23 is used to connect the computer equipment 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer equipment 2 and the external terminal. The network may be a wireless or wired network such as an intranet (Intranet), the Internet (Internet), the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It should be pointed out that Fig. 4 shows only the computer equipment 2 with components 20-23. It should be understood that not all of the components shown are required; more or fewer components may be implemented instead.

In this embodiment, the data model training system 20 stored in the memory 21 may further be divided into one or more program modules. The one or more program modules are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention.

For example, Fig. 3 shows a schematic diagram of the program modules implementing Embodiment Three of the data model training system 20. In that embodiment, the data model training system 20 may be divided into an obtaining module 200, a recording module 202, an analysis module 204, a training module 206, and a risk evaluation model training module 208. A program module, as referred to in the present invention, is a series of computer program instruction segments capable of completing a specific function, and is more suitable than a program for describing the execution process of the data model training system 20 in the computer equipment 2. The specific functions of the program modules 200-208 have been described in detail in Embodiment Three and are not repeated here.

Embodiment Five

This embodiment also provides a computer readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, app store, etc., on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer readable storage medium of this embodiment is used to store the data model training system 20, which implements the data model training method of Embodiment One or Two when executed by a processor.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application thereof in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A data model training method, characterized in that the method comprises:

obtaining a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples;

recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;

analyzing the risk feature column corresponding to each risk feature field, to obtain the WoE of each preset risk feature item; and

training multiple isolation trees with the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

2. The data model training method as claimed in claim 1, characterized in that the step of obtaining the first training data set comprises:

configuring multiple risk feature fields of a target database according to the multiple preset risk feature items, and establishing a mapping relation between each risk feature field and the corresponding fields in multiple data sources;

extracting, according to the mapping relations, multiple raw data sets of the multiple customer samples from the multiple data sources; and

cleaning out invalid raw data sets that lack multiple risk feature data, so as to obtain the valid multiple sample data sets from the multiple raw data sets.

3. The data model training method as claimed in claim 2, characterized in that the step of extracting, according to the mapping relations, multiple raw data sets of the multiple customer samples from the multiple data sources comprises:

when the same risk feature field of the same customer sample corresponds to multiple risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
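The selection rule of claim 3 can be sketched as follows. This is a minimal illustration only; the function name `pick_by_source_weight`, the data-source names, and the choice to treat a source without a configured weight as having weight 0.0 are assumptions and do not come from the specification.

```python
def pick_by_source_weight(candidates, source_weights):
    """Return the risk feature value from the highest-weighted data source.

    candidates: dict mapping data-source name -> value of one risk feature
        field for one customer sample (hypothetical structure).
    source_weights: dict mapping data-source name -> preset weight coefficient.
    """
    if not candidates:
        raise ValueError("no candidate values for this risk feature field")
    # Assumption: a source with no configured weight counts as weight 0.0.
    best_source = max(candidates, key=lambda name: source_weights.get(name, 0.0))
    return candidates[best_source]


# Hypothetical example: three sources report different values for one field.
candidates = {"crm": 42, "claims_db": 45, "third_party": 40}
source_weights = {"crm": 0.5, "claims_db": 0.9, "third_party": 0.2}
chosen = pick_by_source_weight(candidates, source_weights)
```

Here `chosen` takes the value reported by `claims_db`, the source with the largest weight coefficient.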

4. The data model training method as claimed in claim 3, characterized in that the WoE of each preset risk feature item is calculated by the following formula:

WoE_i = ln(Py_i / Pz_i)

where WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk evaluation result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in each value interval to the number of high-risk underwriting cases in all intervals; and Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in each value interval to the number of non-high-risk underwriting cases in all intervals.
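As a worked illustration of the WoE definitions in claim 4, the sketch below computes the standard weight of evidence, ln(Py / Pz), for each value interval of a single risk feature item. The function name `woe_by_interval`, the interval labels, and the optional smoothing count `eps` are assumptions added for illustration; the specification does not describe how empty intervals are handled.

```python
import math
from collections import Counter


def woe_by_interval(intervals, is_high_risk, eps=0.0):
    """Compute WoE = ln(Py / Pz) for each value interval of one feature.

    intervals: interval label of each customer sample.
    is_high_risk: parallel list of booleans (True = high-risk underwriting).
    eps: optional smoothing count added to every interval (an assumption);
        with eps=0.0 every interval must contain both classes.
    """
    high, non_high = Counter(), Counter()
    for label, flag in zip(intervals, is_high_risk):
        if flag:
            high[label] += 1
        else:
            non_high[label] += 1
    labels = sorted(set(intervals))
    total_high = sum(high.values()) + eps * len(labels)
    total_non_high = sum(non_high.values()) + eps * len(labels)
    woe = {}
    for label in labels:
        py = (high[label] + eps) / total_high          # share of high-risk cases
        pz = (non_high[label] + eps) / total_non_high  # share of non-high-risk cases
        woe[label] = math.log(py / pz)
    return woe


# Hypothetical data: an "age" feature binned into two intervals.
intervals = ["young", "young", "young", "old", "old", "old"]
flags = [True, True, False, True, False, False]
woe = woe_by_interval(intervals, flags)
```

In this toy data the "young" interval holds 2 of 3 high-risk cases and 1 of 3 non-high-risk cases, so its WoE is ln 2, and the "old" interval mirrors it with a WoE of ln 0.5.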

5. The data model training method as claimed in claim 4, characterized in that the step of training multiple isolation trees with the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model, comprises:

an isolation tree construction step: (1) selecting one customer sample set from the multiple customer sample sets; (2) selecting, according to the multiple WoEs, one of the preset risk feature items and a corresponding value; (3) partitioning the multiple customer samples in the selected customer sample set; and (4) repeating the above steps (2) to (4) until the isolation tree reaches the set height limit; and

repeating the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees being combined into the data anomaly detection model.
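The construction step of claim 5 can be sketched as a recursive partition. The specification does not state how the WoEs steer the choice of feature, so this sketch assumes a feature is drawn with probability proportional to the absolute value of a per-feature WoE weight; that rule, the uniform choice of split value, the height limit of ceil(log2(subsample size)), and all names (`build_isolation_tree`, `build_forest`, `leaf_sizes`) are illustrative assumptions.

```python
import math
import random


def build_isolation_tree(samples, feature_woe, height_limit, rng, depth=0):
    """Recursively partition samples until the height limit is reached.

    samples: list of dicts mapping feature name -> numeric value.
    feature_woe: dict mapping feature name -> WoE-derived weight (assumed).
    """
    if depth >= height_limit or len(samples) <= 1:
        return {"size": len(samples)}
    # Only features whose values still differ can split the samples.
    splittable = [f for f in feature_woe
                  if min(s[f] for s in samples) < max(s[f] for s in samples)]
    if not splittable:
        return {"size": len(samples)}
    # Assumption: features with larger |WoE| are chosen more often.
    weights = [abs(feature_woe[f]) + 1e-9 for f in splittable]
    feature = rng.choices(splittable, weights=weights, k=1)[0]
    low = min(s[feature] for s in samples)
    high = max(s[feature] for s in samples)
    split = rng.uniform(low, high)
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_isolation_tree(left, feature_woe, height_limit, rng, depth + 1),
        "right": build_isolation_tree(right, feature_woe, height_limit, rng, depth + 1),
    }


def build_forest(samples, feature_woe, n_trees, subsample_size, seed=0):
    """Repeat the construction step to obtain several isolation trees."""
    rng = random.Random(seed)
    size = min(subsample_size, len(samples))
    height_limit = max(1, math.ceil(math.log2(size)))  # assumed height limit
    return [build_isolation_tree(rng.sample(samples, size), feature_woe,
                                 height_limit, rng)
            for _ in range(n_trees)]


def leaf_sizes(node):
    """Collect the number of samples held in every leaf of a tree."""
    if "size" in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


# Hypothetical customer samples with two numeric risk features.
samples = [{"age": 20 + i, "claims": i % 4} for i in range(16)]
forest = build_forest(samples, {"age": 0.7, "claims": -0.3},
                      n_trees=5, subsample_size=8)
```

Each tree is a nested dict whose leaves record how many samples ended up there, which is enough to derive path lengths for scoring.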

6. The data model training method as claimed in claim 5, characterized in that the method further comprises:

obtaining a second training data set, the second training data set comprising multiple target sample data sets of multiple target customer samples;

inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;

screening, according to the anomaly coefficient of each target sample data set, the second training data set to obtain a third training data set, the anomaly coefficient of each target sample data set in the third training data set being greater than a preset threshold.
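The screening step of claim 6 amounts to a threshold filter over the coefficients output by the detection model. A minimal sketch follows, assuming the coefficients are held in a dict keyed by a hypothetical data-set identifier; the function name and the example values are illustrative only.

```python
def screen_by_coefficient(coefficients, threshold):
    """Return the ids whose coefficient is strictly greater than threshold.

    coefficients: dict mapping target sample data set id -> coefficient
        output by the detection model (hypothetical structure).
    """
    return [sample_id for sample_id, value in coefficients.items()
            if value > threshold]


# Hypothetical coefficients for three target sample data sets.
coefficients = {"cust_a": 0.42, "cust_b": 0.71, "cust_c": 0.55}
third_training_ids = screen_by_coefficient(coefficients, threshold=0.5)
```

The comparison is strict, matching the wording that every retained coefficient is greater than the preset threshold.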

7. The data model training method as claimed in claim 6, characterized in that the anomaly coefficient of each target sample data set is calculated by the following formulas:

s(x, n) = 2^(-E(h(x)) / c(n))

c(n) = 2H(n-1) - (2(n-1)/n)

H(k) = ln(k) + ξ

where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in the isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.
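The formulas of claim 7 can be checked numerically with a short sketch. It uses the standard isolation forest score s(x, n) = 2^(-E(h(x)) / c(n)) together with the stated c(n) and H(k); the function names, the rounded value of Euler's constant, and the guard for n <= 1 are assumptions added for illustration.

```python
import math

EULER_CONSTANT = 0.5772156649  # xi in H(k) = ln(k) + xi


def harmonic(k):
    """Approximate harmonic number H(k) = ln(k) + xi."""
    return math.log(k) + EULER_CONSTANT


def average_path_length(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, defined here for n > 1."""
    if n <= 1:
        raise ValueError("c(n) requires n > 1")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_coefficient(path_lengths, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)).

    path_lengths: h(x) for one target sample data set in each isolation tree.
    n: number of target sample data sets used to build each tree.
    """
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / average_path_length(n))


# Hypothetical path lengths of two data sets across three trees, n = 256.
short_paths = anomaly_coefficient([2, 3, 2], 256)
long_paths = anomaly_coefficient([10, 12, 11], 256)
```

A data set that is isolated after few splits (short paths) receives a coefficient closer to 1, while one whose mean path length equals c(n) receives exactly 0.5.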

8. A data model training system, characterized by comprising:

an obtaining module, for obtaining a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples;

a recording module, for recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;

an analysis module, for analyzing the risk feature column corresponding to each risk feature field, to obtain the WoE of each preset risk feature item; and

an anomaly detection model training module, for training multiple isolation trees with the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.

9. A computer equipment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the steps of the data model training method as claimed in any one of claims 1 to 7 are implemented when the computer program is executed by the processor.

10. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, the computer program being executable by at least one processor so that the at least one processor performs the steps of the data model training method as claimed in any one of claims 1 to 7.