CN110517154A - Data model training method, system and computer equipment - Google Patents

Data model training method, system and computer equipment

Info

Publication number
CN110517154A
Authority
CN
China
Prior art keywords
risk
feature
data
data set
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910665010.XA
Other languages
Chinese (zh)
Inventor
王进
刘行行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910665010.XA priority Critical patent/CN110517154A/en
Publication of CN110517154A publication Critical patent/CN110517154A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention provides a data model training method, comprising: obtaining a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples; recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields; analyzing the risk feature column corresponding to each risk feature field to obtain the WoE (Weight of Evidence) of each preset risk feature item; and training multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model. The data anomaly detection model trained in this embodiment can effectively detect abnormal customers in policy underwriting, thereby solving the problem that some abnormal data degrades the prediction accuracy of a data model.

Description

Data model training method, system and computer equipment
Technical field
Embodiments of the present invention relate to the field of computer data processing, and in particular to a data model training method, system, computer device, and computer-readable storage medium.
Background
As people's insurance awareness gradually increases, commercial insurance has become an important part of today's social security system. According to available data, the number of policies held by some insurers is in the tens of millions. After these policies are generated in the insurance system, they must be underwritten to determine whether the information in each policy meets the insurance requirements. At present, underwriting is usually performed manually. With the rapid development of big data mining, more and more reference data is available for underwriting, and the industry has begun to build data models on big data and to underwrite policies with those models.
However, little reference data is available for a new customer; for example, it may be limited to the data on the insurance application. For a data model, the more data and the more analysis dimensions there are, the higher the prediction accuracy. It is therefore necessary to obtain as much customer data as possible from various channels. Too many channels, however, tend to lower data quality: some of the data is abnormal, and this abnormal data may affect the prediction accuracy of the data model. How to improve data quality, and thereby the prediction accuracy and prediction efficiency of the data model, is therefore one of the technical problems that urgently needs to be solved.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a data model training method, system, computer device, and computer-readable storage medium that can solve the problem of abnormal data degrading the prediction accuracy of a data model.
To achieve the above purpose, an embodiment of the present invention provides a data model training method, comprising the following steps:
obtaining a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples;
recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
analyzing the risk feature column corresponding to each risk feature field to obtain the WoE (Weight of Evidence) of each preset risk feature item; and
training multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
Preferably, the step of obtaining the first training data set comprises:
configuring multiple risk feature fields of a target database according to the multiple preset risk feature items, and establishing a mapping between each risk feature field and the corresponding fields in multiple data sources;
extracting multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping; and
cleaning out invalid raw data sets that lack multiple items of risk feature data, so as to obtain the valid sample data sets from the multiple raw data sets.
Preferably, the step of extracting multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping comprises:
when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
Preferably, the WoE of each preset risk feature item is calculated as follows:
$$\mathrm{WoE}_i = \ln\left(\frac{Py_i}{Pz_i}\right)$$
where WoE_i is the influence coefficient of the values of preset risk feature item i on the policy underwriting risk assessment result; Py_i is the ratio of the number of high-risk underwriting cases in each value interval of preset risk feature item i to the number of high-risk underwriting cases over all intervals; and Pz_i is the ratio of the number of non-high-risk underwriting cases in each value interval of preset risk feature item i to the number of non-high-risk underwriting cases over all intervals.
Preferably, the step of training multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model, comprises:
sampling with replacement from the multiple customer samples to obtain multiple customer sample subsets;
an isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoEs; (3) split the customer samples in the selected customer sample subset; (4) repeat steps (2) to (4) until the isolation tree reaches the set height limit; and
repeating the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees together forming the data anomaly detection model.
Preferably, the method further comprises:
obtaining a second training data set, the second training data set comprising multiple target sample data sets of multiple target customer samples;
inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;
screening the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, the anomaly coefficient of every target sample data set in the third training data set being greater than a preset threshold; and
training a policy underwriting risk assessment model according to the third training data set.
Preferably, the anomaly coefficient of each target sample data set is calculated by the following formulas:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(k) = \ln(k) + \xi$$
where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient that target sample data set x obtains in the isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.
To achieve the above purpose, an embodiment of the present invention further provides a data model training system, comprising:
an obtaining module, configured to obtain a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples;
a recording module, configured to record the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
an analysis module, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and
an anomaly detection model training module, configured to train multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
To achieve the above purpose, an embodiment of the present invention further provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the data model training method described above when executed by the processor.
To achieve the above purpose, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, the computer program being executable by at least one processor so that the at least one processor performs the steps of the data model training method described above.
The data model training method, system, computer device, and computer-readable storage medium provided by the embodiments of the present invention train multiple isolation trees on the risk feature data corresponding to multiple preset risk feature items in multiple customer samples, so as to build a data anomaly detection model composed of those isolation trees. The resulting data anomaly detection model can effectively detect abnormal customers in policy underwriting, thereby solving the problem of abnormal data degrading the prediction accuracy of a data model, and improving prediction efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of Embodiment 1 of the data model training method of the present invention.
Fig. 2 is a flowchart of Embodiment 2 of the data model training method of the present invention.
Fig. 3 is a schematic diagram of the program modules of Embodiment 3 of the data model training system of the present invention.
Fig. 4 is a schematic diagram of the hardware structure of Embodiment 4 of the computer device of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and falls outside the protection scope claimed by the present invention.
The following embodiments are described by way of example with the computer device 2 as the executing entity.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of the data model training method of Embodiment 1 of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.
Step S100: obtain a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples.
Specifically, the sample data sets may be obtained from a variety of sources, such as an internal company database or an external third-party database; the third-party database may be a commercial database of a third-party research firm or a public database of a credit reporting system. Each sample data set may include, for the corresponding customer sample: basic customer information (gender, age, occupation, education, etc.), social security information, policy information (e.g., insured amount, insurance type, relationship to the beneficiary), family information, credit information (e.g., credit reports, overdue loan information), fund flow information (e.g., monthly fund inflow label, monthly fund outflow label), and various behavioral data, including but not limited to internet information (e.g., purchasing behavior information).
In an exemplary embodiment, step S100 further comprises performing an ETL (Extract-Transform-Load) operation on the raw data of each data source to obtain the multiple sample data sets. The details are as follows.
Step S100A: configure multiple risk feature fields of a target database according to the multiple preset risk feature items, and establish a mapping between each risk feature field and the corresponding fields in multiple data sources.
Each risk feature field may map to multiple fields in multiple data sources.
Step S100B: extract multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping.
Illustratively, when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in multiple data sources, the risk feature data from the data source with the highest weight coefficient is selected according to the preset weight coefficient of each data source.
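The weight-based selection in step S100B can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source names, the weight values, and the rule of ignoring sources that report no value are all assumptions made for the example.

```python
from typing import Any, Dict, Optional

# Hypothetical preset weight coefficients per data source (illustrative values).
SOURCE_WEIGHTS: Dict[str, float] = {
    "internal_db": 1.0,
    "credit_bureau": 0.8,
    "third_party": 0.5,
}


def pick_by_source_weight(
    candidates: Dict[str, Any],
    weights: Dict[str, float] = SOURCE_WEIGHTS,
) -> Optional[Any]:
    """Return the value reported by the highest-weighted data source.

    `candidates` maps a data-source name to the value that source holds for
    one risk feature field of one customer sample. Sources reporting None are
    ignored (an assumption); sources missing from `weights` get weight 0.
    """
    available = {src: val for src, val in candidates.items() if val is not None}
    if not available:
        return None
    best_source = max(available, key=lambda src: weights.get(src, 0.0))
    return available[best_source]


# Three sources disagree on the "Age" field of the same customer sample.
age = pick_by_source_weight(
    {"third_party": 34, "internal_db": 35, "credit_bureau": None}
)
```

Here `age` takes the value from `internal_db`, because that source carries the largest weight among the sources that actually report a value.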
Step S100C: clean out invalid raw data sets that lack multiple items of risk feature data, so as to obtain the valid sample data sets from the multiple raw data sets.
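The cleaning in step S100C might look like the sketch below. The text does not say how many missing items make a raw data set invalid, so the `max_missing` parameter (default 1, meaning records missing two or more fields are dropped) is an assumption, as are the field names.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def clean_records(
    records: List[Record], fields: List[str], max_missing: int = 1
) -> List[Record]:
    """Keep only records with at most `max_missing` empty risk feature fields."""
    kept = []
    for record in records:
        # A field counts as missing when it is absent or explicitly None.
        missing = sum(1 for field in fields if record.get(field) is None)
        if missing <= max_missing:
            kept.append(record)
    return kept


# Hypothetical raw data sets for three customer samples.
raw = [
    {"Age": 35, "Occupation": "teacher", "MonthlyInflow": 7400.0},
    {"Age": None, "Occupation": None, "MonthlyInflow": 5200.0},  # two gaps
    {"Age": 41, "Occupation": None, "MonthlyInflow": 9100.0},    # one gap
]
valid = clean_records(raw, ["Age", "Occupation", "MonthlyInflow"])
```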
Step S102: record the multiple sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields.
Each customer sample corresponds to one sample data set, each sample data set contains multiple sample data, and each sample datum is filled into its corresponding risk feature field. The sample data of the M sample data sets thus form N feature columns, which correspond to the N preset risk feature items. For example:
a_{11}, a_{21}, a_{31}, ..., a_{M1} are filled into the risk feature field of one field name and form one risk feature column; a_{12}, a_{22}, a_{32}, ..., a_{M2} are filled into the risk feature field of another field name and form another risk feature column; and so on, until a_{1N}, a_{2N}, a_{3N}, ..., a_{MN} are filled into the risk feature field of the last field name and form the N-th risk feature column.
Illustratively, take the first risk feature field, whose field name is "Age": a_{11} is the age of the first customer sample, a_{21} is the age of the second customer sample, a_{31} is the age of the third customer sample, ..., and a_{M1} is the age of the M-th customer sample. The M ages of the M customer samples thus form one risk feature column based on age information.
Each customer sample includes N preset risk feature items, such as "age" and "fund inflow data".
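The arrangement of M sample data sets into N risk feature columns amounts to a transpose, as the following sketch shows. The field names `Age` and `MonthlyInflow` and the values are hypothetical.

```python
from typing import Dict, List


def build_feature_columns(
    samples: List[Dict[str, float]], fields: List[str]
) -> Dict[str, List[float]]:
    """Transpose M sample data sets into N risk feature columns of length M."""
    return {field: [sample[field] for sample in samples] for field in fields}


# M = 3 customer samples, N = 2 preset risk feature items.
samples = [
    {"Age": 23, "MonthlyInflow": 5200.0},
    {"Age": 41, "MonthlyInflow": 9100.0},
    {"Age": 35, "MonthlyInflow": 7400.0},
]
columns = build_feature_columns(samples, ["Age", "MonthlyInflow"])
```

Each entry of `columns` plays the role of one column (a_{1j}, a_{2j}, ..., a_{Mj}) in the notation above.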
Step S104: analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item.
Illustratively, the WoE of each preset risk feature item is calculated as follows:
$$\mathrm{WoE}_i = \ln\left(\frac{Py_i}{Pz_i}\right)$$
where WoE_i is the influence coefficient of the values of preset risk feature item i on the policy underwriting risk assessment result; Py_i is the ratio of the number of high-risk underwriting cases in each value interval of preset risk feature item i to the number of high-risk underwriting cases over all intervals; Pz_i is the ratio of the number of non-high-risk underwriting cases in each value interval of preset risk feature item i to the number of non-high-risk underwriting cases over all intervals; and 1 ≤ i ≤ N.
Take the WoE of the preset risk feature item (customer age) corresponding to the risk feature column (a_{11}, a_{21}, a_{31}, ..., a_{M1}) as an example. WoE (Weight of Evidence) is a way of discretizing numerical values. Py_i is obtained after the feature column is discretized: it is the ratio of the number of high-risk underwriting cases in each age interval to the number of high-risk underwriting cases over all age intervals. That is, from the age data a_{11}, a_{21}, a_{31}, ..., a_{M1}, count the customer samples with a high-risk underwriting result in a given age interval (e.g., 20 to 25 years old), count the customer samples with a high-risk underwriting result over all age intervals, and take the ratio of the former to the latter. Pz_i is the ratio of the number of non-high-risk underwriting cases in each age interval to the number of non-high-risk underwriting cases over all age intervals. That is, from the same age data, count the customer samples with a non-high-risk underwriting result in the given age interval (e.g., 20 to 25 years old), count the customer samples with a non-high-risk underwriting result over all age intervals, and take the ratio of the former to the latter.
The value of WoE_i is computed from the distribution of customers across the sub-intervals. The WoE_i of each sub-interval reflects the difference between the ratio of high-risk to non-high-risk underwriting cases within that sub-interval and the same ratio over the whole population.
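As a rough illustration of the age example, the sketch below computes one WoE value per age interval using the standard definition WoE = ln(Py / Pz). The bin edges, the sample data, and the additive smoothing term `eps` (used only to avoid a logarithm of zero when an interval has no cases of one kind) are assumptions that do not come from the text.

```python
import math
from typing import Dict, Sequence, Tuple


def woe_per_bin(
    values: Sequence[float],
    high_risk: Sequence[bool],
    bin_edges: Sequence[float],
    eps: float = 0.5,
) -> Dict[Tuple[float, float], float]:
    """Compute WoE = ln(Py / Pz) for each half-open interval [lo, hi).

    Py: share of all high-risk cases that fall in the interval.
    Pz: share of all non-high-risk cases that fall in the interval.
    `eps` is a smoothing constant (an assumption, not from the source).
    """
    n_bins = len(bin_edges) - 1
    high = [0] * n_bins
    low = [0] * n_bins
    for value, is_high in zip(values, high_risk):
        for k in range(n_bins):
            if bin_edges[k] <= value < bin_edges[k + 1]:
                if is_high:
                    high[k] += 1
                else:
                    low[k] += 1
                break
    total_high, total_low = sum(high), sum(low)
    woe = {}
    for k in range(n_bins):
        py = (high[k] + eps) / (total_high + eps * n_bins)
        pz = (low[k] + eps) / (total_low + eps * n_bins)
        woe[(bin_edges[k], bin_edges[k + 1])] = math.log(py / pz)
    return woe


# Hypothetical age column and high-risk underwriting flags for 8 customers.
ages = [22, 23, 24, 30, 31, 32, 33, 40]
flags = [True, True, False, False, False, True, False, False]
age_woe = woe_per_bin(ages, flags, bin_edges=[20, 25, 35, 45])
```

A positive WoE marks an interval where high-risk cases are over-represented relative to non-high-risk cases; a negative WoE marks the opposite.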
Step S106: train multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
An isolation forest (iForest) consists of multiple trees, each of which is called an isolation tree (iTree). Let T be an isolation tree: T is either an external node with no child nodes, or an internal node with two child nodes (T_l, T_r). An internal node specifies a preset risk feature item and a corresponding value (that is, the split value of the preset risk feature item); the split value lies between the minimum and maximum of the corresponding preset risk feature item, and divides the sample data sets into T_l and T_r.
In an exemplary embodiment, step S106 further comprises:
Step S106A: sample with replacement from the multiple customer samples to obtain multiple customer sample subsets.
Step S106B, the isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoEs; (3) split the customer samples in the selected customer sample subset; (4) repeat steps (2) to (4) until the isolation tree reaches the set height limit.
Step S106B is repeated to obtain multiple isolation trees, which together form the data anomaly detection model.
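Steps S106A and S106B can be sketched as below. The text does not specify how the WoEs guide the choice of feature, so this sketch assumes each feature carries a positive weight derived from its WoEs (the `feature_weights` argument, a hypothetical name) and is drawn with probability proportional to that weight. The height limit of ceil(log2(subset size)) follows the usual isolation forest convention and is likewise an assumption.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

Sample = Dict[str, float]


@dataclass
class Node:
    size: int                       # number of samples that reached this node
    feature: Optional[str] = None   # split feature; None marks a leaf
    split: Optional[float] = None   # split value between the feature's min and max
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(samples: List[Sample], feature_weights: Dict[str, float],
               height_limit: int, rng: random.Random, depth: int = 0) -> Node:
    if depth >= height_limit or len(samples) <= 1:
        return Node(size=len(samples))
    # Only features that still vary in this node can separate samples.
    candidates = [f for f in feature_weights
                  if min(s[f] for s in samples) < max(s[f] for s in samples)]
    if not candidates:
        return Node(size=len(samples))
    # Assumption: draw the feature with probability proportional to its weight.
    feature = rng.choices(candidates,
                          weights=[feature_weights[f] for f in candidates], k=1)[0]
    lo = min(s[feature] for s in samples)
    hi = max(s[feature] for s in samples)
    split = rng.uniform(lo, hi)
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return Node(size=len(samples), feature=feature, split=split,
                left=build_tree(left, feature_weights, height_limit, rng, depth + 1),
                right=build_tree(right, feature_weights, height_limit, rng, depth + 1))


def build_forest(samples: List[Sample], feature_weights: Dict[str, float],
                 n_trees: int = 100, subsample_size: int = 256,
                 seed: int = 0) -> List[Node]:
    rng = random.Random(seed)
    height_limit = max(1, math.ceil(math.log2(subsample_size)))
    forest = []
    for _ in range(n_trees):
        subset = rng.choices(samples, k=subsample_size)  # sampling with replacement
        forest.append(build_tree(subset, feature_weights, height_limit, rng))
    return forest


def tree_height(node: Node) -> int:
    """Number of edges on the longest root-to-leaf path."""
    if node.feature is None:
        return 0
    return 1 + max(tree_height(node.left), tree_height(node.right))


# Hypothetical data: 8 customer samples, two features with WoE-derived weights.
customers = [{"Age": float(a), "MonthlyInflow": float(m)} for a, m in
             [(23, 5200), (41, 9100), (35, 7400), (29, 6100),
              (52, 8800), (38, 7000), (44, 99000), (31, 6500)]]
forest = build_forest(customers, {"Age": 1.2, "MonthlyInflow": 0.4},
                      n_trees=5, subsample_size=8, seed=42)
```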
Embodiment two
Embodiment 2 uses the data anomaly detection model trained in Embodiment 1 to train a policy underwriting risk assessment model.
Referring to Fig. 2, a flowchart of the steps of the data model training method of Embodiment 2 of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. The details are as follows.
Step S200: obtain a second training data set, the second training data set comprising multiple target sample data sets of multiple target customer samples.
Step S202: input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model.
The target sample data set is passed through each isolation tree. Starting at the root node of an isolation tree, it moves from the root toward a leaf according to the preset risk feature item and the split value that were selected at each node when the tree was built: if the value of the target sample data set for the node's preset risk feature item is less than the node's split value, the target sample data set descends into the left subtree; otherwise it descends into the right subtree. This continues until a leaf node is reached. The number of edges traversed along the way is recorded, giving the path length in that isolation tree. The anomaly coefficient of the target sample data set is then calculated from its path lengths in all the isolation trees.
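The traversal just described can be sketched as follows. The `Node` layout and the hand-built two-level tree are illustrative assumptions; only the left-if-less-than rule and the edge counting come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    feature: Optional[str] = None   # None marks a leaf node
    split: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def path_length(sample: Dict[str, float], root: Node) -> int:
    """Count the edges traversed from the root to the leaf the sample reaches."""
    edges = 0
    node = root
    while node.feature is not None:
        # Values below the split value go left; all others go right.
        node = node.left if sample[node.feature] < node.split else node.right
        edges += 1
    return edges


# Hand-built isolation tree: split on Age, then on MonthlyInflow on the right.
tree = Node("Age", 30.0,
            left=Node(),
            right=Node("MonthlyInflow", 8000.0, left=Node(), right=Node()))

short_path = path_length({"Age": 25.0, "MonthlyInflow": 1000.0}, tree)
long_path = path_length({"Age": 40.0, "MonthlyInflow": 9000.0}, tree)
```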
Illustratively, n target customer samples are randomly selected from M given customer samples; the n target customer samples correspond to n target sample data sets, and the anomaly coefficient of each target sample data set is calculated by the following formulas:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(k) = \ln(k) + \xi$$
where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient that target sample data set x obtains in the isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.
The closer s(x, n) is to 1, the higher the probability that the target sample data set is an isolated anomaly; s(x, n) below 0.5 indicates that the target sample data set is a normal data set.
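The anomaly coefficient can be evaluated numerically as in the sketch below, which uses the logarithmic approximation H(k) = ln(k) + ξ with ξ taken as the Euler-Mascheroni constant. The function names are hypothetical, and the restriction to n ≥ 2 is an assumption added so that ln(n - 1) is defined.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649015329  # Euler's constant, written as xi in the text


def harmonic(k: float) -> float:
    """Approximate harmonic number H(k) = ln(k) + xi."""
    return math.log(k) + EULER_GAMMA


def average_path_length(n: int) -> float:
    """c(n) = 2 H(n - 1) - 2 (n - 1) / n, defined here for n >= 2."""
    if n < 2:
        raise ValueError("c(n) requires n >= 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_coefficient(path_lengths: Sequence[float], n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), with E(h(x)) the mean over the trees."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / average_path_length(n))


# Path lengths of one target sample data set in three isolation trees, n = 256.
score_isolated = anomaly_coefficient([2, 3, 2], n=256)     # short paths
score_typical = anomaly_coefficient([11, 12, 10], n=256)   # long paths
```

Short paths push the coefficient toward 1, while paths longer than c(n) push it below 0.5, which matches the interpretation given above.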
Step S204: screen the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set.
The anomaly coefficient of every target sample data set in the third training data set is greater than a preset threshold.
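The screening in step S204 reduces to a comparison against the preset threshold. The sketch below follows the comparison direction stated in the text (coefficient greater than the threshold); the sample identifiers and the threshold value 0.6 are hypothetical.

```python
from typing import Dict, List


def screen_by_coefficient(coefficients: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of target sample data sets whose coefficient exceeds the threshold."""
    return [sample_id for sample_id, value in coefficients.items() if value > threshold]


# Hypothetical anomaly coefficients produced by the data anomaly detection model.
coefficients = {"cust_001": 0.42, "cust_002": 0.81, "cust_003": 0.60, "cust_004": 0.67}
third_set_ids = screen_by_coefficient(coefficients, threshold=0.6)
```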
Step S206: train a policy underwriting risk assessment model according to the third training data set.
The policy underwriting risk assessment model may be an LR (Logistic Regression) model, an FM (Factorization Machine) model, a deep neural network model, or a combination of these models.
Embodiment three
Referring now to Fig. 3, a schematic diagram of the program modules of Embodiment 3 of the data model training system of the present invention is shown. In this embodiment, the data model training system 20 may include, or be divided into, one or more program modules. The one or more program modules are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the data model training method described above. A program module, as the term is used in the embodiments of the present invention, is a series of computer program instruction segments capable of performing a specific function, and is more suitable than the program itself for describing the execution of the data model training system 20 in the storage medium. The functions of the program modules of this embodiment are described in detail below:
An obtaining module 200, configured to obtain a first training data set, the first training data set comprising multiple sample data sets of multiple customer samples.
In an exemplary embodiment, the obtaining module 200 is configured to: configure multiple risk feature fields of a target database according to multiple preset risk feature items, and establish a mapping between each risk feature field and the corresponding fields in multiple data sources; extract multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping; and clean out invalid raw data sets that lack multiple items of risk feature data, so as to obtain the valid sample data sets from the multiple raw data sets.
Extracting multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping comprises: when the same risk feature field of the same customer sample corresponds to multiple items of risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
A recording module 202, configured to record the multiple sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields.
In the exemplary embodiment,
An analysis module 204, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item.
In an exemplary embodiment, the WoE of each preset risk feature item is calculated as follows:
$$\mathrm{WoE}_i = \ln\left(\frac{Py_i}{Pz_i}\right)$$
where WoE_i is the influence coefficient of the values of preset risk feature item i on the policy underwriting risk assessment result; Py_i is the ratio of the number of high-risk underwriting cases in each value interval of preset risk feature item i to the number of high-risk underwriting cases over all intervals; and Pz_i is the ratio of the number of non-high-risk underwriting cases in each value interval of preset risk feature item i to the number of non-high-risk underwriting cases over all intervals.
Abnormality detection model training module 206, for passing through the multiple sample data set and the multiple default risk Characteristic item more isolated trees of corresponding multiple WoE training, to obtain data exception detection model.
In an exemplary embodiment, the anomaly detection model training module 206 is further configured to: obtain multiple customer sample subsets from the multiple customer samples by sampling with replacement; perform an isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoEs; (3) partition the customer samples in the selected customer sample subset accordingly; (4) repeat steps (2) and (3) until the isolation tree reaches the set height limit; and repeat the isolation tree construction step to obtain multiple isolation trees, which are combined into the data anomaly detection model.
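The construction step can be sketched as below. The text does not say exactly how the WoEs guide feature selection; this sketch assumes one WoE value per feature and picks features with probability proportional to |WoE|. All function names, parameters, and defaults are hypothetical.

```python
import random
from typing import Dict, List

Sample = Dict[str, float]


def build_tree(samples: List[Sample], woe: Dict[str, float],
               height_limit: int, rng: random.Random, depth: int = 0) -> dict:
    """Recursively partition samples until isolated or the height limit is hit."""
    if depth >= height_limit or len(samples) <= 1:
        return {"size": len(samples)}  # leaf node
    # Assumption: features with larger |WoE| are selected more often.
    features = list(woe)
    weights = [abs(woe[f]) + 1e-9 for f in features]
    feature = rng.choices(features, weights=weights, k=1)[0]
    lo = min(s[feature] for s in samples)
    hi = max(s[feature] for s in samples)
    if lo == hi:
        return {"size": len(samples)}  # constant feature, cannot split
    split = rng.uniform(lo, hi)
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, woe, height_limit, rng, depth + 1),
        "right": build_tree(right, woe, height_limit, rng, depth + 1),
    }


def build_forest(customers: List[Sample], woe: Dict[str, float], n_trees: int = 100,
                 subsample: int = 256, height_limit: int = 8, seed: int = 0) -> List[dict]:
    """Repeat the construction step to obtain multiple isolation trees."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Sampling with replacement, as described in the text.
        subset = rng.choices(customers, k=min(subsample, len(customers)))
        forest.append(build_tree(subset, woe, height_limit, rng))
    return forest


def path_length(tree: dict, x: Sample, depth: int = 0) -> int:
    """Number of edges from the root to the leaf that x falls into."""
    if "feature" not in tree:
        return depth
    branch = "left" if x[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)
```

This simplified `path_length` counts edges only and does not add the usual correction for leaves that still contain several samples.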
In an exemplary embodiment, the system 20 further includes a risk assessment model training module 208, configured to: obtain a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples; input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model; screen the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, in which the anomaly coefficient of every target sample data set is greater than a preset threshold; and train a policy underwriting risk assessment model according to the third training data set.
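The screening step amounts to a threshold filter. A minimal sketch follows; the function name and the scoring callable are hypothetical, and the direction of the comparison (keep coefficients greater than the threshold) follows the wording above.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_third_training_set(second_set: Sequence[T],
                              anomaly_coefficient: Callable[[T], float],
                              threshold: float) -> List[T]:
    """Keep the target sample data sets whose anomaly coefficient exceeds threshold."""
    return [x for x in second_set if anomaly_coefficient(x) > threshold]
```

Passing the scorer in as a callable keeps the filter independent of how the data anomaly detection model is implemented.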
The anomaly coefficient of each target sample data set is calculated by the following formulas:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(k) = \ln(k) + \xi$$
Here x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.
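A minimal numeric sketch of these quantities, assuming the standard isolation forest score s(x, n) = 2^(-E(h(x)) / c(n)). The special cases for n <= 2 follow common implementations and are an assumption; the function names are illustrative.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649015329  # Euler's constant, the ξ in H(k) = ln(k) + ξ


def harmonic(k: float) -> float:
    """Approximate harmonic number H(k) = ln(k) + ξ."""
    return math.log(k) + EULER_GAMMA


def c(n: int) -> float:
    """Average path length of an isolation tree built from n samples."""
    if n <= 1:
        return 0.0  # assumed convention for degenerate trees
    if n == 2:
        return 1.0  # assumed convention used by common implementations
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_coefficient(path_lengths: Sequence[float], n: int) -> float:
    """s(x, n) computed from the path lengths h(x) of x in each isolation tree."""
    if n < 2:
        raise ValueError("n must be at least 2 to normalise the path length")
    mean_h = sum(path_lengths) / len(path_lengths)  # E(h(x))
    return 2.0 ** (-mean_h / c(n))
```

A sample whose mean path length equals c(n) scores 0.5; shorter paths push the score toward 1, which marks the sample as more anomalous.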
Embodiment four
Refer to Fig. 4, which is a schematic diagram of the hardware architecture of the computer device according to Embodiment Four of the present invention. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers), and so on. As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, a network interface 23, and a data model training system 20, which can communicate with one another through a system bus. Wherein:
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and so on. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as the hard disk or internal memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, smart media card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card, or flash card (Flash Card) fitted to the computer device 2. Of course, the memory 21 may also include both an internal storage unit of the computer device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer device 2, for example the program code of the data model training system 20 of Embodiment Three. In addition, the memory 21 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the data model training system 20, so as to implement the data model training method of Embodiment One or Two.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet (Intranet), the Internet (Internet), the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), or Wi-Fi.
It should be noted that Fig. 4 shows only the computer device 2 with components 20-23; it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
In this embodiment, the data model training system 20 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention.
For example, Fig. 3 shows a schematic diagram of the program modules implementing Embodiment Three of the data model training system 20. In that embodiment, the data model training system 20 may be divided into the acquisition module 200, the recording module 202, the analysis module 204, the anomaly detection model training module 206, and the risk assessment model training module 208. A program module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the data model training system 20 in the computer device 2. The specific functions of the program modules 200-208 are described in detail in Embodiment Three and are not repeated here.
Embodiment five
This embodiment also provides a computer-readable storage medium, such as a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, or app store, on which a computer program is stored; the corresponding functions are implemented when the program is executed by a processor. The computer-readable storage medium of this embodiment is used to store the data model training system 20, which, when executed by a processor, implements the data model training method of Embodiment One or Two.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A data model training method, characterized in that the method includes:
Obtaining a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
Recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
Analyzing the risk feature column corresponding to each risk feature field, to obtain the WoE of each preset risk feature item; and
Training multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
2. The data model training method according to claim 1, characterized in that the step of obtaining the first training data set includes:
Configuring multiple risk feature fields of a target database according to the multiple preset risk feature items, and establishing a mapping between each risk feature field and the corresponding field in multiple data sources;
Extracting, according to the mapping, multiple raw data sets of the multiple customer samples from the multiple data sources; and
Cleaning out invalid raw data sets that lack multiple risk feature data, so as to obtain the multiple valid sample data sets from the multiple raw data sets.
3. The data model training method according to claim 2, characterized in that the step of extracting, according to the mapping, the multiple raw data sets of the multiple customer samples from the multiple data sources includes:
When the same risk feature field of the same customer sample corresponds to multiple risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
4. The data model training method according to claim 3, characterized in that the formula for calculating the WoE of each preset risk feature item is as follows:
WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk assessment result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in each value interval to the number of high-risk underwriting cases across all intervals; Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in each value interval to the number of non-high-risk underwriting cases across all intervals.
5. The data model training method according to claim 4, characterized in that the step of training multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain the data anomaly detection model, includes:
Obtaining multiple customer sample subsets from the multiple customer samples by sampling with replacement;
An isolation tree construction step: (1) selecting one customer sample subset from the multiple customer sample subsets; (2) selecting one preset risk feature item and a corresponding value according to the multiple WoEs; (3) partitioning the customer samples in the selected customer sample subset accordingly; (4) repeating steps (2) and (3) until the isolation tree reaches the set height limit;
Repeating the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees being combined into the data anomaly detection model.
6. The data model training method according to claim 5, characterized by further including:
Obtaining a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples;
Inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;
Screening the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, in which the anomaly coefficient of every target sample data set is greater than a preset threshold;
Training a policy underwriting risk assessment model according to the third training data set.
7. The data model training method according to claim 6, characterized in that the anomaly coefficient of each target sample data set is calculated by the following formulas:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(k) = \ln(k) + \xi$$
Here x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; and H(k) is the harmonic number.
8. A data model training system, characterized by including:
An acquisition module, configured to obtain a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
A recording module, configured to record the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
An analysis module, configured to analyze the risk feature column corresponding to each risk feature field, to obtain the WoE of each preset risk feature item; and
An anomaly detection model training module, configured to train multiple isolation trees using the multiple sample data sets and the multiple WoEs corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when executed by the processor, the computer program implements the steps of the data model training method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, the computer program being executable by at least one processor so that the at least one processor performs the steps of the data model training method according to any one of claims 1 to 7.
CN201910665010.XA 2019-07-23 2019-07-23 Data model training method, system and computer equipment Pending CN110517154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910665010.XA CN110517154A (en) 2019-07-23 2019-07-23 Data model training method, system and computer equipment

Publications (1)

Publication Number Publication Date
CN110517154A true CN110517154A (en) 2019-11-29

Family

ID=68623317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910665010.XA Pending CN110517154A (en) 2019-07-23 2019-07-23 Data model training method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN110517154A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090685A (en) * 2019-12-19 2020-05-01 第四范式(北京)技术有限公司 Method and device for detecting data abnormal characteristics
CN112560938A (en) * 2020-12-11 2021-03-26 上海哔哩哔哩科技有限公司 Model training method and device and computer equipment
CN113065610A (en) * 2019-12-12 2021-07-02 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning
CN113516513A (en) * 2021-07-20 2021-10-19 重庆度小满优扬科技有限公司 Data analysis method and device, computer equipment and storage medium
CN114219026A (en) * 2021-12-15 2022-03-22 中兴通讯股份有限公司 Data processing method and device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868878A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for MAC (Media Access Control) address risk identification
CN108549954A (en) * 2018-03-26 2018-09-18 平安科技(深圳)有限公司 Risk model training method, risk identification method, device, equipment and medium
CN108985632A (en) * 2018-07-16 2018-12-11 国网上海市电力公司 A kind of electricity consumption data abnormality detection model based on isolated forest algorithm
CN109345137A (en) * 2018-10-22 2019-02-15 广东精点数据科技股份有限公司 A kind of rejecting outliers method based on agriculture big data
CN109948738A (en) * 2019-04-11 2019-06-28 合肥工业大学 Energy consumption method for detecting abnormality, the apparatus and system of coating drying room

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065610A (en) * 2019-12-12 2021-07-02 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning
CN113065610B (en) * 2019-12-12 2022-05-17 支付宝(杭州)信息技术有限公司 Isolated forest model construction and prediction method and device based on federal learning
CN111090685A (en) * 2019-12-19 2020-05-01 第四范式(北京)技术有限公司 Method and device for detecting data abnormal characteristics
CN111090685B (en) * 2019-12-19 2023-08-22 第四范式(北京)技术有限公司 Method and device for detecting abnormal characteristics of data
CN112560938A (en) * 2020-12-11 2021-03-26 上海哔哩哔哩科技有限公司 Model training method and device and computer equipment
CN112560938B (en) * 2020-12-11 2023-08-25 上海哔哩哔哩科技有限公司 Model training method and device and computer equipment
CN113516513A (en) * 2021-07-20 2021-10-19 重庆度小满优扬科技有限公司 Data analysis method and device, computer equipment and storage medium
CN114219026A (en) * 2021-12-15 2022-03-22 中兴通讯股份有限公司 Data processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191129
