CN110517154A - Data model training method, system and computer equipment - Google Patents
Data model training method, system and computer equipment
- Publication number
- CN110517154A (application CN201910665010.XA)
- Authority
- CN
- China
- Prior art keywords
- risk
- feature
- data
- data set
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2433—Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/08—Insurance
Abstract
An embodiment of the present invention provides a data model training method, comprising: obtaining a first training data set, the first training data set including multiple sample data sets of multiple customer samples; recording the sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the risk feature fields; analyzing the risk feature column corresponding to each risk feature field to obtain the WoE (Weight of Evidence) of each preset risk feature item; and training multiple isolation trees using the sample data sets and the WoE values corresponding to the preset risk feature items, to obtain a data anomaly detection model. The data anomaly detection model trained in this embodiment can effectively detect abnormal customers in policy underwriting, thereby addressing the problem that some abnormal data degrades the prediction accuracy of a data model.
Description
Technical field
Embodiments of the present invention relate to the field of computer data processing, and in particular to a data model training method, system, computer equipment, and computer-readable storage medium.
Background
As public awareness of insurance grows, commercial insurance has become an important part of the social security system. According to available data, some insurers hold policies numbering in the tens of millions. After a policy is generated in the insurance system, it must be underwritten to determine whether the information in the policy meets the insurability requirements. Today, underwriting is usually performed manually. With the rapid development of big data mining, more and more reference data is available for underwriting, and the industry has begun to build data models on big data and to underwrite policies with them.
However, little reference data is available for new customers; it may be limited to what appears on the insurance application. For a data model, more data and more analysis dimensions generally mean higher prediction accuracy. It is therefore necessary to collect as much customer data as possible from each channel. Too many channels, however, tend to lower data quality: some of the data is abnormal, and this abnormal data may affect the prediction accuracy of the data model. How to improve data quality, and thereby the prediction accuracy and prediction efficiency of the data model, is therefore a technical problem that urgently needs to be solved.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a data model training method, system, computer equipment, and computer-readable storage medium that can solve the problem of abnormal data degrading the prediction accuracy of a data model.
To achieve the above purpose, an embodiment of the present invention provides a data model training method, comprising the following steps:
obtaining a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
recording the sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the risk feature fields;
analyzing the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and
training multiple isolation trees using the sample data sets and the WoE values corresponding to the preset risk feature items, to obtain a data anomaly detection model.
Preferably, the step of obtaining the first training data set comprises:
configuring multiple risk feature fields of a target database according to the multiple preset risk feature items, and establishing a mapping between each risk feature field and the corresponding fields in multiple data sources;
extracting, according to the mapping, multiple raw data sets of multiple customer samples from the data sources; and
cleaning out invalid raw data sets that lack multiple items of risk feature data, so as to obtain the valid sample data sets from the raw data sets.
Preferably, the step of extracting, according to the mapping, multiple raw data sets of multiple customer samples from the data sources comprises:
when the same risk feature field of the same customer sample has multiple items of risk feature data across the data sources, selecting, according to a preset weight coefficient for each data source, the risk feature data from the data source with the highest weight coefficient.
Preferably, the WoE of each preset risk feature item is calculated as follows:
WoE_i = \ln(Py_i / Pz_i)
where WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk assessment result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in a given value interval to the number of high-risk underwriting cases in all intervals; and Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in a given value interval to the number of non-high-risk underwriting cases in all intervals.
Preferably, the step of training multiple isolation trees using the sample data sets and the WoE values corresponding to the preset risk feature items, to obtain the data anomaly detection model, comprises:
sampling with replacement from the multiple customer samples to obtain multiple customer sample subsets;
an isolation tree construction step: (1) selecting one of the customer sample subsets; (2) selecting one of the preset risk feature items and a corresponding value according to the WoE values; (3) splitting the customer samples in the selected subset; (4) repeating steps (2) and (3) until the isolation tree reaches the set height limit; and
repeating the isolation tree construction step to obtain multiple isolation trees, which are combined into the data anomaly detection model.
Preferably, the method further comprises:
obtaining a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples;
inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;
screening the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, in which the anomaly coefficient of every target sample data set is greater than a preset threshold; and
training a policy underwriting risk assessment model according to the third training data set.
Preferably, the anomaly coefficient of each target sample data set is calculated by the following formulas:
s(x, n) = 2^{-E(h(x)) / c(n)}
c(n) = 2H(n-1) - 2(n-1)/n
H(k) = \ln(k) + \xi
where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; \xi is Euler's constant; and H(k) is the harmonic number.
To achieve the above purpose, an embodiment of the present invention further provides a data model training system, comprising:
an obtaining module, configured to obtain a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
a recording module, configured to record the sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the risk feature fields;
an analysis module, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and
an anomaly detection model training module, configured to train multiple isolation trees using the sample data sets and the WoE values corresponding to the preset risk feature items, to obtain a data anomaly detection model.
To achieve the above purpose, an embodiment of the present invention further provides computer equipment comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data model training method described above.
To achieve the above purpose, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program that can be executed by at least one processor, so that the at least one processor performs the steps of the data model training method described above.
The data model training method, system, computer equipment, and computer-readable storage medium provided by the embodiments of the present invention train multiple isolation trees on the risk feature data corresponding to multiple preset risk feature items in multiple customer samples, so as to build a data anomaly detection model composed of those isolation trees. The resulting data anomaly detection model can effectively detect abnormal customers in policy underwriting, which addresses the problem of abnormal data degrading the prediction accuracy of a data model and improves prediction efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of embodiment one of the data model training method of the present invention.
Fig. 2 is a flowchart of embodiment two of the data model training method of the present invention.
Fig. 3 is a schematic diagram of the program modules of embodiment three of the data model training system of the present invention.
Fig. 4 is a schematic diagram of the hardware structure of embodiment four of the computer equipment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", and the like are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only where the combination can be implemented by those of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, that combination should be deemed not to exist and falls outside the protection scope claimed by the present invention.
The following embodiments are described by way of example with computer equipment 2 as the executing entity.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of the data model training method of embodiment one of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are performed. The details are as follows.
Step S100: obtain a first training data set, the first training data set including multiple sample data sets of multiple customer samples.
Specifically, the sample data sets may be obtained from a variety of sources, such as the company's internal database or an external third-party database. The third-party database may be the commercial database of a third-party research firm or the public database of a credit reporting system. Each sample data set may include, for the corresponding customer sample: basic customer information (gender, age, occupation, education, and so on), social security information, policy information (for example, the insured amount, the type of insurance, and the relationship to the beneficiary), family information, credit information (for example, credit records and overdue loan records), fund flow information (for example, monthly fund inflow and monthly fund outflow labels), and various kinds of behavioral data, including but not limited to internet information (for example, purchasing behavior).
In an exemplary embodiment, step S100 further comprises: performing ETL (Extract-Transform-Load, that is, extracting, transforming, and loading data) operations on the raw data of each data source to obtain the sample data sets. The details are as follows.
Step S100A: configure multiple risk feature fields of a target database according to the multiple preset risk feature items, and establish a mapping between each risk feature field and the corresponding fields in multiple data sources.
Each risk feature field may map to multiple fields in multiple data sources.
Step S100B: according to the mapping, extract multiple raw data sets of multiple customer samples from the data sources.
For example, when the same risk feature field of the same customer sample has multiple items of risk feature data across the data sources, the risk feature data from the data source with the highest weight coefficient is selected according to the preset weight coefficient of each data source.
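The weight-based selection in step S100B can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `pick_by_source_weight`, the source names, and the weight values are all hypothetical.

```python
def pick_by_source_weight(candidates, source_weights):
    """Return the value reported by the data source with the highest preset weight.

    candidates: dict mapping data source name -> value for one risk feature field
    source_weights: dict mapping data source name -> preset weight coefficient
    Sources without a configured weight are treated as weight 0.0 (an assumption).
    """
    if not candidates:
        return None
    best_source = max(candidates, key=lambda name: source_weights.get(name, 0.0))
    return candidates[best_source]


# Hypothetical example: three sources disagree on one customer's age.
age_candidates = {"internal_db": 34, "credit_bureau": 35, "third_party": 33}
preset_weights = {"internal_db": 0.9, "credit_bureau": 0.7, "third_party": 0.4}
selected_age = pick_by_source_weight(age_candidates, preset_weights)
```

Here the internal database has the highest weight, so its value is the one kept for the field.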
Step S100C: clean out invalid raw data sets that lack multiple items of risk feature data, so as to obtain the valid sample data sets from the raw data sets.
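The cleaning in step S100C might look like the sketch below. The text does not say how many missing items make a raw data set invalid, so the `max_missing` threshold, the function name, and the field names are assumptions made for illustration.

```python
def clean_raw_sets(raw_sets, feature_fields, max_missing=1):
    """Keep raw data sets that are missing at most `max_missing` risk feature values.

    raw_sets: list of dicts, one per customer sample
    feature_fields: names of the risk feature fields to check
    max_missing: hypothetical tolerance; sets missing more fields are dropped
    """
    kept = []
    for record in raw_sets:
        missing = sum(1 for field in feature_fields if record.get(field) is None)
        if missing <= max_missing:
            kept.append(record)
    return kept


fields = ["age", "occupation", "credit_score"]
raw = [
    {"age": 30, "occupation": "teacher", "credit_score": 710},  # complete
    {"age": 45, "occupation": None, "credit_score": 650},       # one value missing
    {"age": None, "occupation": None, "credit_score": 600},     # two values missing
]
valid_sets = clean_raw_sets(raw, fields)
```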
Step S102: record the sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the risk feature fields.
Each customer sample corresponds to one sample data set, and each sample data set includes multiple sample data items. Each item is filled into its corresponding risk feature field, so that the sample data of all the sample data sets form N feature columns corresponding to the N preset risk feature items. For example:
a_11, a_21, a_31, ..., a_M1 are filled into the risk feature field of one field name and form one risk feature column; a_12, a_22, a_32, ..., a_M2 are filled into the risk feature field of another field name and form another risk feature column; and so on, until a_1N, a_2N, a_3N, ..., a_MN are filled into the risk feature field of the last field name and form the last risk feature column.
For example, taking the first risk feature field, named "Age": a_11 is the age of the first customer sample, a_21 is the age of the second customer sample, a_31 is the age of the third customer sample, and a_M1 is the age of the M-th customer sample. The M ages of the M customer samples thus form one risk feature column based on age information.
Each customer sample includes N preset risk feature items, such as "age" and "fund inflow data".
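The step from M sample data sets to N risk feature columns amounts to transposing an M x N table. A minimal sketch, with hypothetical field names and values:

```python
def to_feature_columns(sample_sets, feature_fields):
    """Turn M per-customer records into N risk feature columns.

    Column j holds a_1j, a_2j, ..., a_Mj, in the order of the customer samples.
    """
    return {field: [sample[field] for sample in sample_sets] for field in feature_fields}


# Hypothetical data: M = 3 customer samples, N = 2 preset risk feature items.
samples = [
    {"Age": 23, "MonthlyInflow": 8000},
    {"Age": 41, "MonthlyInflow": 15000},
    {"Age": 35, "MonthlyInflow": 12000},
]
columns = to_feature_columns(samples, ["Age", "MonthlyInflow"])
```

The "Age" entry of `columns` is then the age-based risk feature column described above.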
Step S104: analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item.
For example, the WoE of each preset risk feature item is calculated as follows:
WoE_i = \ln(Py_i / Pz_i)
where WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk assessment result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in a given value interval to the number of high-risk underwriting cases in all intervals; Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in a given value interval to the number of non-high-risk underwriting cases in all intervals; and 1 ≤ i ≤ N.
Take the calculation of the WoE of the preset risk feature item (customer age) corresponding to the risk feature column (a_11, a_21, a_31, ..., a_M1) as an example. WoE (Weight of Evidence) is a way of discretizing numerical values. After the feature column is discretized, Py_i denotes the ratio of the number of high-risk underwriting cases in each age interval to the number of high-risk underwriting cases across all age intervals. That is, from the age data a_11, a_21, a_31, ..., a_M1, count the customer samples underwritten as high risk in a specified age interval (for example, 20 to 25 years old), count the customer samples underwritten as high risk across all age intervals, and take the ratio of the former to the latter. Pz_i denotes the ratio of the number of non-high-risk underwriting cases in each age interval to the number of non-high-risk underwriting cases across all age intervals. That is, from the same age data, count the customer samples underwritten as non-high risk in the specified age interval (for example, 20 to 25 years old), count the customer samples underwritten as non-high risk across all age intervals, and take the ratio of the former to the latter.
The value of WoE_i is computed from the distribution of customers across the sub-intervals. The WoE_i of each sub-interval reflects the difference between the ratio of high-risk to non-high-risk underwriting cases in that interval and the same ratio over the whole population.
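The per-interval WoE calculation can be sketched as below, assuming the conventional form WoE = ln(Py / Pz) built from the Py and Pz ratios defined in the text. The function name, the bin edges, and the sample data are hypothetical, and returning `None` for an interval with no high-risk or no non-high-risk cases is a choice made here, not something the text specifies.

```python
import math


def woe_by_bin(values, high_risk_flags, bin_edges):
    """Compute one WoE value per half-open interval [edge_b, edge_b+1)."""
    n_bins = len(bin_edges) - 1
    high = [0] * n_bins  # high-risk underwriting counts per interval
    low = [0] * n_bins   # non-high-risk underwriting counts per interval
    for value, is_high in zip(values, high_risk_flags):
        for b in range(n_bins):
            if bin_edges[b] <= value < bin_edges[b + 1]:
                if is_high:
                    high[b] += 1
                else:
                    low[b] += 1
                break
    total_high, total_low = sum(high), sum(low)
    woe = []
    for b in range(n_bins):
        if high[b] == 0 or low[b] == 0:
            woe.append(None)  # ln is undefined here; smoothing would be another option
        else:
            py = high[b] / total_high  # share of all high-risk cases in this interval
            pz = low[b] / total_low    # share of all non-high-risk cases in this interval
            woe.append(math.log(py / pz))
    return woe


ages = [22, 23, 24, 30, 31, 32, 33, 34]
high_risk = [1, 1, 0, 1, 0, 0, 0, 0]
age_woe = woe_by_bin(ages, high_risk, [20, 25, 35])
```

In this toy data the 20 to 25 interval holds 2 of the 3 high-risk cases but only 1 of the 5 non-high-risk cases, so its WoE is positive, while the 25 to 35 interval gets a negative WoE.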
Step S106: train multiple isolation trees using the sample data sets and the WoE values corresponding to the preset risk feature items, to obtain a data anomaly detection model.
An isolation forest (iForest) is composed of multiple trees, each of which is called an isolation tree (iTree). If T is an isolation tree, T is either an external node with no child nodes or an internal node with two child nodes (Tl, Tr). A preset risk feature item and a corresponding value (that is, a split value for the preset risk feature item) can be specified; the split value lies between the maximum and minimum values of the corresponding preset risk feature item, and the sample data sets are divided into Tl and Tr by the split value.
In an exemplary embodiment, step S106 further comprises:
Step S106A: sample with replacement from the multiple customer samples to obtain multiple customer sample subsets.
Step S106B, the isolation tree construction step: (1) select one of the customer sample subsets; (2) select one of the preset risk feature items and a corresponding value according to the WoE values; (3) split the customer samples in the selected subset; (4) repeat steps (2) and (3) until the isolation tree reaches the set height limit.
Step S106B is repeated to obtain multiple isolation trees, which are combined into the data anomaly detection model.
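Steps S106A and S106B can be sketched as follows. The text says the feature item and value are selected "according to the WoE values" without giving the rule, so this sketch assumes one possible reading: each feature is drawn with probability proportional to a positive, WoE-derived weight, and the split value is drawn uniformly between the feature's minimum and maximum. The dict-based node layout, all names, and the sample data are hypothetical.

```python
import random


def build_tree(samples, features, weights, height_limit, rng, depth=0):
    """Recursively split samples until the height limit or a single sample is reached."""
    if depth >= height_limit or len(samples) <= 1:
        return {"size": len(samples)}  # external (leaf) node
    feature = rng.choices(features, weights=weights, k=1)[0]  # assumed WoE-weighted choice
    lo = min(s[feature] for s in samples)
    hi = max(s[feature] for s in samples)
    if lo == hi:
        return {"size": len(samples)}  # cannot split on a constant feature
    split = rng.uniform(lo, hi)  # split value between the feature's min and max
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, features, weights, height_limit, rng, depth + 1),
        "right": build_tree(right, features, weights, height_limit, rng, depth + 1),
    }


def build_forest(samples, features, weights, n_trees, subset_size, height_limit, seed=0):
    """Draw one subset per tree by sampling with replacement, then build each tree."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        subset = rng.choices(samples, k=subset_size)  # sampling with replacement
        forest.append(build_tree(subset, features, weights, height_limit, rng))
    return forest


def tree_height(tree):
    if "feature" not in tree:
        return 0
    return 1 + max(tree_height(tree["left"]), tree_height(tree["right"]))


def count_samples(tree):
    if "feature" not in tree:
        return tree["size"]
    return count_samples(tree["left"]) + count_samples(tree["right"])


customers = [{"age": a, "income": i} for a, i in
             [(23, 8000), (41, 15000), (35, 12000), (29, 9000),
              (52, 20000), (38, 11000), (75, 500), (33, 10000)]]
# Hypothetical weights, e.g. derived from the magnitude of each feature's WoE values.
forest = build_forest(customers, ["age", "income"], [1.2, 0.4],
                      n_trees=5, subset_size=8, height_limit=4, seed=42)
```

Because every split partitions the samples of its node, the leaf sizes of each tree always add up to the subset size, which is a convenient sanity check.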
Embodiment two
Based on the data anomaly detection model trained in embodiment one, embodiment two trains a policy underwriting risk assessment model.
Referring to Fig. 2, a flowchart of the steps of the data model training method of embodiment two of the present invention is shown. It should be understood that the flowchart in this method embodiment does not limit the order in which the steps are performed. The details are as follows.
Step S200: obtain a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples.
Step S202: input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model.
The target sample data set traverses each isolation tree. Starting from the root node of each isolation tree, it moves from the root toward a leaf according to the preset risk feature item and split value selected when the tree was built: if the value of the corresponding preset risk feature item is less than the split value, the target sample data set goes to the left subtree; otherwise it goes to the right subtree. This continues until a leaf node is reached, and the number of edges traversed along the way is recorded as the path length in that isolation tree. The anomaly coefficient of the target sample data set is then calculated from its path lengths in all the isolation trees.
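The traversal described above can be sketched on a small hand-built tree. The dict-based node layout (`feature`, `split`, `left`, `right`, `size`) and the example values are assumptions made for illustration only.

```python
def path_length(tree, sample):
    """Count the edges walked from the root to the leaf that the sample falls into."""
    edges = 0
    node = tree
    while "feature" in node:  # internal nodes carry a feature and a split value
        if sample[node["feature"]] < node["split"]:
            node = node["left"]   # value below the split value: left subtree
        else:
            node = node["right"]  # otherwise: right subtree
        edges += 1
    return edges


# Hypothetical isolation tree with two internal nodes.
example_tree = {
    "feature": "age", "split": 60,
    "left": {
        "feature": "income", "split": 5000,
        "left": {"size": 3},
        "right": {"size": 4},
    },
    "right": {"size": 1},
}
short_path = path_length(example_tree, {"age": 75, "income": 500})
long_path = path_length(example_tree, {"age": 30, "income": 3000})
```

The 75-year-old sample is separated at the first split, so its path is shorter than that of the sample that has to pass through both splits.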
For example, n target customer samples are randomly selected from M given customer samples; the n target customer samples correspond to n target sample data sets, and the anomaly coefficient of each target sample data set is calculated by the following formulas:
s(x, n) = 2^{-E(h(x)) / c(n)}
c(n) = 2H(n-1) - 2(n-1)/n
H(k) = \ln(k) + \xi
where x identifies one of the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in isolation trees built from n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; \xi is Euler's constant; and H(k) is the harmonic number.
The closer s(x, n) is to 1, the more likely the target sample data set is an isolated anomalous point; an s(x, n) below 0.5 indicates that the target sample data set is a normal data set.
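The three formulas can be evaluated directly, as in the sketch below. The function names are hypothetical, Euler's constant is approximated as 0.5772156649, and returning 0.0 from `c` for n <= 1 is a guard added here rather than something stated in the text.

```python
import math

EULER_CONSTANT = 0.5772156649  # approximate value of the constant written as xi above


def harmonic(k):
    """H(k) = ln(k) + xi, the harmonic number approximation used in the text."""
    return math.log(k) + EULER_CONSTANT


def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, the average path length for n data sets."""
    if n <= 1:
        return 0.0  # guard added for this sketch
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_coefficient(path_lengths, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)), with E(h(x)) the mean path length over the trees."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / c(n))


# Hypothetical path lengths of two data sets over four isolation trees, with n = 256.
suspicious = anomaly_coefficient([2, 3, 2, 1], 256)
ordinary = anomaly_coefficient([12, 11, 13, 12], 256)
```

A data set that is isolated after only a few splits gets a coefficient close to 1, while one whose mean path length exceeds c(n) falls below 0.5, matching the interpretation given above.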
Step S204: screen the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set.
The anomaly coefficient of every target sample data set in the third training data set is greater than a preset threshold.
Step S206: train a policy underwriting risk assessment model according to the third training data set.
The policy underwriting risk assessment model may be an LR (Logistic Regression) model, an FM (Factorization Machine) model, a deep neural network model, or a combination of these models.
Embodiment three
With continued reference to Fig. 3, a schematic diagram of the program modules of embodiment three of the data model training system of the present invention is shown. In this embodiment, the data model training system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the data model training method described above. A program module, as referred to in the embodiments of the present invention, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the data model training system 20 in the storage medium. The following description details the function of each program module of this embodiment:
Module 200 is obtained, for obtaining the first training dataset, first training dataset includes multiple client's samples
Multiple sample data sets.
In an exemplary embodiment, the acquisition module 200 is configured to: configure multiple risk feature fields of a target database according to multiple preset risk feature items, and establish a mapping relationship between each risk feature field and the corresponding fields in multiple data sources; extract multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping relationship; and clean out invalid raw data sets that lack multiple risk feature data, so as to obtain multiple valid sample data sets from the multiple raw data sets.
Extracting the multiple raw data sets of the multiple customer samples from the multiple data sources according to the mapping relationship includes: when the same risk feature field of the same customer sample corresponds to multiple risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
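As a non-limiting illustration, the extraction and cleaning performed by the acquisition module 200 can be sketched as follows. The data source names, field names, weight coefficients, and the `max_missing` cutoff are all hypothetical; in particular, the specification does not state how many missing risk feature data make a raw data set invalid, so that cutoff is an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical preset weight coefficient of each data source.
SOURCE_WEIGHTS = {"crm": 0.9, "claims": 0.7, "third_party": 0.4}

# Hypothetical mapping: risk feature field -> field name in each data source.
FIELD_MAPPING = {
    "age": {"crm": "customer_age", "claims": "age_at_claim"},
    "smoker": {"crm": "is_smoker", "third_party": "smoking_flag"},
}


def extract_raw_data_set(customer_rows: Dict[str, dict]) -> Dict[str, Optional[object]]:
    """Build one raw data set for a customer from rows keyed by data source."""
    record: Dict[str, Optional[object]] = {}
    for feature, per_source in FIELD_MAPPING.items():
        candidates = [
            (SOURCE_WEIGHTS[source], customer_rows[source][column])
            for source, column in per_source.items()
            if source in customer_rows
            and customer_rows[source].get(column) is not None
        ]
        # Several sources may supply the same field: keep the value from
        # the source with the highest weight coefficient.
        record[feature] = max(candidates, key=lambda c: c[0])[1] if candidates else None
    return record


def clean(raw_data_sets: List[dict], max_missing: int = 1) -> List[dict]:
    """Drop raw data sets missing more than max_missing feature values
    (the cutoff is an assumption, not stated in the specification)."""
    return [
        r for r in raw_data_sets
        if sum(v is None for v in r.values()) <= max_missing
    ]


rows = {
    "crm": {"customer_age": 42, "is_smoker": None},
    "claims": {"age_at_claim": 41},
    "third_party": {"smoking_flag": True},
}
record = extract_raw_data_set(rows)
valid = clean([record, {"age": None, "smoker": None}])
```

Here the age reported by the higher-weight source is kept, and the second raw data set is discarded because both of its risk feature data are missing.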
Recording module 202, configured to record the multiple sample data in each sample data set into the multiple risk feature fields corresponding to the multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields.
In the exemplary embodiment,
Analysis module 204, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE (weight of evidence) of each preset risk feature item.
In an exemplary embodiment, the formula for calculating the WoE of each preset risk feature item is as follows:
WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk evaluation result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in each value interval to the number of high-risk underwriting cases across all intervals; Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in each value interval to the number of non-high-risk underwriting cases across all intervals.
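As a non-limiting illustration, a WoE calculation consistent with the definitions of Py_i and Pz_i above can be sketched as follows. The sketch assumes the conventional weight-of-evidence form WoE_i = ln(Py_i / Pz_i), which matches these symbols but is not reproduced in the text of this specification; the function name, the interval labels, and the `smoothing` parameter (added to avoid taking the logarithm of zero) are likewise assumptions.

```python
import math
from typing import Dict, List, Tuple


def weight_of_evidence(
    rows: List[Tuple[str, bool]], smoothing: float = 0.5
) -> Dict[str, float]:
    """Compute WoE per value interval of one preset risk feature item.

    rows holds (value_interval, is_high_risk) pairs. Assumes the
    conventional form WoE = ln(Py / Pz). With smoothing=0 an interval
    that has no high-risk or no non-high-risk case raises an error.
    """
    high: Dict[str, float] = {}
    low: Dict[str, float] = {}
    for interval, is_high_risk in rows:
        counts = high if is_high_risk else low
        counts[interval] = counts.get(interval, 0) + 1

    intervals = sorted(set(high) | set(low))
    total_high = sum(high.values()) + smoothing * len(intervals)
    total_low = sum(low.values()) + smoothing * len(intervals)

    woe = {}
    for interval in intervals:
        py = (high.get(interval, 0) + smoothing) / total_high  # Py_i
        pz = (low.get(interval, 0) + smoothing) / total_low    # Pz_i
        woe[interval] = math.log(py / pz)
    return woe


# Hypothetical age intervals: 1 high-risk and 3 non-high-risk cases in
# "18-30", 3 high-risk and 1 non-high-risk case in "31-50".
rows = (
    [("18-30", True)] + [("18-30", False)] * 3
    + [("31-50", True)] * 3 + [("31-50", False)]
)
woe = weight_of_evidence(rows, smoothing=0.0)
```

Under this assumed form, a positive WoE marks an interval in which high-risk cases are over-represented and a negative WoE marks the opposite.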
Anomaly detection model training module 206, configured to train multiple isolation trees using the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
In an exemplary embodiment, the anomaly detection model training module 206 is further configured to: obtain multiple customer sample subsets from the multiple customer samples by sampling with replacement; perform an isolation tree construction step: (1) select one customer sample subset from the multiple customer sample subsets; (2) select one preset risk feature item and a corresponding value according to the multiple WoE values; (3) partition the multiple customer samples in the selected customer sample subset; (4) repeat steps (2) to (4) until the isolation tree reaches the set height limit; and repeat the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees being combined into the data anomaly detection model.
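As a non-limiting illustration, the isolation tree construction step can be sketched as follows. The specification does not say exactly how the WoE values guide the choice of risk feature item; this sketch assumes each feature is drawn with probability proportional to the absolute value of a single per-feature WoE summary, and that the split value is drawn uniformly within the feature's range. All names and sample values are hypothetical.

```python
import random
from typing import Dict, List

Sample = Dict[str, float]


def build_isolation_tree(samples: List[Sample], woe: Dict[str, float],
                         height_limit: int, rng: random.Random,
                         height: int = 0) -> dict:
    """Recursively partition samples until the height limit is reached."""
    if height >= height_limit or len(samples) <= 1:
        return {"size": len(samples)}
    # Only features that still vary inside this node can split it.
    splittable = [f for f in woe if len({s[f] for s in samples}) > 1]
    if not splittable:
        return {"size": len(samples)}
    # Assumption: choose the feature with probability proportional to |WoE|.
    weights = [abs(woe[f]) + 1e-9 for f in splittable]
    feature = rng.choices(splittable, weights=weights, k=1)[0]
    values = [s[feature] for s in samples]
    split = rng.uniform(min(values), max(values))
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_isolation_tree(left, woe, height_limit, rng, height + 1),
        "right": build_isolation_tree(right, woe, height_limit, rng, height + 1),
    }


def build_forest(samples: List[Sample], woe: Dict[str, float], n_trees: int,
                 subset_size: int, height_limit: int, seed: int = 0) -> List[dict]:
    """Build several trees, each on a subset drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        subset = rng.choices(samples, k=subset_size)  # sampling with replacement
        forest.append(build_isolation_tree(subset, woe, height_limit, rng))
    return forest


def tree_depth(node: dict) -> int:
    """Depth of a tree; a leaf has depth 0."""
    if "feature" not in node:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))


samples = [{"age": float(a), "bmi": 18.0 + (a % 7)} for a in range(20, 60)]
woe = {"age": 0.8, "bmi": -0.3}  # hypothetical per-feature WoE summaries
forest = build_forest(samples, woe, n_trees=5, subset_size=16, height_limit=4)
```

Weighting the feature choice by |WoE| is only one possible reading of "according to the multiple WoE values"; a deterministic ranking would fit the text equally well.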
In an exemplary embodiment, the system 20 further includes a risk evaluation model training module 208, configured to: obtain a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples; input each target sample data set into the data anomaly detection model, and output the anomaly coefficient of each target sample data set through the data anomaly detection model; screen the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, the anomaly coefficient of each target sample data set in the third training data set being greater than a preset threshold; and train the policy underwriting risk evaluation model according to the third training data set.
The anomaly coefficient of each target sample data set is calculated by the following formulas:
c(n) = 2H(n-1) - (2(n-1)/n)
H(k) = ln(k) + ξ
x identifies one target sample data set among the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in the isolation trees built from the n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; H(k) is the harmonic number.
Embodiment four
Referring to Fig. 4, a schematic diagram of the hardware architecture of the computer equipment of embodiment four of the present invention is shown. In this embodiment, the computer equipment 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer equipment 2 may be a rack server, a blade server, a tower server, or a cabinet server (including a standalone server or a server cluster composed of multiple servers), etc. As shown in the figure, the computer equipment 2 includes at least, but is not limited to, a memory 21, a processor 22, a network interface 23, and the data model training system 20, which can be communicatively connected to one another via a system bus. Wherein:
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer equipment 2, such as a hard disk or internal memory of the computer equipment 2. In other embodiments, the memory 21 may also be an external storage device of the computer equipment 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer equipment 2. Of course, the memory 21 may also include both the internal storage unit of the computer equipment 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer equipment 2, such as the program code of the data model training system 20 of embodiment three. In addition, the memory 21 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer equipment 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the data model training system 20, so as to implement the data model training method of embodiment one or two.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer equipment 2 and other electronic devices. For example, the network interface 23 is used to connect the computer equipment 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer equipment 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be noted that Fig. 4 shows only the computer equipment 2 with components 20-23; it should be understood, however, that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In this embodiment, the data model training system 20 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention.
For example, Fig. 3 shows a schematic diagram of the program modules of embodiment three of the data model training system 20. In that embodiment, the data model training system 20 may be divided into the acquisition module 200, the recording module 202, the analysis module 204, the anomaly detection model training module 206, and the risk evaluation model training module 208. A program module, as used in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing how the data model training system 20 executes in the computer equipment 2. The specific functions of the program modules 200-208 have been described in detail in embodiment three and are not repeated here.
Embodiment five
This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, or App store, on which a computer program is stored that implements the corresponding function when executed by a processor. The computer-readable storage medium of this embodiment is used to store the data model training system 20, which, when executed by a processor, implements the data model training method of embodiment one or two.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the preferred implementation.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.
Claims (10)
1. A data model training method, characterized in that the method comprises:
obtaining a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
recording the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
analyzing the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and
training multiple isolation trees using the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
2. The data model training method according to claim 1, characterized in that the step of obtaining the first training data set comprises:
configuring multiple risk feature fields of a target database according to multiple preset risk feature items, and establishing a mapping relationship between each risk feature field and the corresponding fields in multiple data sources;
extracting multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping relationship; and
cleaning out invalid raw data sets that lack multiple risk feature data, so as to obtain the multiple valid sample data sets from the multiple raw data sets.
3. The data model training method according to claim 2, characterized in that the step of extracting multiple raw data sets of multiple customer samples from the multiple data sources according to the mapping relationship comprises:
when the same risk feature field of the same customer sample corresponds to multiple risk feature data in multiple data sources, selecting, according to the preset weight coefficient of each data source, the risk feature data from the data source with the highest weight coefficient.
4. The data model training method according to claim 3, characterized in that the formula for calculating the WoE of each preset risk feature item is as follows:
WoE_i denotes the influence coefficient of the value of preset risk feature item i on the policy underwriting risk evaluation result; Py_i denotes, for preset risk feature item i, the ratio of the number of high-risk underwriting cases in each value interval to the number of high-risk underwriting cases across all intervals; Pz_i denotes, for preset risk feature item i, the ratio of the number of non-high-risk underwriting cases in each value interval to the number of non-high-risk underwriting cases across all intervals.
5. The data model training method according to claim 4, characterized in that the step of training multiple isolation trees using the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model, comprises:
obtaining multiple customer sample subsets from the multiple customer samples by sampling with replacement;
an isolation tree construction step: (1) selecting one customer sample subset from the multiple customer sample subsets; (2) selecting one preset risk feature item and a corresponding value according to the multiple WoE values; (3) partitioning the multiple customer samples in the selected customer sample subset; (4) repeating steps (2) to (4) until the isolation tree reaches the set height limit;
repeating the isolation tree construction step to obtain multiple isolation trees, the multiple isolation trees being combined into the data anomaly detection model.
6. The data model training method according to claim 5, characterized by further comprising:
obtaining a second training data set, the second training data set including multiple target sample data sets of multiple target customer samples;
inputting each target sample data set into the data anomaly detection model, and outputting the anomaly coefficient of each target sample data set through the data anomaly detection model;
screening the second training data set according to the anomaly coefficient of each target sample data set to obtain a third training data set, the anomaly coefficient of each target sample data set in the third training data set being greater than a preset threshold;
training the policy underwriting risk evaluation model according to the third training data set.
7. The data model training method according to claim 6, characterized in that the anomaly coefficient of each target sample data set is calculated by the following formulas:
c(n) = 2H(n-1) - (2(n-1)/n)
H(k) = ln(k) + ξ
x identifies one target sample data set among the n target sample data sets; s(x, n) is the anomaly coefficient obtained for target sample data set x in the isolation trees built from the n target sample data sets; c(n) is the average path length of an isolation tree when the number of target sample data sets is n; E(h(x)) is the mean path length of target sample data set x over the multiple isolation trees; h(x) is the path length of target sample data set x in each isolation tree; k is the path length from the root node to a leaf node; ξ is Euler's constant; H(k) is the harmonic number.
8. A data model training system, characterized by comprising:
an acquisition module, configured to obtain a first training data set, the first training data set including multiple sample data sets of multiple customer samples;
a recording module, configured to record the multiple sample data in each sample data set into multiple risk feature fields corresponding to multiple preset risk feature items, to obtain multiple risk feature columns corresponding to the multiple risk feature fields;
an analysis module, configured to analyze the risk feature column corresponding to each risk feature field to obtain the WoE of each preset risk feature item; and
an anomaly detection model training module, configured to train multiple isolation trees using the multiple sample data sets and the multiple WoE values corresponding to the multiple preset risk feature items, to obtain a data anomaly detection model.
9. A computer equipment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, the steps of the data model training method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, the computer program being executable by at least one processor so that the at least one processor performs the steps of the data model training method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910665010.XA CN110517154A (en) | 2019-07-23 | 2019-07-23 | Data model training method, system and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910665010.XA CN110517154A (en) | 2019-07-23 | 2019-07-23 | Data model training method, system and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110517154A true CN110517154A (en) | 2019-11-29 |
Family
ID=68623317
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910665010.XA Pending CN110517154A (en) | 2019-07-23 | 2019-07-23 | Data model training method, system and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110517154A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111090685A (en) * | 2019-12-19 | 2020-05-01 | 第四范式(北京)技术有限公司 | Method and device for detecting data abnormal characteristics |
CN112560938A (en) * | 2020-12-11 | 2021-03-26 | 上海哔哩哔哩科技有限公司 | Model training method and device and computer equipment |
CN113065610A (en) * | 2019-12-12 | 2021-07-02 | 支付宝(杭州)信息技术有限公司 | Isolated forest model construction and prediction method and device based on federal learning |
CN113516513A (en) * | 2021-07-20 | 2021-10-19 | 重庆度小满优扬科技有限公司 | Data analysis method and device, computer equipment and storage medium |
CN114219026A (en) * | 2021-12-15 | 2022-03-22 | 中兴通讯股份有限公司 | Data processing method and device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868878A (en) * | 2015-01-21 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Method and device for MAC (Media Access Control) address risk identification |
CN108549954A (en) * | 2018-03-26 | 2018-09-18 | 平安科技(深圳)有限公司 | Risk model training method, risk identification method, device, equipment and medium |
CN108985632A (en) * | 2018-07-16 | 2018-12-11 | 国网上海市电力公司 | A kind of electricity consumption data abnormality detection model based on isolated forest algorithm |
CN109345137A (en) * | 2018-10-22 | 2019-02-15 | 广东精点数据科技股份有限公司 | A kind of rejecting outliers method based on agriculture big data |
CN109948738A (en) * | 2019-04-11 | 2019-06-28 | 合肥工业大学 | Energy consumption method for detecting abnormality, the apparatus and system of coating drying room |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113065610A (en) * | 2019-12-12 | 2021-07-02 | 支付宝(杭州)信息技术有限公司 | Isolated forest model construction and prediction method and device based on federal learning |
CN113065610B (en) * | 2019-12-12 | 2022-05-17 | 支付宝(杭州)信息技术有限公司 | Isolated forest model construction and prediction method and device based on federal learning |
CN111090685A (en) * | 2019-12-19 | 2020-05-01 | 第四范式(北京)技术有限公司 | Method and device for detecting data abnormal characteristics |
CN111090685B (en) * | 2019-12-19 | 2023-08-22 | 第四范式(北京)技术有限公司 | Method and device for detecting abnormal characteristics of data |
CN112560938A (en) * | 2020-12-11 | 2021-03-26 | 上海哔哩哔哩科技有限公司 | Model training method and device and computer equipment |
CN112560938B (en) * | 2020-12-11 | 2023-08-25 | 上海哔哩哔哩科技有限公司 | Model training method and device and computer equipment |
CN113516513A (en) * | 2021-07-20 | 2021-10-19 | 重庆度小满优扬科技有限公司 | Data analysis method and device, computer equipment and storage medium |
CN114219026A (en) * | 2021-12-15 | 2022-03-22 | 中兴通讯股份有限公司 | Data processing method and device and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516910A (en) | Declaration form core based on big data protects model training method and core protects methods of risk assessment | |
CN110517154A (en) | Data model training method, system and computer equipment | |
CN108446281B (en) | Method, device and storage medium for determining user intimacy | |
Martínez‐Meyer et al. | Conservatism of ecological niche characteristics in North American plant species over the Pleistocene‐to‐Recent transition | |
CN110196834A (en) | It is a kind of for data item, file, database to mark method and system | |
CN111159428A (en) | Method and device for automatically extracting event relation of knowledge graph in economic field | |
CN111179055B (en) | Credit line adjusting method and device and electronic equipment | |
CN115547466B (en) | Medical institution registration and review system and method based on big data | |
CN112016793B (en) | Resource allocation method and device based on target user group and electronic equipment | |
CN112508456A (en) | Food safety risk assessment method, system, computer equipment and storage medium | |
Barankin et al. | Evidence-driven approach for assessing social vulnerability and equality during extreme climatic events | |
CN115936895A (en) | Risk assessment method, device and equipment based on artificial intelligence and storage medium | |
Bateman et al. | The The Supervised Learning Workshop: A New, Interactive Approach to Understanding Supervised Learning Algorithms | |
CN115495711A (en) | Natural disaster insurance prediction processing method and system and electronic equipment | |
CN114493853A (en) | Credit rating evaluation method, credit rating evaluation device, electronic device and storage medium | |
CN117934154A (en) | Transaction risk prediction method, model training method, device, equipment, medium and program product | |
CN110837604B (en) | Data analysis method and device based on housing monitoring platform | |
CN112446777B (en) | Credit evaluation method, device, equipment and storage medium | |
CN112100165B (en) | Traffic data processing method, system, equipment and medium based on quality assessment | |
CN114581219A (en) | Anti-telecommunication network fraud early warning method and system | |
CN113255806A (en) | Sample feature determination method, sample feature determination device and electronic equipment | |
CN114372867A (en) | User credit verification and evaluation method and device and computer equipment | |
CN113436023A (en) | Financial product recommendation method and device based on block chain | |
CN113379212A (en) | Block chain-based logistics information platform default risk assessment method, device, equipment and medium | |
Phuong et al. | Multiple trend tests on air temperature and precipitation anomalies in Vietnam |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191129 |