CN107092919A - Method and device for optimizing user sample features - Google Patents
Method and device for optimizing user sample features
- Publication number: CN107092919A
- Application number: CN201610091834.7A
- Authority: CN (China)
- Prior art keywords: sample, user, feature, interval
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
This application discloses a method for optimizing user sample features, so that the constructed feature values better fit the trend in the concentration of positive samples. The method includes: determining a feature to be optimized of the user samples in a user sample set, where the user sample set contains positive samples; dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1; for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and taking the ratio calculated for each interval as the new value of the feature for every user sample in that interval. This application also discloses a device for optimizing user sample features.
Description
Technical Field
This application relates to the field of computer technology, and in particular to a method and device for optimizing user sample features.
Background
With the continuing development of information technology, the era of big data has arrived. Merchants and enterprises can collect massive numbers of user samples through the service platforms they provide. These user samples usually have many features, such as the amount a user spends shopping online, records of returns and exchanges, the amount spent on financial investment products, and the closeness of the relationship between user A and user B. The features of these user samples are processed and then fed into a model for training, finally yielding a classification model that can predict the behavior of new users. Once the classification model is obtained, a new user sample is processed and input into it, and the model's computation produces a prediction for that user sample, for example whether the user's credit is good or poor.
When the features of user samples are processed, the feature values are usually transformed to obtain new values of the feature. The approach commonly used at present is max-min processing, which has two steps. First, the maximum and minimum of the feature across the user samples are determined. Second, the value of the feature for each user sample is rescaled using the max-min method, which maps the new value range of the feature to between 0 and 1.
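The two steps of the max-min processing described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `min_max_scale`, the sample amounts, and the handling of the all-equal case are assumptions.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is identical, map them all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical spending amounts for four user samples.
amounts = [100, 202, 25000, 30000]
scaled = min_max_scale(amounts)
print(scaled)
```

The result depends only on the feature values and never on the sample labels, which is consistent with the shortcoming discussed next.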
Processing user sample features with the max-min method above tends to produce new feature values that do not fit the trend in the concentration of positive samples. As a result, the model may fail to learn the linear pattern of the feature well during training, which weakens the model's learning and lowers its prediction accuracy.
Summary of the Invention
In view of the above technical problem, the embodiments of this application provide a method and device for optimizing user sample features, so that the constructed feature values better fit the trend in the concentration of positive samples.
The embodiments of this application adopt the following technical solutions:
A method for optimizing user sample features, including: determining a feature to be optimized of the user samples in a user sample set, where the user sample set contains positive samples; dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1; for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and taking the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Preferably, after the ratio calculated for each interval is taken as the new value of the feature for every user sample in that interval, the method further includes: normalizing the new values of the feature of the user samples.
Preferably, normalizing the new values of the feature of the user samples specifically includes: determining the maximum and minimum among the new values of the feature; and processing each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where F_new is the value after processing, F_old is the new value of the feature before processing, and F_max and F_min are the maximum and minimum among the new values of the feature, respectively.
Preferably, before the ratio calculated for each interval is taken as the new value of the feature for every user sample in that interval, the method further includes: selecting the features for which the ratios in the intervals do not satisfy a linear relationship with the preset values determined by the predetermined quantiles.
Preferably, dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature specifically includes: sorting the user samples by their values of the feature; and dividing the user samples in the user sample set into N+1 intervals, using the values at the N quantiles as boundaries.
Preferably, after the new values of the feature of the user samples are normalized, the method further includes: inputting the processed user samples into a linear model for training.
A device for optimizing user sample features, including a feature determination module, an interval division module, a ratio calculation module, and a feature value determination module, where: the feature determination module is configured to determine a feature to be optimized of the user samples in a user sample set, the user sample set containing positive samples; the interval division module is configured to divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1; the ratio calculation module is configured to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and the feature value determination module is configured to take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Preferably, the device further includes a normalization module, where the normalization module is configured to normalize the new values of the feature of the user samples.
Preferably, the normalization module specifically includes a determination subunit and a processing subunit, where: the determination subunit is configured to determine the maximum and minimum among the new values of the feature; and the processing subunit is configured to process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where F_new is the value after processing, F_old is the new value of the feature before processing, and F_max and F_min are the maximum and minimum among the new values of the feature, respectively.
Preferably, the device further includes a model training module, where the model training module is configured to input the processed user samples into a linear model for training.
At least one of the above technical solutions adopted by the embodiments of this application can achieve the following beneficial effects. After the user samples in the user sample set are divided into intervals, the ratio of the number of positive samples to the total number of user samples is calculated for each interval, and the calculated ratio is used as the new value of the feature for every user sample in the interval. The new feature values can thus fit the rising or falling trend in the concentration of positive samples, which ultimately allows the model to be trained fully on the feature. In addition, using the calculated ratio as the new feature value of every user sample in the interval makes the concentration of positive samples and the new feature values satisfy a linear relationship, which solves the problem of the concentration of positive samples not being linearly related to the feature values.
Brief Description of the Drawings
The accompanying drawings described here are provided for further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of the method for optimizing user sample features provided by an embodiment of this application;
Fig. 2 is a schematic diagram, provided by an embodiment of this application, in which the ratios in the intervals satisfy a linear relationship with the preset values determined by the predetermined quantiles;
Fig. 3 is a schematic diagram, provided by an embodiment of this application, in which the ratios in the intervals do not satisfy a linear relationship with the preset values determined by the predetermined quantiles;
Fig. 4 is a structural block diagram of the device for optimizing user sample features provided by an embodiment of this application.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, fall within the scope of protection of this application.
Fig. 1 is a schematic flowchart of the method for optimizing user sample features provided by an embodiment of this application. The method mainly includes the following steps:
Step 11: Determine the feature to be optimized of the user samples in the user sample set;
A user sample is usually collected historical data related to a user, such as basic user information, bank account information, social information, hobbies and preferences, usage information, payment information, and debt information. This historical data contains multiple features, which are usually expressed as numerical values. Taking a personal credit reporting business as an example, the features in the historical data of a user sample may include the amount the user spends shopping online, the number of returns and exchanges, the amount spent on financial investment products, credit card repayment over a certain period, and the amount owed on credit cards. A user sample therefore involves many features, so one or several features must first be selected from the user samples for analysis. If several features are selected at once, each of them can be processed separately through steps 12, 13, and 14 below.
The user samples above normally belong to a user sample set. The number of user samples in the set can be chosen according to actual needs, for example from several thousand to tens of thousands. In general, the more user samples there are, the more accurately they reflect the actual situation; however, both feature construction and model training then take correspondingly more computation and processing time than with fewer samples.
In addition, the user samples here include positive samples. A positive sample may be a labeled sample, that is, a sample whose category has been marked manually or identified by a computer. For example, the samples with poor credit among the user samples are labeled, and these poor-credit user samples are called positive samples; correspondingly, samples with good credit can serve as negative samples.
Step 12: According to each user sample's value of the feature, divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles;
The features in the user samples are usually expressed as numerical values, and there are many user samples. The user samples in the user sample set can be divided into N+1 intervals at N predetermined quantiles according to each sample's value of the determined feature, where N is a positive integer. For example, suppose the feature to be optimized is the spending amount. To divide the intervals, the spending amounts of all user samples are first sorted in ascending order; the sorted result may look like (100, 202, ..., 25000, 30000). Note that each sorted feature value should remain associated with its original user sample. After sorting, the user samples in the set are divided into N+1 intervals at the N predetermined quantiles according to the size of each user sample's spending amount.
As an example of quantiles: given 100 samples to be divided into 10 intervals, the values at positions 10, 20, ..., 90 after sorting can be used as the quantiles.
In addition, the number of user samples in each of the N+1 intervals may be equal. For example, with 100 user samples in total there are 100 values after sorting in ascending order; these can be divided into 10 intervals at the N predetermined quantiles, so that each interval holds 10 user samples. Besides equal counts, the number of user samples in each interval may also be allocated in some proportion; this embodiment places no limitation on this.
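The 100-sample example of step 12 can be sketched as follows. This is an illustrative reading rather than the patent's implementation: the function names, the rank rule used to pick the quantile values, and the convention that a boundary value falls into the lower interval are all assumptions.

```python
from bisect import bisect_left
from collections import Counter


def quantile_boundaries(values, n):
    """Return n boundary values taken at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    total = len(ordered)
    if total < n + 1:
        raise ValueError("need at least n + 1 samples")
    # The k-th boundary is the value at 1-based rank k * total / (n + 1).
    return [ordered[(k * total) // (n + 1) - 1] for k in range(1, n + 1)]


def assign_intervals(values, boundaries):
    """Map each value to an interval index in 0..n (a boundary value goes to the lower interval)."""
    return [bisect_left(boundaries, v) for v in values]


amounts = list(range(1, 101))             # 100 hypothetical feature values
bounds = quantile_boundaries(amounts, 9)  # values at ranks 10, 20, ..., 90
interval_ids = assign_intervals(amounts, bounds)
sizes = Counter(interval_ids)
print(bounds)
print(sizes)
```

Because the index list is aligned with the input order, each value stays associated with its original user sample. With tied feature values the intervals may end up with unequal counts, which the text above permits.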
Step 13: For each of the N+1 intervals, calculate the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
As described above, the user samples include labeled positive samples. After step 12 has divided the user samples into N+1 intervals, the ratio of the number of positive samples in an interval to the total number of user samples in that interval can be calculated for each interval. For example, if an interval contains 10 user samples in total and 3 of them are positive samples, the ratio of positive samples to all user samples in that interval is 0.3.
Note that, within the total number of user samples mentioned here, the samples other than the positive samples may all be negative samples; they may be partly samples of unknown category and partly negative samples; or they may all be samples of unknown category. For example, suppose user credit is divided into good credit and poor credit, with poor credit as the positive class and good credit as the negative class. The samples other than the poor-credit samples may then all be good-credit negative samples; they may be partly samples of unknown category and partly good-credit negative samples; or they may all be samples of unknown credit category.
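The ratio calculation of step 13 can be sketched as follows. The label encoding (1 for positive, 0 for negative, `None` for unknown category) and the function name are assumptions made for illustration; unknown samples still count toward the interval total, as the text above describes.

```python
from collections import defaultdict


def positive_ratio_per_interval(interval_ids, labels):
    """Return {interval index: positives / all samples in that interval}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for idx, label in zip(interval_ids, labels):
        totals[idx] += 1
        if label == 1:
            positives[idx] += 1
    return {idx: positives[idx] / totals[idx] for idx in totals}


# Interval 0: 3 positives out of 10; interval 1: 5 positives and 5 unknown samples.
interval_ids = [0] * 10 + [1] * 10
labels = [1, 1, 1] + [0] * 7 + [1] * 5 + [None] * 5
ratios = positive_ratio_per_interval(interval_ids, labels)
print(ratios)
```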
Step 14: Take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Step 13 calculates, for each of the N+1 intervals, the ratio of the number of positive samples to the total number of user samples in the interval, giving N+1 ratios. In this step, for each of the N+1 intervals, the ratio calculated for the interval is taken as the new value of the feature for every user sample in that interval. Continuing the earlier example, if an interval contains 10 user samples and the ratio of positive samples to all user samples in that interval is 0.3, the new value of the feature for all 10 user samples in the interval is set to 0.3. This step is performed for every interval.
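Step 14 amounts to a lookup from each sample's interval to that interval's ratio. The sketch below assumes the interval indices and ratios come from steps 12 and 13; the function name and the second interval's ratio of 0.5 are illustrative.

```python
def assign_new_values(interval_ids, ratios):
    """Give every sample the positive-sample ratio of the interval it falls in."""
    return [ratios[idx] for idx in interval_ids]


# Ten samples in interval 0 (ratio 0.3) and ten in interval 1 (ratio 0.5).
interval_ids = [0] * 10 + [1] * 10
ratios = {0: 0.3, 1: 0.5}
new_values = assign_new_values(interval_ids, ratios)
print(new_values[:10])
```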
With the above technical solution adopted by the embodiment of this application, after the user samples in the user sample set are divided into intervals, the ratio of the number of positive samples to the total number of user samples is calculated for each interval, and the calculated ratio is used as the new value of the feature for every user sample in the interval. The new feature values can thus fit the rising or falling trend in the concentration of positive samples as closely as possible, which ultimately allows the model to be trained fully on the feature and improves the model's fit. In addition, using the calculated ratio as the new feature value of every user sample in the interval makes the concentration of positive samples and the new feature values satisfy a linear relationship, which solves the problem of the concentration of positive samples not being linearly related to the feature values.
After step 14 of the above embodiment, the embodiment may further include the following step: normalizing the new values of the feature of the user samples. The normalization may use the max-min method, the mean method, or the median method from among the linear function methods. The max-min method normalizes the feature values to the range [0, 1]; the mean method can normalize the feature values to an arbitrary range, but the signs of the maximum and the minimum cannot both change; the median method normalizes the feature values to the range [-1, 1]. Other normalization algorithms may of course also be used.
This normalization confines the normalized values to the required range, mainly to make subsequent data processing easier and to speed up convergence when the data is finally used for model training.
Specifically, the following normalization method can be used:
First step: determine the maximum and minimum among the new values of the feature;
Second step: process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where F_new is the value after processing, F_old is the new value of the feature before processing, and F_max and F_min are the maximum and minimum among the new values of the feature, respectively.
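As a worked instance of this normalization, assume purely for illustration that a feature's interval ratios are 0.1, 0.3, and 0.5, so that F_min = 0.1 and F_max = 0.5, and assume the processing takes the usual max-min form. The middle value F_old = 0.3 is then mapped as:

```latex
F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}
        = \frac{0.3 - 0.1}{0.5 - 0.1}
        = 0.5
```

Under the same assumptions, 0.1 maps to 0 and 0.5 maps to 1, so the normalized values span [0, 1] while keeping the ordering of the interval ratios.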
As described above, user samples usually contain many features. After the new values of all the features have been normalized, the preceding method embodiment may further include the following step: inputting the processed user samples into a linear model for training. Some of the many features in the user samples can be selected, processed through steps 11, 12, 13, and 14 of the above embodiment, then normalized, and finally input into a linear model for training. The model is thereby trained fully on the selected features, which improves the model's fit and ultimately its predictive performance.
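The training step can be sketched with a small logistic regression, one common linear model. The source does not say which linear model is used, so this choice, the function names, the learning rate, the epoch count, and the toy data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical normalized feature values (interval ratios) and labels for 20 samples.
xs = [0.0] * 5 + [0.2] * 5 + [0.6] * 5 + [1.0] * 5
ys = [0] * 5 + [1, 0, 0, 0, 0] + [1, 1, 1, 0, 0] + [1] * 5
w, b = train_logistic(xs, ys)
print(w, b, sigmoid(b), sigmoid(w + b))
```

Because the optimized feature rises with the concentration of positive samples, a single positive weight is enough for the model to rank the intervals correctly in this toy data.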
Before step 14 of the above embodiment, the method embodiment may further include the following step: selecting the features for which the ratios in the intervals do not satisfy a linear relationship with the preset values determined by the predetermined quantiles. The linear relationship here mainly refers to whether the ratio in each interval increases monotonically as the preset value determined by the quantiles increases, or decreases monotonically as that preset value increases. The preset value determined by the N quantiles may be the interval divided on the basis of the quantiles, the specific value of a quantile itself, or a number in some proportion to the quantile.
Specifically, to select the features that do not satisfy the linear relationship, it is first judged whether the ratios in the intervals satisfy a linear relationship with the preset values determined by the N predetermined quantiles. The judgment can draw on the results of steps 11, 12, and 13. To make the judgment easier, a Bi-Var curve of the new feature can be drawn from those results, and whether the curve is monotonic decides whether the ratios in the intervals satisfy a linear relationship with the N predetermined quantiles.
Fig. 2 and Fig. 3 are Bi-Var curves of the new feature drawn from the results of steps 11, 12, and 13. The abscissa is the intervals divided according to the preset values determined by the quantiles, and the ordinate is the ratio of the number of positive samples to the total number of user samples in each interval. The figures are only meant to aid understanding; in practice the intervals on the abscissa may be continuous or discrete. The Bi-Var curve in Fig. 2 satisfies the linear relationship, that is, the ordinate increases monotonically with the abscissa, whereas the Bi-Var curve in Fig. 3 does not. Therefore, before the processing of step 14, the features for which the judgment result is negative can be selected, so that step 14 is applied only to the features that do not satisfy the linear relationship.
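The monotonicity judgment can be sketched as a check over the per-interval ratios taken in interval order. The function name, the feature names, the ratio values, and the use of non-strict comparisons (so that equal neighbouring ratios still count as monotonic) are assumptions for illustration.

```python
def is_monotonic(ratios):
    """True if the ratios never decrease, or never increase, across the intervals."""
    pairs = list(zip(ratios, ratios[1:]))
    rising = all(a <= b for a, b in pairs)
    falling = all(a >= b for a, b in pairs)
    return rising or falling


# Hypothetical per-interval positive-sample ratios for two features.
ratios_by_feature = {
    "spending_amount": [0.05, 0.10, 0.20, 0.35],  # monotonic, like Fig. 2
    "return_count": [0.10, 0.30, 0.15, 0.40],     # not monotonic, like Fig. 3
}
to_optimize = [name for name, r in ratios_by_feature.items() if not is_monotonic(r)]
print(to_optimize)
```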
In step 12 of the above embodiment, dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature can specifically be done as follows. First step: sort the user samples by their values of the feature. Second step: divide the user samples in the user sample set into N+1 intervals, using the values at the N quantiles as boundaries. For example, given 100 user samples, the 9 values at sorted positions 10, 20, ..., 90 can be used as the predetermined quantiles, and the user samples are divided into 10 intervals with these values as boundaries. Alternatively, the 4 values at sorted positions 20, 40, ..., 80 can be used as the quantiles, and the user samples are divided into 5 intervals with these values as boundaries. In addition, any two of the N+1 intervals may contain the same number of user samples.
The embodiments above are all method embodiments of this application. Correspondingly, this application also provides an embodiment of a device for optimizing user sample features, so that the constructed feature values better fit the trend in the concentration of positive samples. As shown in Fig. 4, the device includes a feature determination module 21, an interval division module 22, a ratio calculation module 23, and a feature value determination module 24, where:
the feature determination module 21 can be used to determine the feature to be optimized of the user samples in the user sample set, the user sample set containing positive samples;
the interval division module 22 can be used to divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1;
the ratio calculation module 23 can be used to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
the feature value determination module 24 can be used to take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
When the device embodiment operates, the feature determination module first determines the feature to be optimized; the interval division module then divides the user samples in the user sample set into N+1 intervals; the ratio calculation module calculates the ratio of the number of positive samples to the total number of user samples in each interval; and finally the feature value determination module takes the ratio calculated for each interval as the new value of the feature for every user sample in that interval. The new feature values can thus fit the rising or falling trend in the concentration of positive samples, which ultimately allows the model to be trained fully on the feature and improves the model's fit. In addition, using the calculated ratio as the new feature value of every user sample in the interval makes the concentration of positive samples and the new feature values satisfy a linear relationship, which solves the problem of the concentration of positive samples not being linearly related to the feature values.
The above device embodiment may further include a normalization module, which can be used to normalize the new values of the feature of the user samples.
The normalization module specifically includes a determination subunit and a processing subunit, where:
the determination subunit can be used to determine the maximum and minimum among the new values of the feature;
the processing subunit can be used to process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where F_new is the value after processing, F_old is the new value of the feature before processing, and F_max and F_min are the maximum and minimum among the new values of the feature, respectively.
In addition, the above device may further include a model training module, which can be used to input the normalized user samples into a linear model for training.
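One way to picture how modules 21 to 24 cooperate is a single class with one method per module. This is a hypothetical software arrangement, not the device's actual structure: the class name, method names, dictionary keys, and the rank rule for choosing boundaries are all assumptions.

```python
from bisect import bisect_left


class FeatureOptimizer:
    """Illustrative composition of modules 21 to 24; all names are hypothetical."""

    def __init__(self, n_quantiles):
        self.n = n_quantiles

    def determine_feature(self, samples, name):
        # Module 21: pick out the values of the feature to be optimized.
        return [s[name] for s in samples]

    def divide_intervals(self, values):
        # Module 22: use the values at n evenly spaced ranks as boundaries.
        ordered = sorted(values)
        total = len(ordered)
        bounds = [ordered[(k * total) // (self.n + 1) - 1]
                  for k in range(1, self.n + 1)]
        return [bisect_left(bounds, v) for v in values]

    def compute_ratios(self, interval_ids, labels):
        # Module 23: positives divided by all samples, per interval.
        ratios = {}
        for idx in set(interval_ids):
            members = [lab for i, lab in zip(interval_ids, labels) if i == idx]
            ratios[idx] = sum(1 for lab in members if lab == 1) / len(members)
        return ratios

    def assign_values(self, interval_ids, ratios):
        # Module 24: every sample takes the ratio of its interval.
        return [ratios[i] for i in interval_ids]

    def run(self, samples, name, label_key):
        values = self.determine_feature(samples, name)
        labels = [s[label_key] for s in samples]
        ids = self.divide_intervals(values)
        return self.assign_values(ids, self.compute_ratios(ids, labels))


# Twenty hypothetical samples; N = 3 quantiles give 4 intervals of 5 samples each.
labels = [0] * 5 + [1, 0, 0, 0, 0] + [1, 1, 1, 0, 0] + [1] * 5
samples = [{"amount": a, "positive": y} for a, y in zip(range(1, 21), labels)]
new_values = FeatureOptimizer(n_quantiles=3).run(samples, "amount", "positive")
print(new_values)
```

Keeping each module as a separate method mirrors the block diagram and lets a normalization or training stage be appended after `run` without changing the earlier stages.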
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present application shall fall within the scope of its claims.
Claims (10)
1. A method for optimizing a feature of user samples, characterized by comprising:
determining a feature to be optimized of the user samples in a user sample set, the user sample set including positive samples;
dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1;
for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
determining the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
2. The method according to claim 1, characterized in that, after the ratio calculated for each interval is determined as the new value of the feature for each user sample in that interval, the method further comprises:
normalizing the new values of the feature of the user samples.
3. The method according to claim 2, characterized in that normalizing the new values of the feature of the user samples specifically comprises:
determining the maximum and minimum among the new values of the feature;
processing each new value of the feature according to the following formula, and using the processed value as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the value after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are respectively the maximum and minimum among the new values of the feature.
4. The method according to claim 1, characterized in that, before the ratio calculated for each interval is determined as the new value of the feature for each user sample in that interval, the method further comprises:
selecting a feature for which the ratio in each interval and the preset value determined by the predetermined quantiles do not satisfy a linear relationship.
5. The method according to claim 1, characterized in that dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature specifically comprises:
sorting the user samples by their values of the feature;
dividing the user samples in the user sample set into N+1 intervals, using the values at the N quantiles as boundaries.
6. The method according to any one of claims 2 to 3, characterized in that, after the new values of the feature of the user samples are normalized, the method further comprises: inputting the processed user samples into a linear model for training.
7. A device for optimizing a feature of user samples, characterized by comprising a feature determination module, an interval division module, a ratio calculation module, and a feature value determination module, wherein:
the feature determination module is configured to determine a feature to be optimized of the user samples in a user sample set, the user sample set including positive samples;
the interval division module is configured to divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1;
the ratio calculation module is configured to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
the feature value determination module is configured to determine the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
8. The device according to claim 7, characterized in that the device further comprises a normalization module, wherein:
the normalization module is configured to normalize the new values of the feature of the user samples.
9. The device according to claim 8, characterized in that the normalization module specifically comprises a determination subunit and a processing subunit, wherein:
the determination subunit is configured to determine the maximum and minimum among the new values of the feature;
the processing subunit is configured to process each new value of the feature according to the following formula and to use the processed value as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the value after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are respectively the maximum and minimum among the new values of the feature.
10. The device according to any one of claims 8 to 9, characterized in that the device further comprises a model training module, wherein: the model training module is configured to input the processed user samples into a linear model for training.
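The processing recited in claims 1 to 3 and 5 (and mirrored by the modules of claims 7 to 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the nearest-rank quantile rule, the choice to place a value equal to a boundary in the upper interval, the handling of the case where all ratios are equal, and the sample data are all hypothetical.

```python
from bisect import bisect_right


def quantile_boundaries(values, quantiles):
    """Sort the feature values and read off the value at each quantile."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Nearest-rank lookup; the exact quantile rule is an assumption.
    return [ordered[min(last, int(q * len(ordered)))] for q in quantiles]


def positive_ratio_encode(values, labels, quantiles):
    """Map each feature value to the positive-sample ratio of its interval."""
    bounds = quantile_boundaries(values, quantiles)
    # N boundaries split the samples into N+1 intervals, indexed 0..N.
    # A value equal to a boundary falls in the upper interval (assumption).
    bins = [bisect_right(bounds, v) for v in values]
    totals = [0] * (len(bounds) + 1)
    positives = [0] * (len(bounds) + 1)
    for b, y in zip(bins, labels):
        totals[b] += 1
        positives[b] += y  # labels are 1 for positive samples, 0 otherwise
    ratios = [p / t if t else 0.0 for p, t in zip(positives, totals)]
    return [ratios[b] for b in bins]


def min_max_normalize(new_values):
    """Apply F_new = (F_old - F_min) / (F_max - F_min) to every value."""
    f_min, f_max = min(new_values), max(new_values)
    span = f_max - f_min
    if span == 0:
        # Case not addressed by the claims; returning zeros is an assumption.
        return [0.0] * len(new_values)
    return [(v - f_min) / span for v in new_values]


# Hypothetical data: 16 user samples, N = 3 quantiles, hence 4 intervals.
values = list(range(1, 17))
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
encoded = positive_ratio_encode(values, labels, [0.25, 0.5, 0.75])
normalized = min_max_normalize(encoded)
```

Here `encoded` holds the per-interval positive-sample ratio as the new feature value of each sample, and `normalized` rescales those ratios to the range 0 to 1 before they would be passed to a linear model as in claims 6 and 10.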
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610091834.7A CN107092919A (en) | 2016-02-18 | 2016-02-18 | A kind of user's sample characteristics optimized treatment method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610091834.7A CN107092919A (en) | 2016-02-18 | 2016-02-18 | A kind of user's sample characteristics optimized treatment method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107092919A true CN107092919A (en) | 2017-08-25 |
Family
ID=59646037
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610091834.7A Pending CN107092919A (en) | 2016-02-18 | 2016-02-18 | A kind of user's sample characteristics optimized treatment method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107092919A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796381A (en) * | 2019-10-31 | 2020-02-14 | 深圳前海微众银行股份有限公司 | Method and device for processing evaluation indexes of modeling data, terminal equipment and medium |
US10990500B2 (en) | 2018-05-18 | 2021-04-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for user analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108520460A (en) | Business datum calculates processing method, device, computer equipment and storage medium | |
CN108460681A (en) | A kind of risk management and control method and device | |
CN106846041A (en) | The distribution method and system of reward voucher | |
CN110110335A (en) | A kind of name entity recognition method based on Overlay model | |
CN107506350A (en) | A kind of method and apparatus of identification information | |
CN112214652B (en) | Message generation method, device and equipment | |
CN110135701A (en) | Control automatic generation method, device, electronic equipment and the readable medium of rule | |
Baba et al. | Predicting regime switches in the VIX index with macroeconomic variables | |
CN110097450A (en) | Vehicle borrows methods of risk assessment, device, equipment and storage medium | |
CN107291733A (en) | A kind of method and device for rule matching | |
CN108090831A (en) | Credit Risk Assessment method, application server and computer readable storage medium | |
CN107545038A (en) | A kind of file classification method and equipment | |
CN108596765A (en) | A kind of Electronic Finance resource recommendation method and device | |
CN112966189A (en) | Fund product recommendation system | |
CN111882426A (en) | Business risk classifier training method, device, equipment and storage medium | |
CN109446391A (en) | User's reading behavior analysis method, electronic device, computer readable storage medium | |
CN110119353A (en) | Test data generating method, device and controller and medium | |
CN107092919A (en) | A kind of user's sample characteristics optimized treatment method and device | |
CN111179055A (en) | Credit limit adjusting method and device and electronic equipment | |
US11810026B2 (en) | Predictive data analysis using value-based predictive inputs | |
CN106897282A (en) | The sorting technique and equipment of a kind of customer group | |
CN104077288B (en) | Web page contents recommend method and web page contents recommendation apparatus | |
CN113869700A (en) | Performance index prediction method and device, electronic equipment and storage medium | |
CN107945034A (en) | Financial analysis method, application server and computer-readable recording medium based on microblogging finance and economics event | |
CN107590732A (en) | A kind of business datum calculation method and its equipment, terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170825 |