CN107092919A - Method and device for optimizing user sample features - Google Patents

Method and device for optimizing user sample features

Info

Publication number
CN107092919A
Authority
CN
China
Prior art keywords
sample
user
feature
interval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610091834.7A
Other languages
Chinese (zh)
Inventor
席炎
张柯
余舟华
漆远
杨军
李澜博
黄�俊
叶伟
郭曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610091834.7A
Publication of CN107092919A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

This application discloses a method for optimizing user sample features, so that the constructed feature values track the trend of the positive-sample concentration more closely. The method includes: determining a feature to be optimized of the user samples in a user sample set, where the user sample set contains positive samples; dividing the user samples in the set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1; for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and taking the ratio calculated for each interval as the new value of the feature for every user sample in that interval. The application also discloses a device for optimizing user sample features.

Description

Method and device for optimizing user sample features
Technical field
The application relates to the field of computer technology, and in particular to a method and device for optimizing user sample features.
Background
With the continuing development of information technology, the era of big data has arrived. Merchants and enterprises can collect massive numbers of user samples through the service platforms they provide. These user samples usually carry many features, such as the amount a user spends shopping online, records of returns and exchanges, the amount spent on financial investment products, and the closeness of the relationship between user A and user B. The features of these user samples are processed and then fed into a model for training, which yields a classification model that can predict the behavior of new users. Once the classification model is obtained, a new user sample is processed and input into it, and the model's computation produces a prediction for that sample, for example whether the user's credit is good or poor.
Processing a feature of a user sample usually means transforming the feature's value into a new value. The method commonly used at present is min-max processing, with the following steps. First, the maximum and minimum of the feature over the user samples are determined. Second, the value of the feature for each user sample is processed with the min-max method, which maps the new range of the feature to between 0 and 1.
When user sample features are processed with this min-max method, the new feature values often fail to track the trend of the positive-sample concentration. As a result, the model may be unable to learn the linear pattern of the feature well during training, which weakens learning and lowers the model's prediction accuracy.
Summary of the invention
In view of the above technical problem, the embodiments of the present application provide a method and device for optimizing user sample features, so that the constructed feature values track the trend of the positive-sample concentration more closely.
The embodiments of the present application adopt the following technical solutions:
A method for optimizing user sample features, including: determining a feature to be optimized of the user samples in a user sample set, where the user sample set contains positive samples; dividing the user samples in the set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, where N is a positive integer greater than 1; for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and taking the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Preferably, after the ratio calculated for each interval is taken as the new value of the feature for every user sample in that interval, the method further includes: normalizing the new values of the feature of the user samples.
Preferably, normalizing the new values of the feature of the user samples specifically includes: determining the maximum and minimum among the new values of the feature; and processing each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the number after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are the maximum and minimum among the new values of the feature, respectively.
Preferably, before the ratio calculated for each interval is taken as the new value of the feature for every user sample in that interval, the method further includes: selecting the features for which the ratios in the intervals and the preset values determined by the predetermined quantiles do not satisfy a linear relationship.
Preferably, dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature specifically includes: sorting the user samples by their values of the feature; and dividing the user samples in the set into N+1 intervals, using the values at the N quantiles as boundaries.
Preferably, after the new values of the feature of the user samples are normalized, the method further includes: inputting the processed user samples into a linear model for training.
A device for optimizing user sample features, including a feature determination module, an interval division module, a ratio calculation module, and a feature value determination module, where: the feature determination module is configured to determine a feature to be optimized of the user samples in a user sample set, the user sample set containing positive samples; the interval division module is configured to divide the user samples in the set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1; the ratio calculation module is configured to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval; and the feature value determination module is configured to take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Preferably, the device further includes a normalization module, where: the normalization module is configured to normalize the new values of the feature of the user samples.
Preferably, the normalization module specifically includes a determination subunit and a processing subunit, where: the determination subunit is configured to determine the maximum and minimum among the new values of the feature; and the processing subunit is configured to process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the number after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are the maximum and minimum among the new values of the feature, respectively.
Preferably, the device further includes a model training module, where: the model training module is configured to input the processed user samples into a linear model for training.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects. After the user samples in the user sample set are divided into intervals, the ratio of the number of positive samples in each interval to the total number of user samples in that interval is calculated, and the calculated ratio is used as the new value of the feature for every user sample in the interval. The new feature values can therefore track the rising or falling trend of the positive-sample concentration, so that the model can ultimately be trained adequately on this feature. In addition, because the calculated ratio is used as the new feature value, the positive-sample concentration and the new feature value satisfy a linear relationship, which solves the problem of the positive-sample concentration and the feature value failing to satisfy a linear relationship.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of the method for optimizing user sample features provided by an embodiment of the present application;
Fig. 2 is a schematic diagram in which the ratios in the intervals and the preset values determined by the predetermined quantiles satisfy a linear relationship, as provided by an embodiment of the present application;
Fig. 3 is a schematic diagram in which the ratios in the intervals and the preset values determined by the predetermined quantiles do not satisfy a linear relationship, as provided by an embodiment of the present application;
Fig. 4 is a structural block diagram of the device for optimizing user sample features provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solutions, and advantages of the application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application without creative work fall within the scope of protection of the application.
Fig. 1 is a schematic flowchart of the method for optimizing user sample features provided by an embodiment of the present application. The method mainly includes the following steps:
Step 11: Determine the feature to be optimized of the user samples in the user sample set.
A user sample is usually collected historical data related to a user, such as basic user information, bank account information, social information, hobbies and preferences, usage information, payment information, and debt information. This historical data comprises multiple features, which are usually expressed as numerical values. Taking a personal credit-reference business as an example, the features in a user sample's historical data include the specific amount the user spends shopping online, the specific number of returns and exchanges, the specific amount spent on financial investment products, credit card repayment behavior over a certain period, the amount owed on credit cards, and so on. A user sample thus involves many features, so one or several features must first be selected from the user samples for analysis. If several features are selected at once, each of them can be processed separately through steps 12, 13, and 14 below.
The user samples generally belong to a user sample set. The number of user samples in the set can be chosen according to actual needs, for example from several thousand to tens of thousands. In general, more user samples reflect actual conditions more accurately, but constructing features and training the model then take more computation and processing time than with fewer samples.
In addition, the user samples here include positive samples. A positive sample may be a labeled sample, that is, a sample whose class has been identified by manual labeling or by computer recognition. For example, the user samples with poor credit are labeled and then called positive samples; correspondingly, the samples with good credit may be negative samples.
Step 12: Divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature.
The features of user samples are usually expressed as numerical values, and the number of user samples is large, so the user samples in the set can be divided into N+1 intervals at N predetermined quantiles according to each sample's value of the selected feature, where N is a positive integer. For example, suppose the feature to be optimized is the spending amount. To divide the intervals, the spending amounts of all user samples are first sorted in ascending order, giving a sequence such as (100, 202, ..., 25000, 30000). Note that each sorted value must remain associated with its original user sample. After sorting, the user samples in the set are divided into N+1 intervals at the N predetermined quantiles according to the size of each sample's spending amount.
As an example of quantiles: with 100 samples to be divided into 10 intervals, the values at positions 10, 20, ..., 90 of the sorted sequence can serve as the quantiles.
In addition, the N+1 intervals may each contain the same number of user samples. For example, with 100 user samples there are 100 values sorted from smallest to largest, and the predetermined N quantiles can divide them into 10 intervals with 10 user samples in each. Besides equal sizes, the user samples may also be distributed among the intervals in some other proportion; this embodiment places no limit on this.
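The interval division of step 12 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `split_by_quantiles` and the sample spending amounts are hypothetical, and the sketch assumes the equal-size case in which the quantiles are taken at evenly spaced positions of the sorted sequence.

```python
def split_by_quantiles(values, n_quantiles):
    """Sort values and split them into n_quantiles + 1 intervals of near-equal size.

    Each interval is returned as a list of original indices, so every sorted
    value stays associated with its user sample.
    """
    if n_quantiles < 1:
        raise ValueError("n_quantiles must be at least 1")
    # Indices of the samples, ordered by ascending feature value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_intervals = n_quantiles + 1
    intervals = []
    for k in range(n_intervals):
        start = k * len(order) // n_intervals
        end = (k + 1) * len(order) // n_intervals
        intervals.append(order[start:end])
    return intervals


# Hypothetical spending amounts for 10 user samples, cut at 4 quantiles.
amounts = [300, 100, 25000, 202, 30000, 4500, 800, 1200, 950, 60]
intervals = split_by_quantiles(amounts, n_quantiles=4)
sizes = [len(members) for members in intervals]
```

Because the split is made by rank, samples with identical feature values can fall into neighbouring intervals; splitting on the boundary values themselves would be an equally valid reading of the text.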
Step 13: For each of the N+1 intervals, calculate the ratio of the number of positive samples in the interval to the total number of user samples in the interval.
As described above, the user samples include labeled positive samples. After step 12 has divided the user samples into N+1 intervals, the ratio of the number of positive samples to the total number of user samples can be calculated for each interval. For example, if an interval contains 10 user samples in total and 3 of them are positive samples, the ratio of positive samples to all user samples in that interval is 0.3.
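The ratio of step 13 reduces to a count divided by the interval size. The sketch below reproduces the example of 3 positive samples among 10; the label encoding (1 for positive, 0 for negative, `None` for unknown class) and the name `positive_ratio` are assumptions made for illustration only.

```python
def positive_ratio(labels):
    """Share of positive samples (label 1) among all samples of one interval."""
    if not labels:
        raise ValueError("interval is empty")
    # Samples with any other label (negative or unknown) count only in the denominator.
    return sum(1 for y in labels if y == 1) / len(labels)


# One interval with 10 user samples, 3 of which are positive.
interval_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
ratio = positive_ratio(interval_labels)
```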
Note that, besides the positive samples, the other samples counted in the total number of user samples may all be negative samples, may be partly samples of unknown class and partly negative samples, or may all be samples of unknown class. For example, suppose user credit is divided into good credit and poor credit, with poor credit as the positive class and good credit as the negative class. Then the samples other than the poor-credit samples may all be good-credit negative samples, may be partly samples of unknown class and partly good-credit negative samples, or may all be samples of unknown credit class.
Step 14: Take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
Step 13 calculates, for each of the N+1 intervals, the ratio of the number of positive samples to the total number of user samples in the interval, which gives N+1 ratios. In this step, for each of the N+1 intervals, the ratio calculated for the interval is taken as the new value of the feature for every user sample in that interval. Continuing the earlier example, an interval contains 10 user samples in total and the calculated ratio of positive samples is 0.3, so the new value of the feature for all 10 user samples in that interval is set to 0.3. This step is performed for every interval.
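Steps 12 to 14 can be combined into one short function, sketched below under stated assumptions: intervals are equal-size rank-based splits, labels use 1 for positive samples, and the name `encode_feature` and the toy data are hypothetical.

```python
def encode_feature(values, labels, n_quantiles):
    """Replace each feature value by the positive-sample ratio of its interval."""
    # Step 12: sort sample indices by feature value and cut into N + 1 intervals.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_intervals = n_quantiles + 1
    new_values = [0.0] * len(values)
    for k in range(n_intervals):
        start = k * len(order) // n_intervals
        end = (k + 1) * len(order) // n_intervals
        members = order[start:end]
        if not members:
            continue  # can only happen with fewer samples than intervals
        # Step 13: positive samples divided by all samples in the interval.
        ratio = sum(1 for i in members if labels[i] == 1) / len(members)
        # Step 14: every sample in the interval takes the ratio as its new value.
        for i in members:
            new_values[i] = ratio
    return new_values


values = [10, 20, 30, 40, 50, 60]
labels = [0, 0, 0, 1, 1, 1]
new_values = encode_feature(values, labels, n_quantiles=2)
# new_values is [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

In this toy data the positive samples cluster at high feature values, so the new values rise from one interval to the next.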
With the above technical solution, after the user samples in the user sample set are divided into intervals, the ratio of the number of positive samples in each interval to the total number of user samples in that interval is calculated, and the calculated ratio is used as the new value of the feature for every user sample in the interval. The new feature values can therefore track the rising or falling trend of the positive-sample concentration as closely as possible, so that the model can ultimately be trained adequately on this feature and its fit is improved. In addition, because the calculated ratio is used as the new feature value, the positive-sample concentration and the new feature value satisfy a linear relationship, which solves the problem of the positive-sample concentration and the feature value failing to satisfy a linear relationship.
After step 14 of the above embodiment, the method may further include the following step: normalizing the new values of the feature of the user samples. Normalization may use a linear-function method such as the max-min method, the mean method, or the median method. The max-min method normalizes feature values into the range [0, 1]; the mean method can normalize feature values into an arbitrary range, but the signs of the maximum and minimum cannot change at the same time; the median method normalizes feature values into the range [-1, 1]. Other normalization algorithms may of course also be used.
Normalization confines the final values to the required range, mainly to make subsequent data processing easier and to speed up convergence during model training. Specifically, the following normalization method may be used:
First step: determine the maximum and minimum among the new values of the feature;
Second step: process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the number after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are the maximum and minimum among the new values of the feature, respectively.
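The two normalization steps can be sketched as follows, assuming the processing referred to above is the usual min-max formula F_new = (F_old - F_min) / (F_max - F_min), which is consistent with the stated [0, 1] range of the max-min method. The function name `min_max_normalize` is hypothetical, and returning 0.0 when all new values are equal is an assumption, since the text does not cover that case.

```python
def min_max_normalize(new_values):
    """Map the new feature values linearly onto the range [0, 1]."""
    # First step: find the maximum and minimum among the new values.
    f_min, f_max = min(new_values), max(new_values)
    if f_max == f_min:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in new_values]
    # Second step: rescale every value with the min-max formula.
    return [(f_old - f_min) / (f_max - f_min) for f_old in new_values]


# Ratios assigned in step 14 for three intervals.
normalized = min_max_normalize([0.1, 0.3, 0.5])
```

The smallest ratio maps to 0, the largest to 1, and the ordering of the intervals is preserved.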
As described above, user samples usually include many features. After the new values of all the features have been normalized, the method embodiment may further include the following step: inputting the processed user samples into a linear model for training. Specifically, some of the features can be selected, processed through steps 11, 12, 13, and 14 of the above embodiment, normalized, and finally input into the linear model for training. The model is then trained adequately on the selected features, its fit is improved, and its prediction performance ultimately improves.
Before step 14 of the above embodiment, the method embodiment may further include the following step: selecting the features for which the ratios in the intervals and the preset values determined by the predetermined quantiles do not satisfy a linear relationship. The linear relationship here mainly refers to whether the ratios in the intervals increase monotonically, or decrease monotonically, as the preset values determined by the quantiles increase. The preset value determined by the N quantiles may be the interval delimited by the quantiles, the specific value at a quantile, or a number in a fixed proportion to a quantile.
Specifically, to select the features that do not satisfy the linear relationship, it can first be judged whether the ratios in the intervals and the preset values determined by the N predetermined quantiles satisfy a linear relationship. The judgment can draw on the results of steps 11, 12, and 13. For convenience, a Bi-Var curve of the new feature can be plotted from these results, and whether the curve is monotonic indicates whether the ratios in the intervals and the N predetermined quantiles satisfy a linear relationship.
Fig. 2 and Fig. 3 are Bi-Var curves of the new feature plotted from the results of steps 11, 12, and 13. The abscissa of each curve is the intervals delimited by the preset values determined by the quantiles, and the ordinate is the ratio of the number of positive samples to the total number of user samples in each interval. The figures are only an aid to understanding; in practice the intervals on the abscissa may be continuous or discrete. The Bi-Var curve in Fig. 2 satisfies the linear relationship, that is, the ordinate increases monotonically with the abscissa, whereas the Bi-Var curve in Fig. 3 does not. Therefore, before step 14 is performed, the features for which the judgment result is negative can be selected; that is, step 14 is applied only to the features that do not satisfy the linear relationship.
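The monotonicity judgment can also be made directly on the list of per-interval ratios instead of by reading a plotted curve. The sketch below is one possible reading: it treats non-strict monotonicity as satisfying the linear relationship, and the function names and feature names are hypothetical.

```python
def is_monotonic(ratios):
    """True if the per-interval ratios never decrease or never increase."""
    pairs = list(zip(ratios, ratios[1:]))
    increasing = all(a <= b for a, b in pairs)
    decreasing = all(a >= b for a, b in pairs)
    return increasing or decreasing


def features_to_optimize(ratios_by_feature):
    """Names of the features whose ratios are not monotonic across the intervals."""
    return [name for name, ratios in ratios_by_feature.items()
            if not is_monotonic(ratios)]


# Hypothetical per-interval ratios, ordered by ascending interval.
ratios_by_feature = {
    "spending_amount": [0.1, 0.2, 0.3, 0.5],  # monotonic, like Fig. 2
    "return_count": [0.1, 0.4, 0.2, 0.3],     # not monotonic, like Fig. 3
}
selected = features_to_optimize(ratios_by_feature)
```

Only `return_count` is selected here, so only that feature would go on to step 14.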
In step 12 of the above embodiment, dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature may specifically use the following method. First step: sort the user samples by their values of the feature. Second step: divide the user samples in the set into N+1 intervals, using the values at the N quantiles as boundaries. For example, with 100 user samples, the 9 values at sorted positions 10, 20, ..., 90 can serve as the predetermined quantiles, and the user samples are divided into 10 intervals with these values as boundaries. Alternatively, the 4 values at sorted positions 20, 40, ..., 80 can serve as the quantiles, and the user samples are divided into 5 intervals with these values as boundaries. In addition, any two of the N+1 intervals may contain the same number of user samples.
The embodiments above are all method embodiments of the application. Correspondingly, the application also provides an embodiment of a device for optimizing user sample features, so that the constructed feature values track the trend of the positive-sample concentration more closely. As shown in Fig. 4, the device includes a feature determination module 21, an interval division module 22, a ratio calculation module 23, and a feature value determination module 24, where:
The feature determination module 21 may be configured to determine the feature to be optimized of the user samples in the user sample set, the user sample set containing positive samples;
The interval division module 22 may be configured to divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1;
The ratio calculation module 23 may be configured to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
The feature value determination module 24 may be configured to take the ratio calculated for each interval as the new value of the feature for every user sample in that interval.
When this device embodiment operates, the feature determination module first determines the feature to be optimized; the interval division module then divides the user samples in the user sample set into N+1 intervals; the ratio calculation module calculates the ratio of the number of positive samples to the total number of user samples in each interval; and finally the feature value determination module takes the ratio calculated for each interval as the new value of the feature for every user sample in that interval. The new feature values can therefore track the rising or falling trend of the positive-sample concentration, so that the model can ultimately be trained adequately on this feature and its fit is improved. In addition, because the calculated ratio is used as the new feature value, the positive-sample concentration and the new feature value satisfy a linear relationship, which solves the problem of the positive-sample concentration and the feature value failing to satisfy a linear relationship.
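The cooperation of the four modules can be sketched as a single class whose methods mirror modules 21 to 24. This is only an illustrative arrangement: the class name `FeatureOptimizer`, its method names, the dictionary-based sample format, and the `label` key are all hypothetical, and intervals are again assumed to be equal-size rank-based splits.

```python
class FeatureOptimizer:
    """Hypothetical class whose methods mirror modules 21 to 24."""

    def __init__(self, n_quantiles):
        self.n_quantiles = n_quantiles

    def determine_feature(self, samples, feature_name):
        # Module 21: pick out the values of the feature to be optimized.
        return [sample[feature_name] for sample in samples]

    def divide_intervals(self, values):
        # Module 22: sort by value, then cut into N + 1 near-equal intervals.
        order = sorted(range(len(values)), key=lambda i: values[i])
        n = self.n_quantiles + 1
        return [order[k * len(order) // n:(k + 1) * len(order) // n]
                for k in range(n)]

    def compute_ratios(self, intervals, labels):
        # Module 23: positive share per interval (0.0 for an empty interval).
        return [sum(1 for i in members if labels[i] == 1) / len(members)
                if members else 0.0
                for members in intervals]

    def assign_new_values(self, intervals, ratios, n_samples):
        # Module 24: every sample takes the ratio of its interval.
        new_values = [0.0] * n_samples
        for members, ratio in zip(intervals, ratios):
            for i in members:
                new_values[i] = ratio
        return new_values

    def run(self, samples, feature_name, label_name="label"):
        values = self.determine_feature(samples, feature_name)
        labels = [sample[label_name] for sample in samples]
        intervals = self.divide_intervals(values)
        ratios = self.compute_ratios(intervals, labels)
        return self.assign_new_values(intervals, ratios, len(samples))


samples = [
    {"amount": 5, "label": 1},
    {"amount": 1, "label": 0},
    {"amount": 9, "label": 1},
    {"amount": 7, "label": 1},
]
new_values = FeatureOptimizer(n_quantiles=1).run(samples, "amount")
# new_values is [0.5, 0.5, 1.0, 1.0]
```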
The above device embodiment may further include a normalization module, where: the normalization module may be configured to normalize the new values of the feature of the user samples.
The normalization module specifically includes a determination subunit and a processing subunit, where:
The determination subunit may be configured to determine the maximum and minimum among the new values of the feature;
The processing subunit may be configured to process each new value of the feature as follows, taking the processed number as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the number after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are the maximum and minimum among the new values of the feature, respectively.
In addition, the above device may further include a model training module, where: the model training module may be configured to input the normalized user samples into a linear model for training.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. The application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, which implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A method for optimizing user sample features, characterized by comprising:
determining a feature to be optimized of the user samples in a user sample set, the user sample set including positive samples;
dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1;
for each of the N+1 intervals, calculating the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
determining the ratio calculated for each interval as the new value of the feature for each user sample in that interval.
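The steps of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the function names, the nearest-rank quantile rule, and the convention that a sample equal to a boundary value falls into the upper interval are all hypothetical choices that the claim does not specify.

```python
from bisect import bisect_right


def quantile_boundaries(values, quantiles):
    """Return the feature value found at each quantile.

    Assumption: nearest-rank lookup on the sorted values; the claim does not
    say how the value at a quantile is computed.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[min(last, int(q * len(ordered)))] for q in sorted(quantiles)]


def positive_rate_encode(values, labels, quantiles):
    """Replace each feature value with the positive-sample ratio of its interval.

    values    -- original feature value of each user sample
    labels    -- 1 for a positive sample, 0 otherwise
    quantiles -- N quantiles in (0, 1), giving N+1 intervals
    """
    boundaries = quantile_boundaries(values, quantiles)
    n_intervals = len(boundaries) + 1
    # Assumption: a value equal to a boundary belongs to the upper interval.
    interval_of = [bisect_right(boundaries, v) for v in values]

    totals = [0] * n_intervals
    positives = [0] * n_intervals
    for idx, label in zip(interval_of, labels):
        totals[idx] += 1
        positives[idx] += label

    # An empty interval gets ratio 0.0 (assumption; the claim is silent here).
    ratios = [p / t if t else 0.0 for p, t in zip(positives, totals)]
    new_values = [ratios[idx] for idx in interval_of]
    return new_values, ratios


values = [40, 10, 80, 30, 60, 20, 70, 50]
labels = [1, 0, 1, 0, 0, 0, 1, 1]
new_values, ratios = positive_rate_encode(values, labels, [0.25, 0.5, 0.75])
print(ratios)      # [0.0, 0.5, 0.5, 1.0]
print(new_values)  # [0.5, 0.0, 1.0, 0.5, 0.5, 0.0, 1.0, 0.5]
```

With N = 3 quantiles the eight samples fall into four intervals of two samples each, and every sample takes the positive-sample ratio of its own interval as its new feature value.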
2. The method according to claim 1, characterized in that, after the ratio calculated for each interval is determined as the new value of the feature for each user sample in that interval, the method further comprises:
normalizing the new values of the feature of the user samples.
3. The method according to claim 2, characterized in that normalizing the new values of the feature of the user samples specifically comprises:
determining the maximum and minimum among the new values of the feature;
processing each new value of the feature according to the following formula, and using the processed value as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the value after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are respectively the maximum and minimum among the new values of the feature.
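The normalization of claim 3 is ordinary min-max scaling. A minimal Python sketch follows; the function name is hypothetical, and the handling of the degenerate case where all new values are equal (so that the denominator would be zero) is an assumption, since the claim does not address it.

```python
def min_max_normalize(new_values):
    """Scale the new feature values into [0, 1].

    Applies F_new = (F_old - F_min) / (F_max - F_min) to every value.
    """
    f_min, f_max = min(new_values), max(new_values)
    span = f_max - f_min
    if span == 0:
        # Assumption: when every value is identical, map all of them to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in new_values]
    return [(f_old - f_min) / span for f_old in new_values]


print(min_max_normalize([0.25, 0.5, 0.75]))  # [0.0, 0.5, 1.0]
```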
4. The method according to claim 1, characterized in that, before the ratio calculated for each interval is determined as the new value of the feature for each user sample in that interval, the method further comprises:
selecting the features for which the ratios in the intervals and the values determined by the predetermined quantiles do not satisfy a preset linear relationship.
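Claim 4 does not state how the preset linear relationship is tested. One possible reading, sketched below purely as an assumption, is to compute the Pearson correlation between a representative feature value for each interval and the positive-sample ratio of that interval, and to select the feature for optimization when the absolute correlation falls below a threshold. The function names, the `interval_values` input, and the threshold of 0.9 are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no linear trend
    return cov / sqrt(var_x * var_y)


def fails_linear_check(interval_values, ratios, threshold=0.9):
    """Return True when the feature should be selected for optimization.

    interval_values -- one representative feature value per interval
                       (hypothetical input, e.g. derived from the quantile values)
    ratios          -- positive-sample ratio of each interval
    threshold       -- assumed cut-off on |correlation|
    """
    return abs(pearson(interval_values, ratios)) < threshold


# Ratios that rise steadily with the feature value pass the check,
# so the feature is not selected.
print(fails_linear_check([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]))
# Ratios that rise and then fall do not track the feature value linearly,
# so the feature is selected.
print(fails_linear_check([1, 2, 3, 4], [0.1, 0.8, 0.7, 0.1]))
```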
5. The method according to claim 1, characterized in that dividing the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature specifically comprises:
sorting the user samples according to each user sample's value of the feature;
dividing the user samples in the user sample set into N+1 intervals, using the values corresponding to the N quantiles as boundaries.
6. The method according to any one of claims 2 to 3, characterized in that, after the new values of the feature of the user samples are normalized, the method further comprises inputting the processed user samples into a linear model for training.
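Claim 6 does not name a particular linear model. As a stand-in, the sketch below fits a one-feature least-squares line to the processed feature values and the positive/negative labels; the function name and the choice of ordinary least squares are assumptions made for illustration only.

```python
def fit_line(features, labels):
    """Fit label ~ w * feature + b by ordinary least squares (one feature)."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    if var_x == 0:
        raise ValueError("feature is constant; slope is undefined")
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels)) / var_x
    b = mean_y - w * mean_x
    return w, b


# Processed feature values (per-interval positive ratios already in [0, 1])
# and the corresponding labels; both are made-up illustration data.
features = [0.5, 0.0, 1.0, 0.5, 0.5, 0.0, 1.0, 0.5]
labels = [1, 0, 1, 0, 0, 0, 1, 1]
w, b = fit_line(features, labels)
print(w, b)
```

Because each processed value is the share of positive samples in its interval, the fitted line on this toy data has a slope close to 1 and an intercept close to 0, which illustrates why the transformed feature suits a linear model.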
7. A device for optimizing user sample features, characterized by comprising a feature determination module, an interval division module, a ratio calculation module, and a feature value determination module, wherein:
the feature determination module is configured to determine a feature to be optimized of the user samples in a user sample set, the user sample set including positive samples;
the interval division module is configured to divide the user samples in the user sample set into N+1 intervals at N predetermined quantiles according to each user sample's value of the feature, N being a positive integer greater than 1;
the ratio calculation module is configured to calculate, for each of the N+1 intervals, the ratio of the number of positive samples in the interval to the total number of user samples in the interval;
the feature value determination module is configured to determine the ratio calculated for each interval as the new value of the feature for each user sample in that interval.
8. The device according to claim 7, characterized in that the device further comprises a normalization module, wherein:
the normalization module is configured to normalize the new values of the feature of the user samples.
9. The device according to claim 8, characterized in that the normalization module specifically comprises a determination subunit and a processing subunit, wherein:
the determination subunit is configured to determine the maximum and minimum among the new values of the feature;
the processing subunit is configured to process each new value of the feature according to the following formula, and to use the processed value as the value of the feature:
$$F_{new} = \frac{F_{old} - F_{min}}{F_{max} - F_{min}}$$
where $F_{new}$ is the value after processing, $F_{old}$ is the new value of the feature before processing, and $F_{max}$ and $F_{min}$ are respectively the maximum and minimum among the new values of the feature.
10. The device according to any one of claims 8 to 9, characterized in that the device further comprises a model training module, wherein the model training module is configured to input the processed user samples into a linear model for training.
CN201610091834.7A 2016-02-18 2016-02-18 A kind of user's sample characteristics optimized treatment method and device Pending CN107092919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610091834.7A CN107092919A (en) 2016-02-18 2016-02-18 A kind of user's sample characteristics optimized treatment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610091834.7A CN107092919A (en) 2016-02-18 2016-02-18 A kind of user's sample characteristics optimized treatment method and device

Publications (1)

Publication Number Publication Date
CN107092919A true CN107092919A (en) 2017-08-25

Family

ID=59646037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610091834.7A Pending CN107092919A (en) 2016-02-18 2016-02-18 A kind of user's sample characteristics optimized treatment method and device

Country Status (1)

Country Link
CN (1) CN107092919A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796381A (en) * 2019-10-31 2020-02-14 深圳前海微众银行股份有限公司 Method and device for processing evaluation indexes of modeling data, terminal equipment and medium
US10990500B2 (en) 2018-05-18 2021-04-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for user analysis

Similar Documents

Publication Publication Date Title
CN108520460A (en) Business datum calculates processing method, device, computer equipment and storage medium
CN108460681A (en) A kind of risk management and control method and device
CN106846041A (en) The distribution method and system of reward voucher
CN110110335A (en) A kind of name entity recognition method based on Overlay model
CN107506350A (en) A kind of method and apparatus of identification information
CN112214652B (en) Message generation method, device and equipment
CN110135701A (en) Control automatic generation method, device, electronic equipment and the readable medium of rule
Baba et al. Predicting regime switches in the VIX index with macroeconomic variables
CN110097450A (en) Vehicle borrows methods of risk assessment, device, equipment and storage medium
CN107291733A (en) A kind of method and device for rule matching
CN108090831A (en) Credit Risk Assessment method, application server and computer readable storage medium
CN107545038A (en) A kind of file classification method and equipment
CN108596765A (en) A kind of Electronic Finance resource recommendation method and device
CN112966189A (en) Fund product recommendation system
CN111882426A (en) Business risk classifier training method, device, equipment and storage medium
CN109446391A (en) User&#39;s reading behavior analysis method, electronic device, computer readable storage medium
CN110119353A (en) Test data generating method, device and controller and medium
CN107092919A (en) A kind of user&#39;s sample characteristics optimized treatment method and device
CN111179055A (en) Credit limit adjusting method and device and electronic equipment
US11810026B2 (en) Predictive data analysis using value-based predictive inputs
CN106897282A (en) The sorting technique and equipment of a kind of customer group
CN104077288B (en) Web page contents recommend method and web page contents recommendation apparatus
CN113869700A (en) Performance index prediction method and device, electronic equipment and storage medium
CN107945034A (en) Financial analysis method, application server and computer-readable recording medium based on microblogging finance and economics event
CN107590732A (en) A kind of business datum calculation method and its equipment, terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170825
