Detailed Description
To make the purposes, technical solutions, and advantages of the application clearer, the technical solutions of the application are described clearly and completely below in conjunction with specific embodiments of the application and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the application, without creative effort, shall fall within the scope of protection of the application.
It should be noted that the data imbalance problem arises in the course of classifying data: after classification, the numbers of samples contained in the different data sets differ greatly. The method described in the embodiments of this specification applies not only to the imbalance between the numbers of black and white samples in risk control scenarios, but equally to data imbalance problems in other binary classification scenarios. The following description focuses on the black and white sample imbalance problem in risk control scenarios.
Generally, in one or more embodiments of this specification, training samples may be determined from business events that have already occurred or from existing business entities. Specifically, a black sample may be regarded as a high-risk business event or business entity, and a white sample as a normal business event or business entity. For example, a fraudulent transaction may be regarded as a black sample and a normal transaction as a white sample; as another example, a risky account may be regarded as a black sample and a normal account as a white sample.
Here, a historical business event may be regarded as a business operation that has been executed and has produced a corresponding result, such as paying, placing an order, transferring funds, drawing a lottery, or voting. The business result mentioned here may include success, failure, restriction of permissions, and so on. Whether a business event is a white sample or a black sample can then be determined according to the business result corresponding to that event.
A business entity may be regarded as the subject that initiates a business operation. In the embodiments of this specification, business entities may include, but are not limited to, a user's account, the user, a terminal, a server, and so on. Further, users here include, but are not limited to, individual users, enterprise users, merchants, service providers, and so on.
Of course, in practical applications, black samples and white samples can be defined according to the needs of the actual application; this should not be construed as a limitation on the application.
The risk control model processing method based on data imbalance described in the embodiments of this specification may use the architecture shown in Fig. 1.
In Fig. 1, the processing device can obtain sample data (including business events that have occurred, or identifiers of certain business entities). The black samples and white samples in the sample data may be determined in advance by a corresponding identification model or by manual labeling. The processing device can then build and deploy a model based on the black and white samples.
In general, the processing device may be regarded as a business device that provides a business service, such as a transaction server that provides transaction services, a server that provides an order-placing function, or a server running a lottery algorithm. Of course, in practical applications the processing device is not limited to servers; it may also be a device such as a mobile phone, a tablet computer, or a computer.
Where the processing device is a server, it may use an architecture such as a server cluster, distributed servers, or a single server. Which architecture is used depends on the needs of the actual application and is not specifically limited here.
In addition to the processing device shown in Fig. 1, the executing subject of the risk control model processing method based on data imbalance described in the embodiments of this specification may also be a non-hardware subject such as an application program or a service. Likewise, this is determined according to the needs of the actual application and should not be construed as a limitation on the application.
It should be noted that the predetermined black and white samples are usually imbalanced, with white samples typically in the great majority. The method in the embodiments of this specification therefore needs to be performed.
The technical solutions in the embodiments of this specification are described in detail below.
The embodiments of this specification provide a risk control model processing method based on imbalanced data which, as shown in Fig. 2, may specifically include the following steps:
Step S201: obtain sample data to be processed that contains imbalanced samples.
Based on the foregoing description, the sample data to be processed may come from a corresponding business device, such as a business database, or a server or terminal participating in the business; it may also come from business occurring in real time. This generally depends on the specific business, and the acquisition of the sample data to be processed is not described in further detail here.
It can be understood that, in the obtained sample data to be processed, the black sample data and the white sample data have already been determined, and the black and white sample data are imbalanced. The subsequent steps are performed in order to weaken or eliminate this imbalance as far as possible.
Step S203: divide the sample data to be processed to obtain multiple sample data sets to be processed.
In the embodiments of this specification, the division is usually applied to the more numerous class of sample data, while the less numerous class usually does not need to be divided. After the division, a corresponding number of sample data sets to be processed is obtained.
It should be noted that, in practical applications, the number of divisions can be set according to the needs of the actual application and is not specifically limited here.
Step S205: construct scoring models from the multiple sample data sets to be processed obtained by the division, and score the samples to be processed.
Here, the score characterizes the correlation between the imbalanced data.
As described above, in the embodiments of this specification the sample data to be processed generally contains two classes of sample data (for example, black and white samples). Within this sample data, and especially within the more numerous class, individual samples differ in how strongly they correlate with the less numerous class.
For example, where historical transactions are used as the sample data to be processed, the risky transactions among them can be regarded as black samples, and the other transactions not judged to be risky can be regarded as white samples. Some of the transactions among the white samples may actually be risky transactions that were classified as white samples only because they were not identified. These unidentified risky transactions nevertheless share certain characteristics with the black samples (that is, they are correlated). A corresponding scoring model can therefore be constructed to quantify this correlation.
It should be noted here that, in the embodiments of this specification, the above scoring models are not deployed or released; they are used only to score the sample data to be processed.
Step S207: sample the samples to be processed according to the scores, and construct and deploy a risk control model based on the sampling result.
After each piece of sample data to be processed has been scored, sampling can be performed based on the score of each piece, so as to construct the corresponding risk control model.
In line with the foregoing, only one model is ultimately deployed and released, namely the risk control model described in step S207.
Through the above steps, for sample data to be processed that exhibits a data imbalance problem, the more numerous class of sample data can be divided to obtain multiple data sets. On this basis, each data set can be combined with the undivided sample data to construct a corresponding number of scoring models, and each scoring model can then be used to score the divided sample data to be processed. The scores reflect the correlation between the divided sample data and the undivided sample data, so samples can be drawn based on the scores; the sampling is, of course, applied to the divided sample data. Finally, a risk control model can be constructed from the sampled sample data and the undivided sample data, and deployed.
With the above method in the embodiments of this specification, the scoring-based approach makes it possible to select more effectively, from the more numerous sample data, the samples that are more highly correlated with the less numerous sample data. The samples selected in this way can optimize the risk model and eliminate the data imbalance between the sample data. Moreover, the multiple scoring models are not deployed in this process; only the risk model is ultimately deployed, which reduces the cost of model deployment.
The above is now illustrated using black and white samples in a practical application scenario as an example.
In practical applications, the number of white samples is usually much greater than the number of black samples. In the embodiments of this specification, therefore, the white samples can be divided. That is, the process of dividing the sample data to be processed to obtain multiple sample data sets to be processed may be: dividing the white samples according to a set number of divisions to obtain the set number of white sample sets.
The numbers of white samples contained in the white sample sets may be the same or different; this is not specifically limited here.
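The division step can be sketched as follows. This is only an illustrative sketch: the function name split_white_samples, the shuffling, and the round-robin assignment are assumptions made for the example, since the text does not prescribe how the white samples are assigned to sets.

```python
import random


def split_white_samples(white_samples, num_sets, seed=0):
    """Divide the white samples into num_sets subsets (hypothetical helper).

    The samples are shuffled and then dealt round-robin, so subset sizes
    differ by at most one. The text allows equal or unequal sizes.
    """
    if num_sets < 1:
        raise ValueError("num_sets must be at least 1")
    shuffled = list(white_samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_sets] for i in range(num_sets)]


# Ten placeholder white samples divided into three sets.
white = list(range(10))
subsets = split_white_samples(white, num_sets=3)
```

Every white sample lands in exactly one subset, which keeps the later per-subset scoring models independent of one another.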
After the multiple white sample sets have been obtained, a scoring model can be constructed from each white sample set together with the full set of black samples. It should be understood that the number of scoring models obtained is equal to the number of white sample sets.
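A minimal sketch of this construction is given below. The text does not say what kind of model a scoring model is, so a toy centroid-distance scorer is used here purely as a stand-in; the names centroid, train_scoring_model, and build_scoring_models are hypothetical.

```python
import math


def centroid(rows):
    """Coordinate-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_scoring_model(white_subset, black_samples):
    """Toy stand-in for a scoring model (an assumption, not the source's model).

    Returns a function giving a score in [0, 1]; a higher score means the
    sample lies closer to the black centroid than to this subset's white
    centroid.
    """
    wc = centroid(white_subset)
    bc = centroid(black_samples)

    def score(x):
        dw = math.dist(x, wc)
        db = math.dist(x, bc)
        return 0.5 if dw + db == 0 else dw / (dw + db)

    return score


def build_scoring_models(white_subsets, black_samples):
    """One scoring model per white subset, each trained with all black samples."""
    return [train_scoring_model(s, black_samples) for s in white_subsets]


white_subsets = [[(0.0, 0.0), (0.0, 2.0)], [(2.0, 0.0), (2.0, 2.0)]]
black_samples = [(10.0, 10.0), (12.0, 10.0)]
models = build_scoring_models(white_subsets, black_samples)
```

Any binary classifier that outputs a probability or score could take the place of the toy scorer; the point illustrated is only that each model sees one white subset plus the full set of black samples.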
The scoring models can then be used to score the sample data. In the embodiments of this specification, each scoring model is used to score every piece of white sample data; that is, if there are M scoring models, each piece of white sample data has M scores after scoring. The process of scoring the samples to be processed may therefore be: scoring each piece of white sample data with each of the constructed scoring models and, for each piece of white sample data, aggregating the scores given by the multiple scoring models to obtain an aggregate score.
Further, the M scores of each piece of white sample data can be aggregated. There are various ways of aggregating, such as summation or weighting; in one feasible implementation, summation is used. It should be noted that the magnitude of the aggregate score reflects how close a white sample is to the black samples: the larger the score, the closer the white sample lies to the black sample region, and the harder it is to train on.
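The aggregation can be sketched as below. The function aggregate_scores and its optional model_weights parameter are hypothetical, and the three lambda functions are placeholder scorers rather than trained models; with no weights given, the result is the plain summation described above.

```python
def aggregate_scores(white_samples, scoring_models, model_weights=None):
    """Aggregate the M scores of each white sample into one value.

    With model_weights=None every model counts equally (plain summation);
    passing weights gives the weighted variant.
    """
    if model_weights is None:
        model_weights = [1.0] * len(scoring_models)
    if len(model_weights) != len(scoring_models):
        raise ValueError("one weight is required per scoring model")
    return [
        sum(w * model(x) for w, model in zip(model_weights, scoring_models))
        for x in white_samples
    ]


# Placeholder scorers standing in for M = 3 trained scoring models.
toy_models = [lambda x: x / 4, lambda x: x / 2, lambda x: 0.25]
white = [1.0, 2.0, 4.0]
totals = aggregate_scores(white, toy_models)
```

Here totals holds one aggregate score per white sample, in the same order as the input.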
Next, the white samples can be sampled based on the aggregate score of each piece of white sample data. Specifically, the embodiments of this specification may use weighted sampling; that is, the process of sampling the samples to be processed according to the scores may be: determining the weight of each piece of white sample data based on the aggregate score, and performing weighted sampling on the white sample data to obtain the sampled white sample data.
In practical applications, the aggregate score of a piece of white sample data can be used directly as its weight; the larger the weight, the more likely the sample is to be drawn.
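One possible way to carry out the weighted sampling is sketched below. The text does not state whether samples are drawn with or without replacement; this sketch assumes drawing without replacement and uses the Efraimidis-Spirakis key method (each item gets the key u^(1/w) for a uniform random u, and the k largest keys are kept). The function name is hypothetical.

```python
import random


def weighted_sample_without_replacement(items, weights, k, seed=0):
    """Draw up to k distinct items; larger weights are more likely to be drawn.

    Assumption: items with a non-positive weight are never drawn.
    """
    if len(items) != len(weights):
        raise ValueError("items and weights must have the same length")
    rng = random.Random(seed)
    keyed = []
    for item, w in zip(items, weights):
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # uniform in (0, 1]
        keyed.append((u ** (1.0 / w), item))
    # Keep the k items with the largest keys.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]


# The aggregate scores are used directly as weights, as described above.
picked = weighted_sample_without_replacement(
    ["a", "b", "c", "d"], [0.0, 0.0, 5.0, 5.0], k=2
)
```

Setting k to the number of black samples yields a white sample set of roughly the same size as the black sample set.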
After the above process, the number of sampled white samples is roughly the same as the number of black samples, which eliminates the data imbalance between the black and white samples.
Finally, a risk model can be constructed from the sampled white sample data and the black sample data, and deployed.
As shown in Fig. 3, the practical implementation of the above scenario may specifically include the following steps:
Step S301: divide the white sample data to obtain M white sample sets.
Step S303: construct M scoring models, each based on one white sample set and the full set of black sample data.
Step S305: score each piece of white sample data with the M scoring models, and aggregate the scores of each piece of white sample data.
Step S307: perform weighted sampling on the white sample data based on the aggregate scores.
Step S309: construct the risk model from the sampled white sample data and the black sample data.
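Steps S301 to S309 can be sketched end to end as follows. This is a sketch under stated assumptions, not the implementation of the embodiments: the model type is not given in the text, so a toy centroid-distance scorer stands in for both the scoring models and the final risk model; sampling is assumed to be without replacement; and all names (fit_centroid_scorer, build_risk_model, and so on) are hypothetical.

```python
import math
import random


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroid_scorer(white, black):
    """Toy stand-in model: score in [0, 1], higher means closer to black."""
    wc, bc = centroid(white), centroid(black)

    def score(x):
        dw, db = math.dist(x, wc), math.dist(x, bc)
        return 0.5 if dw + db == 0 else dw / (dw + db)

    return score


def build_risk_model(white, black, m=3, seed=0):
    if not black or not 1 <= m <= len(white):
        raise ValueError("need black samples and 1 <= m <= len(white)")
    rng = random.Random(seed)
    # S301: divide the white samples into m sets.
    shuffled = list(white)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::m] for i in range(m)]
    # S303: one scoring model per white set, each with all black samples.
    scorers = [fit_centroid_scorer(s, black) for s in subsets]
    # S305: score every white sample with every model and sum the scores.
    totals = [sum(f(x) for f in scorers) for x in white]
    # S307: weighted sampling without replacement, aggregate score as weight.
    keyed = [((1.0 - rng.random()) ** (1.0 / t), x)
             for t, x in zip(totals, white) if t > 0]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    sampled = [x for _, x in keyed[:len(black)]]
    # S309: build the final risk model from sampled white and all black samples.
    return fit_centroid_scorer(sampled, black), sampled


# Synthetic data: 30 white points near the origin, 5 black points near (3, 3).
white = [(0.1 * i, 0.1 * j) for i in range(6) for j in range(5)]
black = [(3.0 + 0.1 * i, 3.0) for i in range(5)]
risk_model, sampled_white = build_risk_model(white, black, m=3)
```

Only risk_model would be deployed; the intermediate scorers are discarded once the white samples have been drawn, matching the point made above that the scoring models are never released.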
The above is the data processing method provided by the embodiments of this specification. Based on the same idea, the embodiments of this specification also provide a risk control model processing apparatus based on imbalanced data. As shown in Fig. 4, the apparatus includes:
an acquisition module 401, which obtains sample data to be processed that contains imbalanced samples;
a division module 402, which divides the sample data to be processed to obtain multiple sample data sets to be processed;
a scoring module 403, which constructs scoring models from the multiple sample data sets to be processed obtained by the division, and scores the samples to be processed, where the scores of the samples to be processed characterize the correlation between the imbalanced samples to be processed;
a construction and deployment module 404, which samples the samples to be processed according to the scores, and constructs and deploys a risk control model based on the sampling result.
Further, the acquisition module 401 obtains predetermined black sample data and white sample data to be processed.
The division module 402 divides the white sample data according to a set number of divisions to obtain the set number of white sample sets.
The scoring module 403, for any white sample set obtained by the division, constructs a scoring model from that white sample set and the full set of black sample data;
where the number of scoring models constructed is equal to the number of white sample sets obtained by the division.
The scoring module 403 scores each piece of white sample data with each of the constructed scoring models and, for each piece of white sample data, aggregates the scores given by the multiple scoring models to obtain an aggregate score.
The construction and deployment module 404 determines the weight of each piece of white sample data based on the aggregate score, and performs weighted sampling on the white sample data to obtain the sampled white sample data.
The construction and deployment module 404 constructs the risk control model from the sampled white sample data and the black sample data, and deploys it.
Based on the apparatus shown in Fig. 4, the embodiments of this specification also provide a risk control model processing device based on imbalanced data (for example, a server or a computer), including:
a memory, which stores a risk control model processing program based on imbalanced data;
a processor, which calls the risk control model processing program based on imbalanced data stored in the memory, and executes:
obtaining sample data to be processed that contains imbalanced samples;
dividing the sample data to be processed to obtain multiple sample data sets to be processed;
constructing scoring models from the multiple sample data sets to be processed obtained by the division, and scoring the samples to be processed, where the scores of the samples to be processed characterize the correlation between the imbalanced samples to be processed;
sampling the samples to be processed according to the scores, and constructing and deploying a risk control model based on the sampling result.
The embodiments in this specification are described in a progressive manner. For the same or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments, which is not repeated here.
The specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions, steps, or modules recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be realized with a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system to "integrate" it on a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by describing the method flow in logic using one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Indeed, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described with its functions divided into various units. Of course, when implementing the application, the functions of the units can be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of the present invention can be provided as a method, a system, or a computer program product. The present invention can therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that the embodiments of the application can be provided as a method, a system, or a computer program product. The application can therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For the same or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments.
The above is only an embodiment of the application and is not intended to limit the application. For those skilled in the art, various modifications and changes can be made to the application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the scope of the claims of the application.