CN116108387A

CN116108387A - Unbalanced data oversampling method and related equipment

Info

Publication number: CN116108387A
Application number: CN202310397766.7A
Authority: CN
Inventors: 刘利枚; 黄志伟; 刘星宝; 石彪
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2023-04-14
Filing date: 2023-04-14
Publication date: 2023-05-12
Anticipated expiration: 2043-04-14
Also published as: CN116108387B

Abstract

The invention provides an unbalanced data oversampling method and related equipment. The method comprises: acquiring a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as an unbalanced data set; randomly selecting a plurality of minority class samples as core sample points, and determining the natural nearest neighbor set and the natural nearest neighborhood of each core sample point; calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of the samples in the unbalanced data set; determining, according to the proportion, the spatial distribution of each core sample point in the unbalanced data set and the number weight and position weight of the new samples to be generated; and acquiring the sample features of the new samples according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. The accuracy of predicting financial fraud is thereby improved.

Description

Unbalanced data oversampling method and related equipment

Technical Field

The invention relates to the technical field of financial unbalanced data processing, in particular to an unbalanced data oversampling method and related equipment.

Background

With the continuous development of artificial intelligence technology, the techniques for collecting, storing and processing data are also advancing. Machine learning and data mining, which draw on multiple disciplines, have become important methods for analyzing data and converting it into useful knowledge. Conventional machine learning generally assumes that the class distribution of the data is balanced, with each class having a similar number of samples. In practice, however, imbalanced class distributions are prevalent across application areas. For example, in credit card fraud detection, fraudulent transactions may account for only 1% of all transactions, so an algorithm can reach a classification accuracy of 99% simply by labelling every transaction as normal; doing so ignores the fraudulent transactions entirely and can cause serious losses to businesses and personal property. Balancing class-imbalanced data therefore has high research value and broad application prospects.
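The accuracy figure in the example above can be reproduced with a few lines of arithmetic. The sketch below uses synthetic labels (1,000 transactions, 1% fraudulent, both numbers chosen only for illustration) and a trivial classifier that labels everything as normal.

```python
# Illustrative only: 1,000 synthetic transactions, 1% labelled fraudulent (1).
n_total = 1000
n_fraud = 10
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A trivial "classifier" that labels every transaction as normal (0).
y_pred = [0] * n_total

# Overall accuracy: share of transactions labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
# Recall on the fraud class: detected frauds / actual frauds.
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_fraud

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
```

The accuracy is high while the recall on the fraud class is zero, which is why accuracy alone is a poor measure on class-imbalanced data.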

Existing approaches to class imbalance mainly include oversampling the minority class samples, undersampling the majority class samples, or a combination of the two. Oversampling refers to balancing the data classes by adding minority class samples through some method or technique.

The standardized Euclidean distance is built on the Euclidean distance: the values of the samples in each dimension are first standardized to have an expectation of 0 and a variance of 1.

Natural nearest neighbors and the natural nearest neighborhood are defined as follows: given a neighbor value k and a set of sample points, if each of two sample points is among the k nearest neighbors of the other, the two sample points are natural nearest neighbors of each other; the region enclosed by the lines connecting such neighboring points is called the natural nearest neighborhood, and k is the natural nearest neighbor value.

At present, most existing oversampling methods are based on the SMOTE algorithm, which generates a certain number of minority class sample points by randomly selecting minority class samples and their neighbor samples and performing linear interpolation between them. The core of the algorithm is the k-nearest-neighbor algorithm, whose neighbor value k is complicated to determine, and setting a fixed k can reduce the quality of the generated samples. Meanwhile, the SMOTE method is insensitive to outliers among the minority class samples: outliers are easily picked when sample points are selected for linear interpolation, so that a large number of noise samples are generated.
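For orientation, the linear interpolation that SMOTE-style methods rely on can be sketched as follows. This is a minimal illustration, not the SMOTE reference implementation: the function name `smote_interpolate`, the toy points and the fixed seed are all assumptions made for the example.

```python
import random

def smote_interpolate(sample, neighbors, rng):
    """Create one synthetic point on the segment between `sample` and a
    randomly chosen neighbor (SMOTE-style linear interpolation)."""
    neighbor = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)                # fixed seed, illustration only
sample = [0.0, 0.0]                   # a minority class sample
neighbors = [[1.0, 0.0], [0.0, 2.0]]  # its pre-selected nearest neighbors
new_point = smote_interpolate(sample, neighbors, rng)
print(new_point)
```

If one of the neighbors is an outlier, every point generated on the segment towards it inherits that noise, which is the weakness described above.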

Disclosure of Invention

The invention provides an unbalanced data oversampling method and related equipment, and aims to eliminate interference of outliers on sample characteristics in a balanced data set and improve accuracy of predicting financial fraud.

In order to achieve the above object, the present invention provides a method for oversampling unbalanced data, comprising:

step 1, acquiring a credit card abnormal transaction data set to be processed, wherein the credit card abnormal transaction data set is used as an unbalanced data set, and the unbalanced data set comprises a minority sample set consisting of a plurality of minority samples and a majority sample set consisting of a plurality of majority samples;

step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; each natural nearest neighbor set comprises a plurality of nearest neighbor elements of a core sample point;

step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of each sample in the unbalanced data set;

step 4, determining the spatial distribution of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;

step 5, determining the number weight of the new samples generated in each natural nearest neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;

step 6, determining the position weight of the new sample points generated in each natural nearest neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;

step 7, acquiring the sample features of the new samples generated in each natural nearest neighborhood according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.

Further, before step 2, the method includes:

the standardized Euclidean distance between two minority class samples is calculated as follows:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{n} \left( \frac{x_{im} - x_{jm}}{s_m} \right)^2}$$

wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{im}$ and $x_{jm}$ respectively denote the values of the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$ in the $m$-th feature dimension; $s_m$ denotes the standard deviation of the minority class sample point set in the $m$-th feature dimension; and $n$ is the number of sample features.
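The distance described above, in which each feature difference is divided by that feature's standard deviation before squaring and summing, can be sketched in a few lines. The function name, the toy data and the choice of the sample standard deviation are assumptions made for illustration; the text does not say which estimator of the standard deviation is used.

```python
import math
import statistics

def standardized_euclidean(samples):
    """Pairwise standardized Euclidean distances between the rows of `samples`."""
    n_features = len(samples[0])
    # Per-feature standard deviation over the minority class samples.
    # Assumption: sample standard deviation; a zero deviation is replaced by 1.
    stds = [statistics.stdev(row[m] for row in samples) or 1.0
            for m in range(n_features)]

    def dist(a, b):
        return math.sqrt(sum(((a[m] - b[m]) / stds[m]) ** 2
                             for m in range(n_features)))

    return [[dist(a, b) for b in samples] for a in samples]

# Toy minority class samples with two features on very different scales.
minority = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
distances = standardized_euclidean(minority)
print(distances[0][1])
```

Dividing by the per-feature deviation keeps a feature with a large numeric range, such as a transaction amount, from dominating the neighbor search.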

Further, step 2 includes:

randomly selecting part of the minority class samples in the minority class sample set as core sample points;

for each core sample point, selecting the k nearest neighbor elements of the core sample point;

the k selected neighbor elements of the core sample point form its k-nearest-neighbor set;

for each minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample contains the core sample point, the minority class sample is regarded as a reverse k-nearest-neighbor element of the core sample point, and the reverse k-nearest-neighbor elements form the reverse k-nearest-neighbor set;

for each minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample does not contain the core sample point, the minority class sample is regarded as an outlier and is discarded;

obtaining the intersection of the k-nearest-neighbor set and the reverse k-nearest-neighbor set;

if the intersection is empty, redefining k and repeating the selection of the k-nearest-neighbor set and the reverse k-nearest-neighbor set;

if the intersection is non-empty, taking the intersection as the natural nearest neighbor set, redefining k, and repeatedly determining the natural nearest neighbor set;

until the reverse k-nearest-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set.
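The search in step 2 can be sketched as a loop that grows k until the reverse neighbor set of a core sample point stops changing. The sketch below is one reading of the text, not the patented procedure itself: it uses the plain Euclidean distance instead of the standardized one, increments k by one per round, and treats a point that still has no reverse neighbor at k = 2 as an outlier (the parameter `max_empty_k` is hypothetical).

```python
import math

def knn(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def natural_neighbors(points, core, max_empty_k=2):
    """Return (natural neighbor set, k at which the reverse set stabilised).

    Returns (None, k) when `core` still has no reverse neighbor at
    k == max_empty_k, i.e. it is treated as an outlier.
    """
    previous_reverse = None
    for k in range(1, len(points)):
        forward = knn(points, core, k)
        # Reverse neighbors: points whose own k-nearest set contains the core.
        reverse = {j for j in range(len(points))
                   if j != core and core in knn(points, j, k)}
        if not reverse:
            if k >= max_empty_k:
                return None, k
            continue
        natural = forward & reverse
        if natural and reverse == previous_reverse:
            return natural, k
        previous_reverse = reverse
    return previous_reverse, len(points) - 1

# Toy minority samples: a tight group around index 0 and a distant group.
points = [(0, 0), (1, 0), (0, 1.5), (10, 10), (10, 11), (11.5, 10)]
print(natural_neighbors(points, 0))
```

Because the loop stops on its own, no neighbor value has to be chosen in advance, and an isolated core point is rejected instead of being interpolated.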

Further, the proportion of majority class samples in the natural nearest neighbor set of each core sample point is calculated by the following expression:

$$r_i = \frac{\Delta_i}{k}$$

wherein $r_i$ denotes the proportion of majority class samples in the natural nearest neighbor set of the $i$-th core sample point, $\Delta_i$ is the number of majority class samples in the $i$-th natural nearest neighbor set, and $k$ denotes the number of neighbor elements of the core sample point.
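The proportion is a simple count ratio: the number of majority class samples among the neighbor elements of a core sample point, divided by the number of neighbor elements. A minimal sketch follows, assuming class labels are available for the neighbors and that the majority class is labelled 0 (both assumptions of this example).

```python
def majority_proportion(neighbor_labels, majority_label=0):
    """Fraction of a core point's neighbor elements that are majority class."""
    if not neighbor_labels:
        raise ValueError("neighbor set must not be empty")
    n_majority = sum(1 for label in neighbor_labels if label == majority_label)
    return n_majority / len(neighbor_labels)

# Four neighbor elements, two of which belong to the majority class (0).
proportion = majority_proportion([0, 1, 1, 0])
print(proportion)
```

A proportion near 1 indicates a core sample point surrounded by majority class samples, and a proportion of 0 indicates one lying deep inside the minority class.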

Further, step 4 includes:

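according to the proportion of majority class samples in each natural nearest neighbor set, the sample generation control weight of each core sample point is set by one of three cases; each case pairs a condition on the proportion with a weight expression involving a control parameter (the three conditions, the weight expressions and the range of the control parameter appear as formula images that are not reproduced here);

the spatial distribution of each core sample point in the unbalanced data set is determined according to the sample generation control weight.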

Further, the number weight $W_i$ of the new samples generated in the $i$-th natural nearest neighborhood is:

$$W_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$

wherein $w_i$ is the sample generation control weight of the $i$-th core sample point, and $\sum_{j=1}^{N} w_j$ denotes the sum of the sample generation control weights of the core sample points of the $N$ natural nearest neighborhoods.
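Reading the number weight as each core sample point's sample generation control weight divided by the sum of the control weights of all core sample points (an interpretation of the surrounding definitions, since the formula itself is not legible in this text), the computation is a plain normalization. The control weight values below are invented for illustration.

```python
def number_weights(control_weights):
    """Normalize sample generation control weights so that they sum to 1."""
    total = sum(control_weights)
    if total <= 0:
        raise ValueError("control weights must have a positive sum")
    return [w / total for w in control_weights]

# Hypothetical control weights for three core sample points.
weights = number_weights([0.5, 1.0, 2.5])
print(weights)
```

Under this reading, a core sample point with a larger control weight receives a proportionally larger share of the new samples.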

Further, the position weight of the new sample points generated in the natural nearest neighborhood is given by an expression in terms of the sample generation control weights of the core sample points (the expression appears as a formula image that is not reproduced here).

Further, step 7 includes:

determining the number of new samples to be generated for the unbalanced data set, by an expression involving a balance parameter that controls the number of new samples (the expression and the range of the balance parameter appear as formula images that are not reproduced here);

calculating the number of new samples to be generated in each natural nearest neighborhood;

generating, for each natural nearest neighborhood, the sample features of that number of new samples according to a regional sample generation formula, whose terms are: a sample feature of a new sample point generated in the natural nearest neighborhood, the sample feature difference between the core sample point and the other sample points in the natural nearest neighborhood, and a random number in the range [0, 1];

taking the generated sample features as the sample features of the new samples in each natural nearest neighborhood, each new sample being formed of its sample features, one per feature dimension;

combining the new samples to obtain a new sample set;

and merging the new sample set with the unbalanced data set to obtain the balanced data set.
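How a total number of new samples might be shared out among the natural nearest neighborhoods can be sketched as below. Two points are assumptions of this example and are not confirmed by the text: the total is taken as the balance parameter times the difference between the majority and minority class counts (as in ADASYN-style methods), and each neighborhood's share is its number weight times that total, rounded by largest remainder. All names are hypothetical.

```python
def allocate_counts(n_majority, n_minority, number_weights, beta=1.0):
    """Split the total number of new samples across neighborhoods.

    Assumption: total = beta * (n_majority - n_minority).
    """
    total = round(beta * (n_majority - n_minority))
    shares = [w * total for w in number_weights]
    counts = [int(share) for share in shares]  # round down first
    # Hand the leftover samples to the largest fractional remainders.
    order = sorted(range(len(counts)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[: total - sum(counts)]:
        counts[i] += 1
    return counts

# 120 majority and 10 minority samples (a 12:1 ratio), three neighborhoods.
counts = allocate_counts(120, 10, [0.125, 0.25, 0.625])
print(counts, sum(counts))
```

With the balance parameter set to 1, the minority class would be brought up to the size of the majority class under this assumption; smaller values generate proportionally fewer samples.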

The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the unbalanced data oversampling method.

The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the unbalanced data oversampling method when executing the computer program.

The scheme of the invention has the following beneficial effects:

the invention takes a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as the unbalanced data set; randomly selects part of the minority class samples as core sample points and determines the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; calculates the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of each sample in the unbalanced data set; determines from that proportion the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in each natural nearest neighborhood and the position weight of the new sample points generated there; and, according to the number weight and the position weight, acquires the sample features of the new samples generated in each natural nearest neighborhood, obtains a new sample set based on the sample features, and merges the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need to repeatedly determine the neighbor value required by traditional oversampling methods, enables adaptive selection of a sample's neighboring points, and eliminates the interference of outliers with the sample features in the balanced data set; within the natural neighborhoods so formed, the number of samples to be generated is allocated adaptively according to the distribution of the data around each minority class sample point, which improves the quality of the generated samples, enlarges the range of the generated samples, and improves the accuracy of predicting financial fraud.

Other advantageous effects of the present invention will be described in detail in the detailed description section which follows.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention;

FIG. 2 is a flowchart showing step 2 according to an embodiment of the present invention;

FIG. 3 is a flowchart showing steps 3-6 in an embodiment of the present invention;

FIG. 4 is a flowchart showing step 7 according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of identifying outliers according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of natural nearest neighbor and natural neighbor selection of a core sample point according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a core sample point identified as an outlier when k = 2 according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of the nearest neighbor elements of a core sample point when k = 1 according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of the nearest neighbor elements of a core sample point when k = 2 according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of the nearest neighbor elements of a core sample point when k = 3 according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a natural nearest neighbor of a core sample point according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of generating a new sample according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, a locked connection, a removable connection, or an integral connection; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

In addition, the technical sample features described below in the various embodiments of the invention may be combined with one another as long as they do not conflict with one another.

The invention provides an unbalanced data oversampling method and related equipment aiming at the existing problems.

As shown in FIG. 1, an embodiment of the present invention provides an unbalanced data oversampling method comprising steps 1 to 7 set out above.

Specifically, step 1 includes: acquiring a credit card abnormal transaction data set to be processed as the unbalanced data set; the unbalanced data set comprises a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples (the set symbols and the relations between them appear as formula images that are not reproduced here).

Specifically, before step 2, it includes:

calculating the standardized Euclidean distance between every two minority class samples and recording the results as a distance set, in which each minority class sample has its own set of distances to the other minority class samples; the standardized Euclidean distance formula is as follows:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{n} \left( \frac{x_{im} - x_{jm}}{s_m} \right)^2}$$

wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{im}$ and $x_{jm}$ respectively denote the values of the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$ in the $m$-th feature dimension; $s_m$ denotes the standard deviation of the minority class sample point set in the $m$-th feature dimension; and $n$ is the number of sample features.

Specifically, as shown in fig. 2, step 2 includes:

for each core sample point, selecting the k nearest neighbor elements of the core sample point;

the k selected neighbor elements form the k-nearest-neighbor set of the core sample point;

for each minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample contains the core sample point, the minority class sample is regarded as a reverse k-nearest-neighbor element of the core sample point, and the reverse k-nearest-neighbor elements form the reverse k-nearest-neighbor set;

for each minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample does not contain the core sample point, the minority class sample is regarded as an outlier and is discarded;

obtaining the intersection of the k-nearest-neighbor set and the reverse k-nearest-neighbor set;

if the intersection is empty, redefining k and repeating the selection of the k-nearest-neighbor set and the reverse k-nearest-neighbor set;

if the intersection is non-empty, taking the intersection as the natural nearest neighbor set, redefining k, and repeatedly determining the natural nearest neighbor set, until the reverse k-nearest-neighbor set of the core sample point no longer changes.

In the embodiment of the invention, the number of neighbor elements k is first initialized;

in the set of distances between the core sample point and the other samples, the k samples with the smallest distance values are selected in ascending order as the nearest neighbor elements, forming a k-nearest-neighbor set that does not contain the core sample point itself;

for the current k, if the nearest neighbor set of a minority class sample other than the core sample point contains the core sample point, that minority class sample is a reverse k-nearest-neighbor element of the core sample point, and these elements are recorded as the reverse k-nearest-neighbor set; if the core sample point has no reverse k-nearest neighbor, k is redefined and the two preceding steps are repeated; if the point still has no reverse neighbor, it is judged to be an outlier, the minority class sample is discarded, and a core sample point is reselected;

the intersection of the k-nearest-neighbor set and the reverse k-nearest-neighbor set of the core sample point is taken as its natural nearest neighbor set;

it is then judged whether the reverse k-nearest-neighbor set has grown: if neighbor elements have been added to it, or if a second condition stated only as a formula image holds, k is redefined and the three preceding steps are repeated; if not, the natural nearest neighbors of the core sample point are thereby determined, and the corresponding natural neighborhood is the region of space formed by the elements of the natural nearest neighbor set;

the search is repeated over the unbalanced data set to obtain the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to it.

Specifically, as shown in fig. 3, step 3 includes:

selecting the neighbor elements within the sample space of the whole unbalanced data set, and calculating the proportion of majority class samples in the natural nearest neighbor set of each core sample point; the calculation formula is as follows:

$$r_i = \frac{\Delta_i}{k}$$

wherein $r_i$ denotes the proportion of majority class samples in the natural nearest neighbor set of the $i$-th core sample point, $\Delta_i$ is the number of majority class samples in the $i$-th natural nearest neighbor set, and $k$ denotes the number of neighbor elements of the core sample point.

Specifically, step 4 includes:

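increasing the data generation weight of the core sample points whose natural nearest neighbor sets contain more majority class sample points; that is, the sample generation control weight of each core sample point is set by one of three cases, each pairing a condition on the proportion of majority class samples with a weight expression involving a control parameter (the three conditions, the weight expressions and the range of the control parameter appear as formula images that are not reproduced here);

the spatial distribution of each core sample point in the unbalanced data set is determined according to the sample generation control weight.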

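Specifically, the number weight $W_i$ of the minority class samples generated in the $i$-th natural nearest neighborhood is:

$$W_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$

wherein $w_i$ is the sample generation control weight of the $i$-th core sample point, and $\sum_{j=1}^{N} w_j$ denotes the sum of the sample generation control weights of the core sample points of the $N$ natural nearest neighborhoods.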

Specifically, the position weight of the minority class sample points generated in the natural nearest neighborhood is given by an expression in terms of the sample generation control weights of the core sample points (the expression appears as a formula image that is not reproduced here).

Specifically, as shown in fig. 4, step 7 includes:

determining the number of new samples to be generated, using a balance parameter that controls the number of new samples;

generating the sample features of the new samples according to the regional sample generation formula, in which each sample feature of a new sample point is generated within the natural nearest neighborhood using a random number in the range [0, 1];

each new sample is formed of its sample features, one per feature dimension;

the new samples are combined to obtain the new sample set;

Specifically, regarding the identification and discarding of outliers, as shown in FIGS. 5 and 6: when the core sample point is an outlier and k = 1, the nearest neighbor element of the core sample point is some sample, but the nearest neighbor element of that sample is a different sample rather than the core sample point, so the core sample point has no reverse nearest neighbor element;

k is redefined and the loop continues;

when k = 2, as shown in FIG. 7, the core sample point has two nearest neighbor elements, and the nearest neighbor sets of neither of them contain the core sample point; the reverse nearest neighbor set of the core sample point is therefore still empty, and the core sample point is identified as an outlier.

As shown in FIG. 8, for the core sample point shown there with k = 1, the nearest neighbor element of the core sample point is a sample whose own nearest neighbor element is the core sample point; that sample is therefore a reverse nearest neighbor element of the core sample point and also lies in the nearest neighbor set of the core sample point, so it is a natural nearest neighbor of the core sample point; k is redefined and the next step is carried out;

when k = 2, as shown in FIG. 9, the core sample point has two nearest neighbor elements, and the nearest neighbor elements of each of them include the core sample point, so both samples are natural nearest neighbors of the core sample point; k is redefined and the next step is carried out;

when k = 3, as shown in FIG. 10, the core sample point has three nearest neighbor elements; the nearest neighbor elements of two of them include the core sample point, while those of the third do not, so the reverse nearest neighbor set of the core sample point is unchanged; the natural nearest neighbors of the core sample point are therefore the two samples found at k = 2, and the natural nearest neighborhood is shown in FIG. 11;

the natural nearest neighbor sets and natural nearest neighborhoods of the remaining core sample points are determined in the same way; the number weight and the position weight of the sample points in each natural nearest neighborhood are then obtained, the sample features of the required number of new samples are generated according to the number weight, the position weight and the regional sample generation formula, and new minority class samples are generated, as shown in FIG. 12.

In an example of the embodiment of the invention, step 1 obtains an unbalanced data set, namely a credit card abnormal transaction data set with a class ratio of 12:1;

step 2 randomly selects a core sample point = [1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787], where the transaction features are [regional economy information, social status information, transaction time, transaction amount period, geographical position, time difference of geographical position, transaction amount]; because financial data is private, the embodiment of the invention desensitizes it;

the distances from the core sample point to the other sample points are calculated first. With k = 1, the nearest neighbor element of the core sample point is a first sample = [1.2498, -0.7183, -5.3903, 6.4542, -8.4853, 0.6353, -7.0199], whose nearest neighbor element is in turn the core sample point, so the first sample is a natural reverse nearest neighbor element of the core sample point; k is redefined and the loop continues;

with k = 2, the nearest neighbor elements of the core sample point are the first sample and a second sample = [1.7035, -1.3053, -6.7167, 6.3536, -8.6016, 0.4499, -7.5062]; the nearest neighbor elements of the second sample include the core sample point, so the second sample is also a natural reverse nearest neighbor element of the core sample point; k is redefined and the loop continues;

with k = 3, the nearest neighbor elements of the core sample point are the first sample, the second sample and a third sample = [1.7017, -1.4394, -6.9999, 6.3162, -8.6708, 0.316, -7.4177]; the nearest neighbor elements of the third sample include the core sample point, so the third sample is also a natural reverse nearest neighbor element of the core sample point; k is redefined and the loop continues;

with k = 4, the nearest neighbor elements of the core sample point are the first, second and third samples and a fourth sample = [1.5156, -1.2072, -6.2346, 5.4507, -7.3337, 1.3612, -6.6081]; the nearest neighbor elements of the fourth sample do not include the core sample point, so the fourth sample is not a reverse nearest neighbor element of the core sample point. The natural nearest neighbor set of the core sample point is therefore {first sample, second sample, third sample}, and its natural nearest neighborhood is the region formed by the lines connecting these points;

Step 3: the proportion of majority class samples in the natural nearest neighbor set of each core sample point is calculated first; for the core sample point above, this proportion determines its sample generation control weight (the numerical values in steps 3 to 6 appear as formula images that are not reproduced here);

Step 4: together with the weights of the other core sample points, the number weight formula gives the number weight of the minority class samples generated in this natural nearest neighborhood;

Step 5: the proportions of majority class samples in the natural nearest neighbor sets of the three natural nearest neighbor elements of the core sample point are calculated, their control weights follow from these proportions, and the position weight formula then gives the position weights;

Step 6: the number of samples to be generated is determined first, with the balance parameter taking its default value of 1; the number of samples to be generated for the core sample point then follows from the per-neighborhood formula;

applying the regional sample generation formula yields a new sample [1.0732, -0.504, -5.1509, 6.7533, -8.4891, 0.8524, -7.7515];

the specific data of the new sample set are as follows:

{1.0732, -0.504, -5.1509, 6.7533, -8.4891, 0.8524, -7.7515

1.1313, -0.5899, -5.3199, 6.7124, -8.5055, 0.803, -7.7187

1.1397, -0.6022, -5.3443, 6.7065, -8.5078, 0.7959, -7.714

……

1.1074, -0.5546, -5.2505, 6.7292, -8.4988, 0.8233, -7.7322}.
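A quick consistency check on the listed data: subtracting the core sample point from each listed new sample gives offset vectors that point in essentially the same direction and differ only in length, which is what one would expect if each new sample is the core sample point plus a random fraction of one fixed displacement. The sketch below only inspects the printed numbers; it does not reproduce the regional sample generation formula, and the helper names are made up for this example.

```python
import math

core = [1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787]
new_samples = [
    [1.0732, -0.504, -5.1509, 6.7533, -8.4891, 0.8524, -7.7515],
    [1.1313, -0.5899, -5.3199, 6.7124, -8.5055, 0.803, -7.7187],
    [1.1397, -0.6022, -5.3443, 6.7065, -8.5078, 0.7959, -7.714],
    [1.1074, -0.5546, -5.2505, 6.7292, -8.4988, 0.8233, -7.7322],
]

def offset(sample):
    """Displacement of a new sample from the core sample point."""
    return [s - c for s, c in zip(sample, core)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

reference = offset(new_samples[0])
cosines = [cosine(reference, offset(s)) for s in new_samples[1:]]
scales = [math.hypot(*offset(s)) / math.hypot(*reference)
          for s in new_samples[1:]]
print(cosines)  # values very close to 1 mean the offsets are parallel
print(scales)   # relative lengths of the offsets
```

Offsets that are parallel but of different lengths are consistent with the role of the random number in [0, 1] described for the generation formula.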

The embodiment of the invention takes a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as the unbalanced data set; randomly selects part of the minority class samples as core sample points and determines the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; calculates the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of each sample in the unbalanced data set; determines from that proportion the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in each natural nearest neighborhood and the position weight of the new sample points generated there; and, according to the number weight and the position weight, acquires the sample features of the new samples generated in each natural nearest neighborhood, obtains a new sample set based on the sample features, and merges the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need to repeatedly determine the neighbor value required by traditional oversampling methods, enables adaptive selection of a sample's neighboring points, and eliminates the interference of outliers with the sample features in the balanced data set; within the natural neighborhoods so formed, the number of samples to be generated is allocated adaptively according to the distribution of the data around each minority class sample point, which improves the quality of the generated samples, enlarges the range of the generated samples, and improves the accuracy of predicting financial fraud.

The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the class-unbalanced data oversampling method described above.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the foregoing embodiments of the present invention may be implemented by a computer program instructing related hardware, where the computer program may be stored in a computer readable storage medium, and the computer program, when executed by a processor, may implement the steps of each of the foregoing method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunications signals, and software distribution media, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.

The embodiment of the invention also provides a terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the class-unbalanced data oversampling method described above when executing the computer program.

It should be noted that the terminal device may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The terminal device may also be a station (ST) in a WLAN, for example a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device having a wireless communication function, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, or the like. The embodiment of the invention does not limit the specific type of the terminal device.

The processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. In other embodiments the memory may be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device. Further, the memory may include both an internal storage unit and an external storage device of the terminal device. The memory is used to store an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.

It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be found in the method embodiment section, and will not be described herein.

While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims

1. A method for oversampling unbalanced data, comprising:

step 1, acquiring a credit card abnormal transaction data set as an unbalanced data set, the unbalanced data set comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples;

step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining a natural nearest neighbor set of each core sample point and a natural nearest neighborhood corresponding to each natural nearest neighbor set; each of the natural nearest neighbor sets includes a plurality of nearest neighbor elements of the core sample point;

step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution condition of each sample in the unbalanced data set;

step 4, determining the spatial distribution condition of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;

step 5, determining the quantity weight of the new samples generated in the natural nearest neighborhood;

step 6, determining the position weight of the new sample points generated in the natural nearest neighborhood;

step 7, acquiring sample features of the new samples generated in each natural nearest neighborhood according to the quantity weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.

2. The class-unbalanced data oversampling method according to claim 1, comprising, before said step 2:

calculating the standardized Euclidean distance between two minority class samples, wherein the formula is as follows:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} \left( \frac{x_{ik} - x_{jk}}{s_k} \right)^2}$$

wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{ik}$ and $x_{jk}$ respectively denote the values of the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$ in the $k$-th sample feature dimension; $s_k$ denotes the standard deviation of the minority class sample point set $S_{\min}$ in the $k$-th sample feature dimension; and $m$ is the number of sample features.
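
As an illustration of the distance in claim 2, the following is a minimal Python sketch, assuming the usual form of the standardized Euclidean distance in which each per-feature difference is divided by that feature's standard deviation over the minority class set. The function names `feature_stds` and `standardized_euclidean`, the use of the population standard deviation, and the handling of zero-variance features are assumptions of this sketch, not details taken from the claims.

```python
import math
from statistics import pstdev


def feature_stds(samples):
    """Per-feature population standard deviation over equal-length feature vectors."""
    return [pstdev(column) for column in zip(*samples)]


def standardized_euclidean(a, b, stds):
    """Euclidean distance with each feature difference scaled by its standard deviation."""
    total = 0.0
    for x, y, s in zip(a, b, stds):
        if s == 0:
            # Constant feature: skipped so it contributes nothing (assumption).
            continue
        total += ((x - y) / s) ** 2
    return math.sqrt(total)


# Toy minority class set with two features on very different scales.
minority = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
stds = feature_stds(minority)
d01 = standardized_euclidean(minority[0], minority[1], stds)
d02 = standardized_euclidean(minority[0], minority[2], stds)
```

Because both features are rescaled by their own spread, the second feature does not dominate the distance even though its raw values are ten times larger.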

3. The class-unbalanced data oversampling method of claim 2, wherein step 2 comprises:

randomly selecting a plurality of minority class samples in the minority class sample set as core sample points;

selecting, for each core sample point, the $k$ neighbor elements of the core sample point, the $k$ neighbor elements constituting the $k$-neighbor set $NN_k(x_i)$;

selecting the inverse $k$ neighbor elements of the core sample point, the inverse $k$ neighbor elements constituting the inverse $k$-neighbor set $RNN_k(x_i)$;

solving for the intersection of the $k$-neighbor set $NN_k(x_i)$ and the inverse $k$-neighbor set $RNN_k(x_i)$;

if the intersection is empty, redefining $k$ and repeatedly selecting the $k$-neighbor set and the inverse $k$-neighbor set;

redefining $k$ and repeatedly solving for the natural nearest neighbor set $NaN(x_i)$;

until the inverse $k$-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set.
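
The iterative search in claim 3 can be sketched as follows. This is one possible reading rather than the patented procedure: it assumes that k starts at 1 and is increased by 1 each round, that the search stops once the core point's inverse k-neighbor set is unchanged between rounds and the intersection is non-empty, and, for brevity, that plain Euclidean distance (`math.dist`) is used instead of the standardized distance. The names `k_nearest` and `natural_neighbors` are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def natural_neighbors(points, core):
    """Grow k until the core point's inverse k-neighbor set stops changing.

    Returns (natural neighbor set, final k). The k -> k + 1 update and the
    stopping rule are assumptions made for this sketch.
    """
    prev_reverse = None
    natural = set()
    k = 0
    for k in range(1, len(points)):
        knn = k_nearest(points, core, k)
        # Inverse k-neighbors: points that count the core point among their own k nearest.
        reverse = {j for j in range(len(points))
                   if j != core and core in k_nearest(points, j, k)}
        natural = knn & reverse
        if natural and reverse == prev_reverse:
            break  # inverse set is stable and the intersection is non-empty
        prev_reverse = reverse
    return natural, k


# Two well-separated groups; the core point (index 0) sits in the small group.
points = [(0, 0), (1, 0), (0, 1),
          (10, 10), (11, 10), (10, 11), (11, 11)]
neighbors, k_final = natural_neighbors(points, core=0)
```

In this toy data the search settles on the two points that share the core point's group, without k having been chosen in advance.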

4. The class-unbalanced data oversampling method of claim 3, wherein the proportion of majority class samples in each natural nearest neighbor set is calculated by:

$$p_i = \frac{N_i^{maj}}{k}$$

wherein $p_i$ denotes the proportion of majority class samples in the $i$-th natural nearest neighbor set, $N_i^{maj}$ is the number of majority class samples in the $i$-th natural nearest neighbor set, and $k$ denotes the number of neighbor elements of the core sample point.
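
The ratio in claim 4 amounts to counting majority class labels among a core point's neighbors. A minimal sketch, assuming class labels are available for each neighbor and that the majority class is encoded as 0 (both the encoding and the function name `majority_proportion` are illustrative choices):

```python
def majority_proportion(neighbor_labels, majority_label=0):
    """Fraction of a core point's natural neighbors that belong to the majority class."""
    if not neighbor_labels:
        return 0.0  # no neighbors found: treated as 0 here (assumption)
    majority = sum(1 for label in neighbor_labels if label == majority_label)
    return majority / len(neighbor_labels)


# Three of the five neighbors carry the majority label.
p = majority_proportion([0, 0, 1, 1, 0])
```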

5. The class-unbalanced data oversampling method of claim 4, wherein step 4 comprises:

according to the proportion of majority class samples in each natural nearest neighbor set, assigning the sample generation control weight of each core sample point by one of three cases [the three conditions and the corresponding weight expressions are formulas not reproduced here], wherein the cases depend on a control parameter [value range not reproduced];

determining, according to the sample generation control weight, the spatial distribution of each core sample point in the unbalanced data set.

6. The class-unbalanced data oversampling method of claim 5, wherein the quantity weight of the new samples generated in the natural nearest neighborhood is given by a formula [not reproduced here] in terms of the sample generation control weight of the core sample point.

7. The class-unbalanced data oversampling method of claim 6, wherein the position weight of the new samples generated in the natural nearest neighborhood is given by a formula [not reproduced here] in terms of the sample generation control weight of the core sample point.

8. The class-unbalanced data oversampling method of claim 7, wherein step 7 comprises:

determining the number of new samples to be generated in the unbalanced data set [formula not reproduced here], wherein a balance parameter controls the number of new samples [value range not reproduced];

calculating the number of new samples to be generated in each natural nearest neighborhood [formula not reproduced here];

generating new samples in each natural nearest neighborhood according to the regional sample generation formula [formula not reproduced here], wherein the formula gives each sample feature of a new sample point generated from a core sample point in terms of the sample feature difference between the core sample point and another sample point in the natural nearest neighborhood and a random number with value range [0, 1];

each new sample being formed from its sample features;

combining the new samples to obtain the new sample set.
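
The generation step of claim 8 can be illustrated with a short Python sketch. Since the formulas themselves are not available in this text, every formula below is an assumption: the total count uses an ADASYN-style rule (balance parameter times the gap between class sizes), the per-neighborhood counts are proportional to the quantity weights, and each new feature is the core point's feature plus the position weight times a random number times the feature difference to a neighbor. The function names are hypothetical.

```python
import random


def total_new_samples(n_majority, n_minority, beta=1.0):
    """Assumed ADASYN-style total; beta = 1 would fully balance the two classes."""
    return int(round(beta * (n_majority - n_minority)))


def allocate(total, number_weights):
    """Split the total across neighborhoods in proportion to their quantity weights."""
    scale = sum(number_weights)
    return [int(round(total * w / scale)) for w in number_weights]


def synthesize(core, neighbors, count, position_weight, rng):
    """Interpolate from the core point toward randomly chosen natural neighbors."""
    new_samples = []
    for _ in range(count):
        neighbor = rng.choice(neighbors)
        r = rng.random()  # random number in [0, 1)
        new_samples.append([c + position_weight * r * (n - c)
                            for c, n in zip(core, neighbor)])
    return new_samples


rng = random.Random(0)  # fixed seed so the sketch is repeatable
total = total_new_samples(n_majority=100, n_minority=40, beta=0.5)
counts = allocate(total, [0.5, 0.3, 0.2])
generated = synthesize([1.0, 1.0], [[2.0, 1.0], [1.0, 3.0]],
                       counts[0], position_weight=0.8, rng=rng)
```

Under these assumptions, a position weight below 1 keeps the new points closer to the core point than to its neighbors. Note that independent rounding in `allocate` does not guarantee that the counts sum exactly to the total.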

9. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the class-unbalanced data oversampling method according to any one of claims 1 to 7.

10. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the class-unbalanced data oversampling method according to any one of claims 1 to 7 when executing the computer program.