CN116108387A - Unbalanced data oversampling method and related equipment - Google Patents
- Publication number
- CN116108387A (application number CN202310397766.7A)
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- nearest neighbor
- natural
- core sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides an unbalanced data oversampling method and related equipment. The method comprises: acquiring a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as an unbalanced data set; randomly selecting a plurality of minority class samples as core sample points, and determining the natural nearest neighbor set and the natural nearest neighborhood of each core sample point; calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of the samples in the unbalanced data set; determining, according to the proportion, the spatial distribution of each core sample point in the unbalanced data set and the number weight and position weight of the new samples to be generated; and acquiring sample features of the new samples according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. The accuracy of predicting financial fraud is thereby improved.
Description
Technical Field
The invention relates to the technical field of financial unbalanced data processing, in particular to an unbalanced data oversampling method and related equipment.
Background
With the continuous development of artificial intelligence, the technology for collecting, storing and processing data keeps advancing. Machine learning and data mining, which draw on multiple disciplines, have become important methods for analyzing and processing data and converting it into useful knowledge. Conventional machine learning generally assumes that the class distribution of the data is balanced, that is, each class has a comparable number of samples. In practice, however, imbalanced class distributions are prevalent across application areas. For example, in credit card fraud detection, fraudulent transactions may account for only 1% of all transactions, so an algorithm can reach a classification accuracy of 99% simply by labeling every transaction as normal; this ignores fraudulent transactions altogether and can cause serious losses to businesses and individuals. Balancing class-imbalanced data therefore has high research value and broad application prospects.
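The 99% figure above can be checked with a small Python sketch. The data is hypothetical (1000 transactions, 10 of them fraudulent), and the function name `accuracy_and_recall` is chosen here only for illustration; the point is that accuracy looks excellent while no fraudulent transaction is found.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class) for two label lists."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical data: 1% of the transactions are fraudulent (label 1).
y_true = [1] * 10 + [0] * 990
# A "classifier" that labels every transaction as normal (label 0).
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)
```

Recall on the fraud class is the quantity that oversampling the minority class is meant to help with, which is why accuracy alone is a poor target on such data.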
Existing approaches to class imbalance mainly include oversampling the minority class samples, undersampling the majority class samples, or a combination of both. Oversampling refers to balancing the data classes by adding minority class samples through some method or technique.
The standard (standardized) Euclidean distance is a variant of the Euclidean distance in which the values of the samples in each dimension are first standardized to have mean 0 and variance 1.
Natural nearest neighbors and the natural nearest neighborhood are defined as follows: given a neighbor value $k$ and a sample point set $X$, for $x_i, x_j \in X$, if $x_i$ and $x_j$ are each among the $k$ nearest sample points of the other, then $x_i$ and $x_j$ are natural neighbors of each other; the region formed by the lines connecting these neighboring points is the natural nearest neighborhood, and $k$ is the natural nearest neighbor value.
At present, most existing oversampling methods are based on the SMOTE algorithm, which generates a certain number of minority class sample points by randomly selecting minority class samples and their neighbor samples and performing linear interpolation between them. The core of the algorithm is the $k$-nearest-neighbor algorithm, in which determining the neighbor value $k$ is complicated, and a fixed $k$ can cause problems such as reduced quality of the generated samples. Moreover, the SMOTE method is insensitive to outliers among the minority class samples: outliers are easily picked when sample points are selected for linear interpolation, so a large number of noise samples are generated.
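A simplified sketch of SMOTE-style interpolation with a fixed neighbor value may make the outlier problem concrete. This is not a full SMOTE implementation; the function name `smote_like`, the toy points and the choice `k=2` are assumptions made for illustration only.

```python
import math
import random


def smote_like(minority, k, n_new, rng):
    """Create n_new points, each on the segment between a random minority
    sample and one of its k nearest minority neighbors (fixed k)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority samples to the chosen base, excluding the base itself
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Three clustered minority points and one outlier at (5, 5).
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
points = smote_like(minority, k=2, n_new=20, rng=random.Random(0))
print(len(points))
```

Whenever the outlier is drawn as the base point, its "nearest" neighbors are in the distant cluster, so the new point can land anywhere on the long segment between them. That is the kind of noise sample the passage describes.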
Disclosure of Invention
The invention provides an unbalanced data oversampling method and related equipment, and aims to eliminate interference of outliers on sample characteristics in a balanced data set and improve accuracy of predicting financial fraud.
In order to achieve the above object, the present invention provides an unbalanced data oversampling method, comprising:
step 1, acquiring a credit card abnormal transaction data set to be processed as an unbalanced data set, the unbalanced data set comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples;
step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; each natural nearest neighbor set comprises a plurality of nearest neighbor elements of a core sample point;
step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of the samples in the unbalanced data set;
step 4, determining the spatial distribution of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in each natural nearest neighborhood;
step 6, determining the position weight of the new sample points generated in each natural nearest neighborhood;
step 7, acquiring the sample features of the new samples generated in each natural nearest neighborhood according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.
Further, before step 2, the method includes:
calculating the standard Euclidean distance between two minority class samples as follows:
$$d(x_i, x_j) = \sqrt{\sum_{t=1}^{m} \left( \frac{x_{it} - x_{jt}}{s_t} \right)^2}$$
wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{it}$ and $x_{jt}$ respectively denote the values of $x_i$ and $x_j$ in the $t$-th sample feature dimension; $s_t$ denotes the standard deviation of the minority class sample point set in the $t$-th sample feature dimension; and $m$ is the number of sample features.
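As a minimal sketch, the standard Euclidean distance can be computed in Python by dividing each feature difference by that feature's standard deviation over the minority class samples. The source does not say whether the population or the sample standard deviation is meant; the population form (`statistics.pstdev`) is assumed here, and the function names and toy data are illustrative.

```python
import math
import statistics


def feature_stds(samples):
    """Standard deviation of each feature dimension (population form, an assumption)."""
    return [statistics.pstdev(column) for column in zip(*samples)]


def standard_euclidean(a, b, stds):
    """Euclidean distance after scaling each feature difference by its std.
    A feature with zero std is constant, so its difference is 0 and it is skipped."""
    return math.sqrt(sum(((x - y) / s) ** 2
                         for x, y, s in zip(a, b, stds) if s > 0))


# Toy minority samples whose second feature is on a 10x larger scale.
minority = [[0.0, 0.0], [2.0, 20.0], [4.0, 40.0]]
stds = feature_stds(minority)
d01 = standard_euclidean(minority[0], minority[1], stds)
print(d01)
```

After scaling, both features contribute equally to the distance even though their raw ranges differ by a factor of ten, which is the purpose of standardizing before the neighbor search.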
Further, step 2 includes:
randomly selecting part of the minority class samples in the minority class sample set as core sample points;
for a minority class sample other than the core sample point in the minority class sample set, if the $k$-nearest-neighbor set of the minority class sample contains the core sample point, regarding the minority class sample as a reverse $k$-nearest-neighbor element of the core sample point, the reverse $k$-nearest-neighbor elements forming the reverse $k$-nearest-neighbor set of the core sample point;
for a minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample does not contain the core sample point, regarding the minority class sample as an outlier and discarding it;
if the intersection of the $k$-nearest-neighbor set and the reverse $k$-nearest-neighbor set of the core sample point is empty, redefining $k = k + 1$ and repeating the selection of the $k$-nearest-neighbor set and the reverse $k$-nearest-neighbor set;
if the intersection is non-empty, taking the intersection as the natural nearest neighbor set, redefining $k = k + 1$, and repeating the search for the natural nearest neighbor set;
until the reverse $k$-nearest-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set.
Further, the proportion of majority class samples in the natural nearest neighbor set of each core sample point is calculated as:
$$p_i = \frac{N_i^{maj}}{k_i}$$
wherein $p_i$ denotes the proportion of majority class samples in the natural nearest neighbor set of the $i$-th core sample point, $N_i^{maj}$ is the number of majority class samples in that natural nearest neighbor set, and $k_i$ denotes the number of neighbor elements of the core sample point.
Further, step 4 includes:
calculating, according to the proportion of majority class samples in each natural nearest neighbor set, the sample generation control weight of each core sample point, the calculation also involving a control parameter;
determining the spatial distribution of each core sample point in the unbalanced data set according to the sample generation control weight.
Further, the number weight of the new samples generated in a natural nearest neighborhood is determined from the sample generation control weight of its core sample point and the sum of the sample generation control weights of the core sample points of the natural nearest neighborhoods.
Further, the position weight of the new sample points generated in a natural nearest neighborhood is likewise determined from the sample generation control weight of its core sample point and the sum of the sample generation control weights of the core sample points of the natural nearest neighborhoods.
Further, step 7 includes:
determining the number of new samples to be generated for the unbalanced data set;
calculating the number of new samples to be generated in each natural nearest neighborhood;
generating, for each natural nearest neighborhood, the sample features of the new samples according to a region sample generation formula, in which each sample feature of each new sample point is obtained from the sample feature difference between the core sample point and the other sample points in the natural nearest neighborhood and from a random number in the range [0,1];
taking the generated sample features as the sample features of the new samples in each natural nearest neighborhood to obtain the new samples, each new sample being formed by its sample features;
merging the new sample set with the unbalanced data set to obtain a balanced data set.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the unbalanced data oversampling method described above.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the unbalanced data oversampling method described above when executing the computer program.
The scheme of the invention has the following beneficial effects:
The invention takes a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as an unbalanced data set; randomly selects part of the minority class samples in the minority class sample set as core sample points, and determines the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; calculates the proportion of majority class samples in the natural nearest neighbor set of each core sample point according to the spatial distribution of the samples in the unbalanced data set; determines, according to that proportion, the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in each natural nearest neighborhood and the position weight of the new sample points generated in each natural nearest neighborhood; and acquires, according to the number weight and the position weight, the sample features of the new samples generated in each natural nearest neighborhood, obtains a new sample set based on the sample features, and merges the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need, found in traditional oversampling methods, to determine the neighbor value repeatedly; it enables adaptive selection of the neighboring points of a sample, eliminates the interference of outliers on the sample features in the balanced data set, and adaptively allocates the number of samples to be generated in each natural neighborhood according to the distribution of the data around the minority class sample points in that neighborhood, which improves the quality of the generated samples, enlarges the range of the generated samples, and improves the precision of predicting financial fraud.
Other advantageous effects of the present invention will be described in detail in the detailed description section which follows.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a flowchart showing step 2 according to an embodiment of the present invention;
FIG. 3 is a flowchart showing steps 3-6 in an embodiment of the present invention;
FIG. 4 is a flowchart showing step 7 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of identifying outliers according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of natural nearest neighbor and natural neighbor selection of a core sample point according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a core sample point identified as an outlier when $k = 2$ in an embodiment of the present invention;
FIG. 8 is a schematic diagram of the nearest neighbor elements of a core sample point when $k = 1$ in an embodiment of the present invention;
FIG. 9 is a schematic diagram of the nearest neighbor elements of a core sample point when $k = 2$ in an embodiment of the present invention;
FIG. 10 is a schematic diagram of the nearest neighbor elements of a core sample point when $k = 3$ in an embodiment of the present invention;
FIG. 11 is a schematic diagram of a natural nearest neighbor of a core sample point according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of generating a new sample according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, a locked connection, a removable connection, or an integral connection; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features described below in the various embodiments of the invention may be combined with one another as long as they do not conflict with one another.
The invention provides an unbalanced data oversampling method and related equipment aiming at the existing problems.
As shown in fig. 1, an embodiment of the present invention provides an unbalanced data oversampling method, including:
step 1, acquiring a credit card abnormal transaction data set to be processed as an unbalanced data set, the unbalanced data set comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples;
step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set; each natural nearest neighbor set comprises a plurality of nearest neighbor elements of a core sample point;
step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of the samples in the unbalanced data set;
step 4, determining the spatial distribution of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in each natural nearest neighborhood;
step 6, determining the position weight of the new sample points generated in each natural nearest neighborhood;
step 7, acquiring the sample features of the new samples generated in each natural nearest neighborhood according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.
Specifically, step 1 includes: acquiring a credit card abnormal transaction data set to be processed as the unbalanced data set, the unbalanced data set comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples.
Specifically, before step 2, the method includes:
calculating the standard Euclidean distance between every two minority class samples and recording the distances as a distance set, in which each minority class sample has its own set of distances to the other minority class samples; the standard Euclidean distance formula is as follows:
$$d(x_i, x_j) = \sqrt{\sum_{t=1}^{m} \left( \frac{x_{it} - x_{jt}}{s_t} \right)^2}$$
wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{it}$ and $x_{jt}$ respectively denote the values of $x_i$ and $x_j$ in the $t$-th feature dimension; $s_t$ denotes the standard deviation of the minority class sample point set in the $t$-th feature dimension; and $m$ is the number of sample features.
Specifically, as shown in fig. 2, step 2 includes:
randomly selecting part of the minority class samples in the minority class sample set as core sample points;
for a minority class sample other than the core sample point in the minority class sample set, if the $k$-nearest-neighbor set of the minority class sample contains the core sample point, regarding the minority class sample as a reverse $k$-nearest-neighbor element of the core sample point, the reverse $k$-nearest-neighbor elements forming the reverse $k$-nearest-neighbor set of the core sample point;
for a minority class sample other than the core sample point in the minority class sample set, if the nearest neighbor set of the minority class sample does not contain the core sample point, regarding the minority class sample as an outlier and discarding it;
if the intersection of the $k$-nearest-neighbor set and the reverse $k$-nearest-neighbor set of the core sample point is empty, redefining $k = k + 1$ and repeating the selection of the $k$-nearest-neighbor set and the reverse $k$-nearest-neighbor set;
if the intersection is non-empty, taking the intersection as the natural nearest neighbor set, redefining $k = k + 1$, and repeating the search for the natural nearest neighbor set;
until the reverse $k$-nearest-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set.
In more detail: from the distance set between the core sample point and the other elements, the $k$ elements with the smallest distances are selected in ascending order, the element with the smallest distance being the first nearest neighbor element, to form the $k$-nearest-neighbor set of the core sample point, which does not contain the core sample point itself;
for the current $k$, if the $k$-nearest-neighbor set of a minority class sample other than the core sample point contains the core sample point, that minority class sample is a reverse $k$-nearest-neighbor element of the core sample point, and the set of such elements is recorded as the reverse $k$-nearest-neighbor set; if the core sample point has no reverse $k$-nearest neighbor, the number of nearest neighbor elements is redefined as $k = k + 1$ and the above two steps are repeated; if the point still has no reverse neighbor, it is judged to be an outlier and discarded, and a core sample point is reselected;
the intersection of the $k$-nearest-neighbor set and the reverse $k$-nearest-neighbor set of the core sample point is found, this intersection being its natural nearest neighbor set;
whether the reverse $k$-nearest-neighbor set has grown is then judged: if the neighbor elements in the reverse $k$-nearest-neighbor set have increased, $k = k + 1$ is defined and the above three steps are repeated; if not, the natural nearest neighbors of the core sample point are thereby determined, and the corresponding natural neighborhood is the region in space formed by the elements of the natural nearest neighbor set;
the search over the unbalanced data set is repeated to obtain the natural nearest neighbor set of each core sample point and the natural nearest neighborhood corresponding to each natural nearest neighbor set.
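The search loop for one core sample point can be sketched in Python as below. This is one reading of the steps, not a definitive implementation: it starts at $k = 1$, increases $k$ by one per cycle, treats a core point with no reverse neighbor at $k = 1$ and $k = 2$ as an outlier, and stops when the reverse set is unchanged between two consecutive cycles. The function names, the `dist` callback and the one-dimensional toy data are assumptions for illustration.

```python
def k_nearest(index, k, points, dist):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: (dist(points[index], points[j]), j))
    return others[:k]


def natural_neighbors(core, points, dist):
    """Return (natural nearest neighbor set, final k) for points[core].
    An empty set means the core point was judged to be an outlier."""
    n = len(points)
    previous_reverse = None
    natural = set()
    k = 1
    while k <= n - 1:
        nearest = set(k_nearest(core, k, points, dist))
        # reverse k-neighbors: points whose own k-nearest set contains the core
        reverse = {j for j in range(n)
                   if j != core and core in k_nearest(j, k, points, dist)}
        natural = nearest & reverse
        if not reverse and k >= 2:
            return set(), k      # still no reverse neighbor: outlier
        if natural and reverse == previous_reverse:
            return natural, k    # reverse set stopped growing
        previous_reverse = reverse
        k += 1
    return natural, n - 1


def abs_dist(a, b):
    return abs(a - b)


# Toy minority samples on a line: two clusters, index 0 is the core point.
cluster = [0.0, 0.1, 0.25, 0.45, 10.0, 10.1, 10.25, 10.45]
result = natural_neighbors(0, cluster, abs_dist)
print(result)

# A core point far away from every other sample is rejected as an outlier.
with_outlier = [0.0, 0.1, 0.25, 0.45, 100.0]
outlier_result = natural_neighbors(4, with_outlier, abs_dist)
print(outlier_result)
```

In the method itself the distance would be the standard Euclidean distance over the minority class samples; a plain absolute difference is used here only to keep the example easy to follow by hand.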
Specifically, as shown in fig. 3, step 3 includes:
selecting the different neighbor elements and calculating, in the sample space of the whole unbalanced data set, the proportion of majority class samples in the natural nearest neighbor set of each core sample point; the calculation formula is as follows:
$$p_i = \frac{N_i^{maj}}{k_i}$$
wherein $p_i$ denotes the proportion of majority class samples in the natural nearest neighbor set of the $i$-th core sample point, $N_i^{maj}$ is the number of majority class samples in that natural nearest neighbor set, and $k_i$ denotes the number of neighbor elements of the core sample point.
Specifically, step 4 includes:
according to the proportion of majority class samples in each natural nearest neighbor set, increasing the data generation weight of the core sample points whose natural nearest neighbor sets contain more majority class sample points, that is, calculating for each core sample point a sample generation control weight that depends on this proportion and on a control parameter;
determining the spatial distribution of each core sample point in the unbalanced data set according to the sample generation control weight.
Specifically, the number weight of the minority class samples generated in a natural nearest neighborhood is determined from the sample generation control weight of its core sample point and the sum of the sample generation control weights of the core sample points of the natural nearest neighborhoods.
Specifically, the position weight of the minority class sample points generated in a natural nearest neighborhood is likewise determined from the sample generation control weight of its core sample point and the sum of the sample generation control weights of the core sample points of the natural nearest neighborhoods.
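The text describes the number weight in terms of a core point's sample generation control weight and the sum of those weights, but the exact expression cannot be read from this text. The Python sketch below therefore assumes a plain normalization (each control weight divided by the sum) and a largest-remainder rounding so that the per-neighborhood counts add up to the requested total. Both choices, and the names `number_weights` and `allocate_counts`, are assumptions for illustration rather than the patented formulas.

```python
def number_weights(control_weights):
    """Assumed form: each control weight divided by the sum of all control weights."""
    total = sum(control_weights)
    if total == 0:
        return [1.0 / len(control_weights)] * len(control_weights)
    return [w / total for w in control_weights]


def allocate_counts(weights, total_new):
    """Split total_new samples across neighborhoods in proportion to weights,
    using largest-remainder rounding so the counts sum to total_new."""
    raw = [w * total_new for w in weights]
    counts = [int(r) for r in raw]
    leftovers = total_new - sum(counts)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftovers]:
        counts[i] += 1
    return counts


# Hypothetical control weights for three core sample points.
weights = number_weights([1.0, 3.0, 4.0])
counts = allocate_counts(weights, 16)
print(weights, counts)
```

With these hypothetical weights, a core point whose neighborhood contains more majority class samples (a larger control weight) receives proportionally more new samples, which matches the stated intent of raising the generation weight for such points.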
Specifically, as shown in fig. 4, step 7 includes:
determining the number of new samples to be generated for the unbalanced data set;
calculating the number of new samples to be generated in each natural nearest neighborhood;
generating, for each natural nearest neighborhood, the sample features of the new samples according to a region sample generation formula, in which each sample feature of each new sample point is obtained from the sample feature difference between the core sample point and the other sample points in the natural nearest neighborhood and from a random number in the range [0,1];
taking the generated sample features as the sample features of the new samples in each natural nearest neighborhood to obtain the new samples, each new sample being formed by its sample features;
merging the new sample set with the unbalanced data set to obtain a balanced data set.
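A hedged Python sketch of the generation and merging steps follows. The region sample generation formula itself is not legible in this text, so the sketch assumes the simplest form consistent with the description: each new feature equals the core point's feature plus a random number in [0,1] times the feature difference to a natural neighbor. The optional `scale` argument is a hypothetical stand-in for the position weight; the function names and labels are also illustrative. The two feature vectors are the core sample point and its first nearest neighbor from the worked example in this document.

```python
import random


def generate_in_neighborhood(core, neighbors, count, rng, scale=1.0):
    """Create `count` new samples between the core point and its natural neighbors.
    `scale` is a hypothetical placeholder for the position weight."""
    new_samples = []
    for _ in range(count):
        other = rng.choice(neighbors)
        r = rng.random()  # random number in [0, 1)
        new_samples.append([c + scale * r * (o - c) for c, o in zip(core, other)])
    return new_samples


def merge(dataset, new_samples, minority_label=1):
    """Append the new samples, labeled as minority class, to the original data set."""
    return dataset + [(features, minority_label) for features in new_samples]


core = [1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787]
neighbor = [1.2498, -0.7183, -5.3903, 6.4542, -8.4853, 0.6353, -7.0199]

new_samples = generate_in_neighborhood(core, [neighbor], 5, random.Random(1))
dataset = [(core, 1), (neighbor, 1)]
balanced = merge(dataset, new_samples)
print(len(balanced))
```

Because every new point stays between the core point and one of its natural neighbors, an outlier that was rejected during the neighbor search can never pull new samples toward itself.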
Specifically, with respect to the identification and discarding of outliers, as shown in FIGS. 5 and 6, when the core sample point is an outlier and $k = 1$, the nearest neighbor element of the core sample point is some sample, and the nearest neighbor element of that sample is a different sample rather than the core sample point, so the core sample point has no reverse $k$-nearest-neighbor element;
when $k = 2$, as shown in FIG. 7, the core sample point has two nearest neighbor elements, and the two nearest neighbor elements of each of those samples are again other samples rather than the core sample point; the reverse $k$-nearest-neighbor set of the core sample point is therefore still an empty set, so the core sample point is identified as an outlier.
As shown in FIG. 8, for a core sample point with $k = 1$, the nearest neighbor element of the core sample point is a sample whose own nearest neighbor element is the core sample point; that sample is therefore a reverse $k$-nearest-neighbor element of the core sample point and is also in the nearest neighbor set of the core sample point, so it is a natural nearest neighbor of the core sample point; $k = 2$ is then defined and the next step is carried out;
when $k = 2$, as shown in FIG. 9, the core sample point has two nearest neighbor elements; the nearest neighbor elements of the first are the core sample point and another sample, and the nearest neighbor elements of the second are likewise the core sample point and another sample; both samples are therefore natural nearest neighbors of the core sample point; $k = 3$ is then defined and the next step is carried out;
when $k = 3$, as shown in FIG. 10, the core sample point has three nearest neighbor elements; the nearest neighbor elements of two of them include the core sample point, while those of the third are three other samples; the reverse $k$-nearest-neighbor set of the core sample point is therefore unchanged, the natural nearest neighbors of the core sample point are the first two samples, and the natural nearest neighborhood is shown in FIG. 11;
the natural nearest neighbor sets and natural nearest neighborhoods of the remaining core sample points are determined, the number weight and position weight of the sample points generated in each natural nearest neighborhood are solved, and the sample features of the new samples are generated according to the number weight, the position weight and the region sample generation formula, thereby generating new minority class samples, as shown in fig. 12.
As an example in the embodiment of the invention, the unbalanced data set obtained is a credit card abnormal transaction data set with a class ratio of 12:1;
step 2, randomly selecting core sample points=[1.2023,-0.6947,-5.5263,6.6624,-8.5255,0.7427,-7.6787]Specifically, trade characteristics= [ regional economy information, social status information, trade time, trade amount period, geographical position, time difference of geographical position, trade amount]Because of the privacy of the financial data, embodiments of the present invention desensitize it;
First, the distances between the core sample point and the other sample points are calculated. With the initial neighbor value, the nearest neighbor element of the core sample point is the sample [1.2498,-0.7183,-5.3903,6.4542,-8.4853,0.6353,-7.0199], and the nearest neighbor element of that sample is the core sample point; therefore, that sample is a natural reverse neighbor element of the core sample point. The neighbor value is then redefined and the loop continues;
In the next round, the nearest neighbor elements of the core sample point are two samples, the second of which is [1.7035,-1.3053,-6.7167,6.3536,-8.6016,0.4499,-7.5062]; the nearest neighbor elements of this second sample are likewise two samples. Therefore, this sample is a natural reverse neighbor element of the core sample point. The neighbor value is redefined and the loop continues;
In the following round, the nearest neighbor elements of the core sample point are three samples, the third of which is [1.7017,-1.4394,-6.9999,6.3162,-8.6708,0.316,-7.4177]; the nearest neighbor elements of this third sample are likewise three samples. Therefore, this sample is a natural reverse neighbor element of the core sample point. The neighbor value is redefined and the loop continues;
In the last round, the nearest neighbor elements of the core sample point are four samples, the fourth of which is [1.5156,-1.2072,-6.2346,5.4507,-7.3337,1.3612,-6.6081]; the nearest neighbor elements of this fourth sample are four samples. Therefore, this sample is not a natural reverse neighbor element of the core sample point, so the natural nearest neighbor set of the core sample point consists of the three samples found in the preceding rounds, and its natural nearest neighbor domain is the area formed by the connecting lines between these points;
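The round-by-round search above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `standardized_distances` and `natural_neighbors`, the use of the sample standard deviation (`ddof=1`), and the stopping rule (stop at the first candidate whose own neighbor list does not contain the core sample point) are assumptions read off the worked example, and the toy data is hypothetical.

```python
import numpy as np


def standardized_distances(X):
    """Pairwise standardized Euclidean distances between the rows of X.

    Each feature is divided by its standard deviation before the ordinary
    Euclidean distance is taken (ddof=1 is an assumption).
    """
    s = X.std(axis=0, ddof=1)
    s[s == 0] = 1.0  # constant features would otherwise divide by zero
    Z = X / s
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def natural_neighbors(X, core):
    """Return indices of the natural reverse neighbors of sample `core`.

    In round r the r-th nearest neighbor of the core sample is the candidate.
    It is accepted if the core sample is among the candidate's own r nearest
    neighbors; the search stops at the first rejected candidate.
    """
    D = standardized_distances(X)
    np.fill_diagonal(D, np.inf)       # a sample is never its own neighbor
    order = np.argsort(D, axis=1)     # order[i, :r] = r nearest neighbors of i
    members = []
    for r in range(1, len(X)):
        candidate = order[core, r - 1]
        if core in order[candidate, :r]:
            members.append(int(candidate))
        else:
            break
    return members


# Hypothetical one-feature toy data: three close samples and two distant ones.
X_toy = np.array([[0.0], [1.0], [2.5], [10.0], [11.0]])
toy_members = natural_neighbors(X_toy, core=0)
```

On this toy data the search accepts the samples at indices 1 and 2 and stops at index 3, whose three nearest neighbors do not include the core sample, so `toy_members` is `[1, 2]`.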
Step 3: first, the proportion of majority class samples in the natural nearest neighbor set of each core sample point is calculated; from the proportion of majority class samples in the natural nearest neighbor set of the core sample point, its sample generation control weight is obtained;
Step 4: based on the weights of the other core sample points, the number weight of the minority class samples generated in the natural nearest neighbor domain is obtained from the formula;
From the formula, a new sample is obtained as [1.0732,-0.504,-5.1509,6.7533,-8.4891,0.8524,-7.7515];
{1.0732,-0.504,-5.1509,6.7533,-8.4891,0.8524,-7.7515
1.1313,-0.5899,-5.3199,6.7124,-8.5055,0.803,-7.7187
1.1397,-0.6022,-5.3443,6.7065,-8.5078,0.7959,-7.714
……
1.1074,-0.5546,-5.2505,6.7292,-8.4988,0.8233,-7.7322}.
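The idea of generating new samples from a core sample point, a feature difference, and a random number can be illustrated with a short Python sketch. The regional sample generation formula and the position weight are not legible in this text, so the sketch substitutes plain SMOTE-style linear interpolation between the core sample point and a single neighbor; the function name `interpolate_samples` is hypothetical, and the output is not expected to reproduce the values listed above.

```python
import numpy as np


def interpolate_samples(core, neighbor, n_new, rng):
    """Generate n_new points on the segment between `core` and `neighbor`.

    Assumption: new = core + rand * (neighbor - core), with one random
    number per new sample. rng.random draws from [0, 1), which differs
    from a closed interval [0, 1] only at the endpoint.
    """
    core = np.asarray(core, dtype=float)
    diff = np.asarray(neighbor, dtype=float) - core
    rand = rng.random((n_new, 1))
    return core + rand * diff


# Core sample point and its first nearest neighbor, taken from the example.
core = [1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787]
neighbor = [1.2498, -0.7183, -5.3903, 6.4542, -8.4853, 0.6353, -7.0199]

rng = np.random.default_rng(0)  # fixed seed only for repeatability
new_samples = interpolate_samples(core, neighbor, n_new=3, rng=rng)
```

Every generated row lies between the two input samples in each feature, which is the property that keeps the new minority class samples inside the neighborhood of the core sample point.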
The embodiment of the invention takes a credit card abnormal transaction data set as the unbalanced data set, which comprises a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples. Part of the minority class samples in the minority class sample set are randomly selected as core sample points, and the natural nearest neighbor set of each core sample point and the natural nearest neighbor domain corresponding to each natural nearest neighbor set are determined. The proportion of majority class samples in each natural nearest neighbor set is calculated according to the spatial distribution of the samples in the unbalanced data set. According to that proportion, the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in each natural nearest neighbor domain, and the position weight of the new sample points generated in each natural nearest neighbor domain are determined. According to the number weight and the position weight, the sample features of the new samples generated in each natural nearest neighbor domain are obtained, a new sample set is obtained based on the sample features, and the new sample set and the unbalanced data set are combined to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need, in traditional oversampling methods, to repeatedly determine the neighbor value; it allows the neighboring points of a sample to be selected adaptively and eliminates the interference of outliers with the sample features in the balanced data set. Within each natural nearest neighbor domain, the number of samples to be generated is allocated adaptively according to the distribution of the data around the minority class sample points, which improves the quality of the generated samples, enlarges the range of the generated samples, and improves the precision of predicting financial fraud.
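How a total number of new samples might be distributed over the natural nearest neighbor domains according to their number weights can be sketched in Python. The formulas for the total and for the per-domain counts are not legible in this text, so everything here is an assumption made for illustration: the function name `allocate_counts`, the proportional split, the largest-remainder rounding, and the example weights and total.

```python
def allocate_counts(weights, total):
    """Split `total` new samples across domains in proportion to `weights`.

    Largest-remainder rounding keeps the counts integral while making
    them sum exactly to `total` (an illustrative choice).
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must have a positive sum")
    shares = [w / weight_sum * total for w in weights]
    counts = [int(share) for share in shares]   # round every share down
    leftover = total - sum(counts)              # samples still unassigned
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:           # largest fractional parts first
        counts[i] += 1
    return counts


# Hypothetical number weights for three domains and 11 samples to generate.
example_counts = allocate_counts([5, 3, 2], 11)
```

With these hypothetical inputs the proportional shares are 5.5, 3.3, and 2.2, so the single leftover sample goes to the first domain and `example_counts` is `[6, 3, 2]`.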
The embodiment of the invention also provides a computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the unbalanced data oversampling method described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the flow of the method of the foregoing embodiments of the present invention may be accomplished by a computer program instructing related hardware, where the computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of each of the foregoing method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to an apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.
The embodiment of the invention also provides a terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the unbalanced data oversampling method described above when executing the computer program.
It should be noted that the terminal device may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The terminal device may also be a station (ST) in a WLAN, for example a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device having a wireless communication function, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, or the like. The embodiment of the invention does not limit the specific type of the terminal device.
The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.
The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may in other embodiments also be an external storage device of the terminal device, such as a plug-in hard disk provided on the terminal device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, etc., such as program code for the computer program, etc. The memory may also be used to temporarily store data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be found in the method embodiment section, and will not be described herein.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.
Claims (10)
1. A method for oversampling unbalanced data, comprising:
step 1, acquiring a credit card abnormal transaction data set to be processed, wherein the credit card abnormal transaction data set is used as an unbalanced data set, and the unbalanced data set comprises a minority sample set consisting of a plurality of minority samples and a majority sample set consisting of a plurality of majority samples;
step 2, randomly selecting part of minority class samples in the minority class sample set as core sample points, and determining a natural nearest neighbor set of each core sample point and a natural nearest neighbor domain corresponding to each natural nearest neighbor set; each of the natural nearest neighbor sets includes a plurality of nearest neighbor elements of the core sample point;
step 3, calculating the proportion of the majority sample in each natural nearest neighbor set according to the space distribution condition of each sample in the unbalanced data set;
step 4, determining the spatial distribution condition of each core sample point in the unbalanced data set according to the proportion of the majority sample in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in the natural nearest neighbor domain according to the spatial distribution condition of each core sample point in the unbalanced data set;
step 6, determining the position weight of a new sample point generated in each natural nearest neighbor domain according to the spatial distribution condition of each core sample point in the unbalanced data set;
step 7, acquiring sample features of a new sample generated in each natural nearest neighbor domain according to the quantity weight and the position weight, obtaining a new sample set based on the sample features, and combining the new sample set and the unbalanced data set to obtain a balanced data set for predicting financial fraud.
2. The unbalanced data oversampling method according to claim 1, further comprising, before step 2:
calculating the standardized Euclidean distance between two minority class samples, wherein the formula is as follows:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} \left( \frac{x_{ik} - x_{jk}}{s_k} \right)^2}$$
wherein $d(x_i, x_j)$ denotes the distance between the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$; $x_{ik}$ and $x_{jk}$ respectively denote the values of the $i$-th minority class sample $x_i$ and the $j$-th minority class sample $x_j$ in the $k$-th sample feature dimension; $s_k$ denotes the standard deviation of the minority class sample point set in the $k$-th sample feature dimension; and $m$ is the number of sample features.
3. The unbalanced data oversampling method of claim 2, wherein step 2 comprises:
randomly selecting a plurality of minority class samples in the minority class sample set as core sample points;
for the minority class samples other than the core sample point in the minority class sample set, if the nearest neighbor set of a minority class sample contains the core sample point, the minority class sample is considered a reverse neighbor element of the core sample point, and the reverse neighbor elements form the reverse neighbor set;
for the minority class samples other than the core sample point in the minority class sample set, if the nearest neighbor set of a minority class sample does not contain the core sample point, the minority class sample is considered an outlier and is discarded;
if the intersection is an empty set, redefining the neighbor value and again selecting the neighbor set and the reverse neighbor set;
if the intersection is a non-empty set, the natural nearest neighbor set is obtained, the neighbor value is redefined, and the natural nearest neighbor set is found repeatedly;
4. The unbalanced data oversampling method of claim 3, wherein the proportion of the majority class samples in each of the natural nearest neighbor sets is calculated by:
5. The unbalanced data oversampling method of claim 4, wherein step 4 comprises:
determining the sample generation control weight of each core sample point according to the proportion of the majority class samples in each natural nearest neighbor set;
wherein the sample generation control weight of the core sample point depends on a control parameter;
6. The unbalanced data oversampling method of claim 5, wherein the number weight of the new samples generated in the natural nearest neighbor domain is:
7. The unbalanced data oversampling method of claim 6, wherein the position weight of the new samples generated in the natural nearest neighbor domain is:
8. The unbalanced data oversampling method of claim 7, wherein step 7 comprises:
determining the number of new samples to be generated in the unbalanced data set, wherein the expression is:
calculating the number of new samples required to be generated in each natural nearest neighbor domain, wherein the expression is as follows:
generating, according to the regional sample generation formula, the sample features of the new samples for each natural nearest neighbor domain, wherein the regional sample generation formula is:
wherein the formula gives each sample feature of each new sample point generated in the natural nearest neighbor domain from the sample feature difference between the core sample point and the other sample points in that natural nearest neighbor domain and a random number with a value range of [0,1];
obtaining the new sample set from the sample features of the new samples generated in each natural nearest neighbor domain, each new sample being formed by its sample features;
and combining the new sample set and the unbalanced data set to obtain a balanced data set.
9. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the unbalanced data oversampling method according to any one of claims 1 to 7.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the unbalanced data oversampling method according to any one of claims 1 to 7 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310397766.7A CN116108387B (en) | 2023-04-14 | 2023-04-14 | Unbalanced data oversampling method and related equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310397766.7A CN116108387B (en) | 2023-04-14 | 2023-04-14 | Unbalanced data oversampling method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116108387A (en) | 2023-05-12 |
CN116108387B (en) | 2023-07-04 |
Family
ID=86264176
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310397766.7A Active CN116108387B (en) | 2023-04-14 | 2023-04-14 | Unbalanced data oversampling method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116108387B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868775A (en) * | 2016-03-23 | 2016-08-17 | 深圳市颐通科技有限公司 | Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm |
CN110275910A (en) * | 2019-06-20 | 2019-09-24 | 东北大学 | A kind of oversampler method of unbalanced dataset |
CN112633426A (en) * | 2021-03-11 | 2021-04-09 | 腾讯科技(深圳)有限公司 | Method and device for processing data class imbalance, electronic equipment and storage medium |
KR20220007470A (en) * | 2020-07-10 | 2022-01-18 | 박수환 | A Design of a Location-based Fraud Detection System in Mobile Payment Service Device and Operation Method using Machine Learning Technique |
CN114862404A (en) * | 2022-05-05 | 2022-08-05 | 湖北工业大学 | Credit card fraud detection method and device based on cluster samples and limit gradients |
US20220383322A1 (en) * | 2021-05-30 | 2022-12-01 | Actimize Ltd. | Clustering-based data selection for optimization of risk predictive machine learning models |
Also Published As
Publication number | Publication date |
---|---|
CN116108387B (en) | 2023-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9953160B2 (en) | Applying multi-level clustering at scale to unlabeled data for anomaly detection and security | |
Kuehnhausen et al. | Trusting smartphone apps? To install or not to install, that is the question | |
US20190035015A1 (en) | Method and apparatus for obtaining a stable credit score | |
WO2021159766A1 (en) | Data identification method and apparatus, and device, and readable storage medium | |
US10504028B1 (en) | Techniques to use machine learning for risk management | |
WO2020181907A1 (en) | Decision-making optimization method and apparatus | |
CN111090780A (en) | Method and device for determining suspicious transaction information, storage medium and electronic equipment | |
CN109598414A (en) | Risk evaluation model training, methods of risk assessment, device and electronic equipment | |
US20200286091A1 (en) | Automated multi-currency refund service | |
WO2023009590A1 (en) | Authenticating based on user behavioral transaction patterns | |
CN111275416A (en) | Digital currency abnormal transaction detection method and device, electronic equipment and medium | |
CN111582872A (en) | Abnormal account detection model training method, abnormal account detection device and abnormal account detection equipment | |
CN116108387B (en) | Unbalanced data oversampling method and related equipment | |
CN111275071B (en) | Prediction model training method, prediction device and electronic equipment | |
CN111245815A (en) | Data processing method, data processing device, storage medium and electronic equipment | |
CN112446777A (en) | Credit evaluation method, device, equipment and storage medium | |
CN115481300A (en) | Data imbalance classification oversampling method, device, equipment and medium based on natural neighborhood density | |
CN114003648B (en) | Identification method and device for risk transaction group partner, electronic equipment and storage medium | |
CN108235228B (en) | Safety verification method and device | |
CN112488825B (en) | Object transaction method and device based on blockchain | |
CN115601044A (en) | Fraud detection model training method, fraud detection device and electronic equipment | |
CN113177609A (en) | Method, device, system and storage medium for processing data class imbalance | |
CN113988670A (en) | Comprehensive enterprise credit risk early warning method and system | |
CN114706899A (en) | Express delivery data sensitivity calculation method and device, storage medium and equipment | |
CN111860655A (en) | User processing method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||