CN116108387A - Unbalanced data oversampling method and related equipment - Google Patents

Unbalanced data oversampling method and related equipment

Info

Publication number: CN116108387A
Authority: CN (China)
Prior art keywords: sample, samples, nearest neighbor, natural, core sample
Legal status: Granted, active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202310397766.7A
Other languages: Chinese (zh)
Other versions: CN116108387B (en)
Inventors: 刘利枚, 黄志伟, 刘星宝, 石彪
Current Assignee: Hunan University of Technology (the listed assignees may be inaccurate)
Original Assignee: Hunan University of Technology
Application filed by: Hunan University of Technology
Priority to: CN202310397766.7A
Publication of CN116108387A
Application granted
Publication of CN116108387B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an unbalanced data oversampling method and related equipment. The method comprises: acquiring a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as an unbalanced data set; randomly selecting a plurality of minority class samples as core sample points, and determining the natural nearest neighbor set and the natural neighborhood of each core sample point; calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of the samples in the unbalanced data set; determining, according to that proportion, the spatial distribution of each core sample point in the unbalanced data set and the number weight and position weight of the new samples to be generated; obtaining the sample features of the new samples according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. The accuracy of predicting financial fraud is thereby improved.

Description

Unbalanced data oversampling method and related equipment
Technical Field
The invention relates to the technical field of financial unbalanced data processing, in particular to an unbalanced data oversampling method and related equipment.
Background
With the continuous development of artificial intelligence technology, the techniques for collecting, storing and processing data are also advancing. Machine learning and data mining techniques that draw on multiple disciplines have become important methods for analyzing and processing data and converting it into useful knowledge. Conventional machine learning generally assumes that the distribution of data categories is balanced, with each category having a comparable number of samples. In practice, however, imbalanced category distributions are prevalent across application areas. For example, in credit card fraud detection, fraudulent transactions may account for only 1% of all transactions, so an algorithm that simply labels every transaction as normal obtains a classification accuracy of 99%; such a classifier ignores fraudulent transactions entirely and can cause serious losses to businesses and individuals. Therefore, balancing data with class imbalance has high research value and broad application prospects.
Existing class imbalance processing mainly consists of oversampling the minority class samples, undersampling the majority class samples, or a combination of both. Oversampling refers to balancing the data classes by adding minority class samples through some method or technique.
The standardized Euclidean distance is the Euclidean distance computed after the value of the samples in each dimension has been normalized to have an expectation of 0 and a variance of 1.
Natural nearest neighbors and the natural neighborhood are defined as follows: given a neighbor value [Figure SMS_3] and a sample point set [Figure SMS_5], if for [Figure SMS_7] the samples [Figure SMS_2] and [Figure SMS_6] are each among the other's [Figure SMS_8] nearest sample points, then [Figure SMS_9] and [Figure SMS_1] are natural neighbors of each other; the region enclosed by the lines connecting the neighboring points is the natural neighborhood, and [Figure SMS_4] is the natural nearest neighbor value.
At present, most existing oversampling methods are based on the SMOTE algorithm, which generates a given number of minority class sample points by randomly selecting minority class samples and linearly interpolating between them and their neighbor samples. The core of the algorithm is the [Figure SMS_10] nearest neighbor algorithm, in which determining the nearest-neighbor value [Figure SMS_11] is complicated, and setting a fixed [Figure SMS_12] value can cause problems such as degraded quality of the generated samples. Moreover, the SMOTE method is insensitive to outliers among the minority class samples: outliers are easily picked when sample points are selected for linear interpolation, so a large number of noise samples are generated.
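For illustration only, the linear interpolation step that SMOTE-style methods rely on can be sketched as follows. The function name `smote_interpolate` and the example coordinates are assumptions introduced here; the sketch omits neighbor selection entirely.

```python
import random

def smote_interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    gap = rng.random()  # uniform random number in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
new_point = smote_interpolate([1.0, 2.0], [3.0, 6.0], rng)
```

If `neighbor` happens to be an outlier, every point generated this way lies on a segment leading toward it, which is how the noise samples mentioned above arise.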
Disclosure of Invention
The invention provides an unbalanced data oversampling method and related equipment, and aims to eliminate interference of outliers on sample characteristics in a balanced data set and improve accuracy of predicting financial fraud.
In order to achieve the above object, the present invention provides a method for oversampling unbalanced data, comprising:
step 1, acquiring a credit card abnormal transaction data set to be processed as an unbalanced data set, wherein the unbalanced data set comprises a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples;
step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to each natural nearest neighbor set; each natural nearest neighbor set comprises a plurality of nearest neighbor elements of a core sample point;
step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of each sample in the unbalanced data set;
step 4, determining the spatial distribution of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in each natural neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;
step 6, determining the position weight of the new sample points generated in each natural neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;
step 7, obtaining the sample features of the new samples generated in each natural neighborhood according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.
Further, before step 2, the method includes:
the standardized Euclidean distance between two minority class samples is calculated as follows:
Figure SMS_13
wherein [Figure SMS_19] denotes the distance between the [Figure SMS_16]-th minority class sample [Figure SMS_26] and the [Figure SMS_17]-th minority class sample [Figure SMS_25]; [Figure SMS_22] and [Figure SMS_27] respectively denote the values of the [Figure SMS_18]-th minority class sample [Figure SMS_28] and the [Figure SMS_14]-th minority class sample [Figure SMS_23] in the [Figure SMS_15]-th sample feature dimension; [Figure SMS_24] denotes the standard deviation of the minority class sample point set [Figure SMS_21] in the [Figure SMS_29]-th sample feature dimension; and [Figure SMS_20] is the number of sample features.
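The formula image is not reproduced in this text, but the definitions above match the usual standardized Euclidean distance, in which each feature difference is divided by that feature's standard deviation before squaring and summing. A minimal sketch under that reading follows; the function names, the use of the population standard deviation, and the example data are assumptions.

```python
import math
import statistics

def feature_std(samples):
    """Standard deviation of each feature over the minority class sample set."""
    # Assumption: population standard deviation; features must not be constant.
    return [statistics.pstdev(column) for column in zip(*samples)]

def standardized_euclidean(a, b, std):
    """Euclidean distance after scaling each feature difference by its std."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, std)))

minority = [[0.0, 0.0], [2.0, 10.0], [4.0, 20.0]]
std = feature_std(minority)
d = standardized_euclidean(minority[0], minority[1], std)
```

Scaling by the per-feature standard deviation keeps a feature with a large numeric range (the second column here) from dominating the distance.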
Further, step 2 includes:
randomly selecting part of minority class samples in a minority class sample set as core sample points;
selecting, for each core sample point, [Figure SMS_30] neighbor elements of the core sample point;
the [Figure SMS_31] selected neighbor elements of the core sample point constitute the [Figure SMS_32]-neighbor set [Figure SMS_33];
for the minority class samples other than the core sample point in the minority class sample set, if the nearest neighbor set of a minority class sample contains the core sample point, the minority class sample is regarded as a reverse [Figure SMS_34]-neighbor element of the core sample point, and the reverse [Figure SMS_35]-neighbor elements constitute the reverse [Figure SMS_36]-neighbor set [Figure SMS_37];
for the minority class samples other than the core sample points in the minority class sample set, if the nearest neighbor set of a minority class sample does not contain the core sample points, the minority class sample is considered to be an outlier and is discarded;
computing the intersection of the [Figure SMS_38]-neighbor set [Figure SMS_39] and the reverse [Figure SMS_40]-neighbor set [Figure SMS_41];
if the intersection is empty, redefining [Figure SMS_42] and re-selecting the [Figure SMS_43]-neighbor set and the reverse [Figure SMS_44]-neighbor set;
if the intersection is non-empty, the natural nearest neighbor set is [Figure SMS_45]; redefining [Figure SMS_46] and repeatedly computing the natural nearest neighbor set [Figure SMS_47] until the reverse [Figure SMS_48]-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to each natural nearest neighbor set.
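The search described above can be sketched as follows. This is a simplified illustration, not the claimed procedure itself: it uses the plain Euclidean distance (`math.dist`) for brevity, assumes the neighbor value is increased by one in each round (the text only says it is redefined), and introduces a hypothetical cap `max_k` after which a core point with no reverse neighbors is treated as an outlier.

```python
import math

def knn(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def natural_neighbors(points, core, max_k=None):
    """Return (natural nearest neighbor set, neighbor value) for one core point."""
    max_k = max_k or len(points) - 1
    natural, k_used, prev_reverse = set(), 0, None
    for k in range(1, max_k + 1):
        forward = knn(points, core, k)
        # reverse neighbors: samples whose own k-neighbor set contains the core
        reverse = {j for j in range(len(points))
                   if j != core and core in knn(points, j, k)}
        common = forward & reverse
        if common:
            natural, k_used = common, k
            if reverse == prev_reverse:  # reverse set stopped changing
                break
        prev_reverse = reverse
    return natural, k_used  # an empty set marks the core point as an outlier

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
dense_result = natural_neighbors(points, 0)
outlier_result = natural_neighbors(points, 3, max_k=2)
```

In this example the isolated point at (10, 10) never appears in any other sample's neighbor set within the cap, so its result is empty and it would be discarded.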
Further, the proportion of majority class samples in each natural nearest neighbor set of the core sample points is calculated by the following expression:
Figure SMS_49
wherein [Figure SMS_50] denotes the proportion of majority class samples in the [Figure SMS_51] natural nearest neighbor set of the core sample point, [Figure SMS_52] is the number of majority class samples in the [Figure SMS_53] natural nearest neighbor set, and [Figure SMS_54] denotes the number of neighbor elements of the core sample point.
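Read from the definitions above, the proportion is the count of majority class samples in the neighbor set divided by the number of neighbor elements. A small sketch of that reading follows; the label encoding (0 for the majority class) and the function name are assumptions.

```python
def majority_ratio(neighbor_labels, majority_label=0):
    """Fraction of a core point's neighbor elements that are majority class."""
    if not neighbor_labels:
        raise ValueError("neighbor set must not be empty")
    majority_count = sum(1 for label in neighbor_labels if label == majority_label)
    return majority_count / len(neighbor_labels)

# Class labels of the five neighbor elements of one core sample point.
ratio = majority_ratio([0, 0, 1, 1, 0])
```

A ratio near 1 indicates a core sample point surrounded mostly by majority class samples; a ratio near 0 indicates one lying deep inside the minority class region.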
Further, step 4 includes:
according to the proportion of majority class samples in each natural nearest neighbor set:
if [Figure SMS_55], [Figure SMS_56];
if [Figure SMS_57], [Figure SMS_58];
if [Figure SMS_59], [Figure SMS_60];
wherein [Figure SMS_61] is the sample generation control weight of the core sample point, and [Figure SMS_62] is a control parameter, [Figure SMS_63];
the spatial distribution of each core sample point in the unbalanced data set is determined from the sample generation control weight [Figure SMS_64].
Further, the number weight [Figure SMS_65] of the new samples generated in the natural neighborhood is:
Figure SMS_66
wherein [Figure SMS_67] is the sample generation control weight of the core sample point, and [Figure SMS_68] denotes the sum of the sample generation control weights of the core sample points in the [Figure SMS_69] natural neighborhoods.
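The expression itself is not reproduced in this text. Since its terms are each core sample point's control weight and the sum of all control weights, one reading consistent with those definitions is a simple normalization; the sketch below shows that reading only and should be treated as an assumption, as should the function name and the example weights.

```python
def number_weights(control_weights):
    """Normalize control weights so that they sum to 1 (assumed reading)."""
    total = sum(control_weights)
    if total <= 0:
        raise ValueError("control weights must have a positive sum")
    return [w / total for w in control_weights]

# Hypothetical control weights for three core sample points.
weights = number_weights([1.0, 2.0, 1.0])
```

Under this reading, a core sample point with twice the control weight receives twice the share of the new samples.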
Further, the position weight of the new sample points generated in the natural neighborhood is:
Figure SMS_70
wherein [Figure SMS_71] is the sample generation control weight of the core sample point, and [Figure SMS_72] denotes the sum of the sample generation control weights of the core sample points in the [Figure SMS_73] natural neighborhoods.
Further, step 7 includes:
determining the number of new samples to be generated for the unbalanced data set by the following expression:
Figure SMS_74
wherein [Figure SMS_75] is a balancing parameter for controlling the number of new samples, [Figure SMS_76];
calculating the number of new samples to be generated in each natural neighborhood by the following expression:
Figure SMS_77
generating, for each natural neighborhood, the sample features of [Figure SMS_78] new samples according to the region sample generation formula, the region sample generation formula being:
Figure SMS_79
wherein [Figure SMS_80] denotes the [Figure SMS_82]-th sample feature of the new sample point generated from [Figure SMS_81], [Figure SMS_83] denotes the sample feature difference between the core sample point and another sample point in the natural neighborhood, and [Figure SMS_84] is a random number in the range [0, 1];
obtaining a new sample [Figure SMS_85] from the sample features generated in each natural neighborhood, the new sample [Figure SMS_86] being formed by [Figure SMS_87] sample features;
combining the [Figure SMS_88] new samples to obtain the new sample set [Figure SMS_89], and merging the new sample set with the unbalanced data set to obtain the balanced data set.
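As a rough sketch of step 7, the code below splits a given total across neighborhoods by number weight and then interpolates between each core sample point and a randomly chosen sample point in its neighborhood. Several things here are assumptions: the total is taken as an input because the expression for it is not reproduced, counts are obtained by rounding, the position weight is left out because its formula is likewise not reproduced, and all names and data are hypothetical.

```python
import random

def allocate_counts(number_weights, total_new):
    """Split the total number of new samples across neighborhoods by weight."""
    return [round(w * total_new) for w in number_weights]

def generate_samples(core, neighbors, count, rng):
    """Create `count` points between `core` and randomly chosen neighbors."""
    new_samples = []
    for _ in range(count):
        other = rng.choice(neighbors)
        lam = rng.random()  # random number in [0, 1)
        # core feature plus a random fraction of the feature difference
        new_samples.append([c + lam * (o - c) for c, o in zip(core, other)])
    return new_samples

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
neighborhoods = [
    ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),  # (core point, neighbor points)
    ([5.0, 5.0], [[6.0, 5.0]]),
]
counts = allocate_counts([0.75, 0.25], total_new=4)
new_set = []
for (core, neighbors), count in zip(neighborhoods, counts):
    new_set.extend(generate_samples(core, neighbors, count, rng))
```

Appending `new_set` to the original minority class samples, alongside the unchanged majority class samples, would give the merged data set described above.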
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the unbalanced data oversampling method.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the unbalanced data oversampling method when executing the computer program.
The scheme of the invention has the following beneficial effects:
The invention takes a credit card abnormal transaction data set, comprising a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples, as an unbalanced data set; randomly selects part of the minority class samples in the minority class sample set as core sample points, and determines the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to each natural nearest neighbor set; calculates the proportion of majority class samples in each natural nearest neighbor set of the core sample points according to the spatial distribution of each sample in the unbalanced data set; determines, according to that proportion, the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in the natural neighborhood, and the position weight of the new sample points generated in the natural neighborhood; and, according to the number weight and the position weight, obtains the sample features of the new samples generated in each natural neighborhood, obtains a new sample set based on the sample features, and merges the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need in traditional oversampling methods to repeatedly determine the neighbor value and enables adaptive selection of a sample's neighboring points; it eliminates the interference of outliers with the sample features in the balanced data set; and, within each natural neighborhood formed, it adaptively allocates the number of samples to be generated according to the distribution of the data around the minority class sample points. This improves the quality of the generated samples, enlarges the range over which samples are generated, and improves the precision of predicting financial fraud.
Other advantageous effects of the present invention will be described in detail in the detailed description section which follows.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a flowchart showing step 2 according to an embodiment of the present invention;
FIG. 3 is a flowchart showing steps 3-6 in an embodiment of the present invention;
FIG. 4 is a flowchart showing step 7 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of identifying outliers according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of natural nearest neighbor and natural neighbor selection of a core sample point according to an embodiment of the present invention;
FIG. 7 is a schematic diagram, for [Figure SMS_90], of the core sample point being an outlier according to an embodiment of the present invention;
FIG. 8 is a schematic diagram, for [Figure SMS_91], of the nearest neighbor elements of the core sample point according to an embodiment of the present invention;
FIG. 9 is a schematic diagram, for [Figure SMS_92], of the nearest neighbor elements of the core sample point according to an embodiment of the present invention;
FIG. 10 is a schematic diagram, for [Figure SMS_93], of the nearest neighbor elements of the core sample point according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a natural nearest neighbor of a core sample point according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of generating a new sample according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, a locked connection, a removable connection, or an integral connection; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features described below in the various embodiments of the invention may be combined with one another as long as they do not conflict.
To address the existing problems, the invention provides an unbalanced data oversampling method and related equipment.
As shown in fig. 1, an embodiment of the present invention provides an unbalanced data oversampling method, including:
step 1, acquiring a credit card abnormal transaction data set to be processed as an unbalanced data set, wherein the unbalanced data set comprises a minority class sample set consisting of a plurality of minority class samples and a majority class sample set consisting of a plurality of majority class samples;
step 2, randomly selecting part of the minority class samples in the minority class sample set as core sample points, and determining the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to each natural nearest neighbor set; each natural nearest neighbor set comprises a plurality of nearest neighbor elements of a core sample point;
step 3, calculating the proportion of majority class samples in each natural nearest neighbor set according to the spatial distribution of each sample in the unbalanced data set;
step 4, determining the spatial distribution of each core sample point in the unbalanced data set according to the proportion of majority class samples in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in each natural neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;
step 6, determining the position weight of the new sample points generated in each natural neighborhood according to the spatial distribution of each core sample point in the unbalanced data set;
step 7, obtaining the sample features of the new samples generated in each natural neighborhood according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.
Specifically, step 1 includes: acquiring a credit card abnormal transaction data set to be processed as the unbalanced data set [Figure SMS_94], wherein the unbalanced data set [Figure SMS_95] comprises a minority class sample set [Figure SMS_96] consisting of a plurality of minority class samples and a majority class sample set [Figure SMS_97] consisting of a plurality of majority class samples, and [Figure SMS_98], [Figure SMS_99].
Specifically, before step 2, it includes:
calculating the standardized Euclidean distance between every two minority class samples, the distance set being denoted [Figure SMS_100], [Figure SMS_101], wherein the distance set from minority class sample [Figure SMS_102] to the other minority class samples is [Figure SMS_103]; the standardized Euclidean distance formula is as follows:
Figure SMS_104
wherein [Figure SMS_113] denotes the distance between the [Figure SMS_112]-th minority class sample [Figure SMS_119] and the [Figure SMS_111]-th minority class sample [Figure SMS_120]; [Figure SMS_109] and [Figure SMS_114] respectively denote the values of the [Figure SMS_106]-th minority class sample [Figure SMS_116] and the [Figure SMS_105]-th minority class sample [Figure SMS_115] in the [Figure SMS_110]-th feature dimension; [Figure SMS_118] denotes the standard deviation of the minority class sample point set [Figure SMS_108] in the [Figure SMS_117]-th feature dimension; and [Figure SMS_107] is the number of sample features.
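The per-sample distance sets mentioned above can be sketched as sorted lists, which makes the later "select in ascending order" step a simple slice. The function name `distance_sets` is an assumption, and `math.dist` (plain Euclidean distance) stands in for the standardized distance so the sketch stays short; any two-argument distance function can be passed instead.

```python
import math

def distance_sets(samples, dist=math.dist):
    """For each sample i, the (distance, j) pairs to every other sample, ascending."""
    return [
        sorted((dist(a, b), j) for j, b in enumerate(samples) if j != i)
        for i, a in enumerate(samples)
    ]

sets = distance_sets([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
nearest_to_first = sets[0][0]  # closest other sample to sample 0
```

Computing the sets once and reusing them avoids recomputing distances each time the neighbor value is enlarged.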
Specifically, as shown in fig. 2, step 2 includes:
randomly selecting part of minority class samples in a minority class sample set as core sample points;
selecting, for each core sample point, [Figure SMS_121] neighbor elements of the core sample point;
the [Figure SMS_122] selected neighbor elements of the core sample point constitute the [Figure SMS_123]-neighbor set [Figure SMS_124];
for the minority class samples other than the core sample point in the minority class sample set, if the nearest neighbor set of a minority class sample contains the core sample point, the minority class sample is regarded as a reverse [Figure SMS_125]-neighbor element of the core sample point, and the reverse [Figure SMS_126]-neighbor elements constitute the reverse [Figure SMS_127]-neighbor set [Figure SMS_128];
for the minority class samples other than the core sample point in the minority class sample set, if the nearest neighbor set of a minority class sample does not contain the core sample point, the minority class sample is considered to be an outlier and is discarded;
computing the intersection of the [Figure SMS_129]-neighbor set [Figure SMS_130] and the reverse [Figure SMS_131]-neighbor set [Figure SMS_132];
if the intersection is empty, redefining [Figure SMS_133] and re-selecting the [Figure SMS_134]-neighbor set and the reverse [Figure SMS_135]-neighbor set;
if the intersection is non-empty, the natural nearest neighbor set is [Figure SMS_136]; redefining [Figure SMS_137] and repeatedly computing the natural nearest neighbor set [Figure SMS_138] until the reverse [Figure SMS_139]-neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to each natural nearest neighbor set.
In the embodiment of the invention, the number of neighbor elements is initialized as [Figure SMS_140]; from the set of distances between the core sample point and its neighboring elements, the [Figure SMS_141] nearest neighbor elements are selected in ascending order of distance, the element with the smallest distance value being the first nearest neighbor element, to form a nearest neighbor set that does not contain the core sample point itself; for example, the [Figure SMS_143]-neighbor set of core sample point [Figure SMS_142] is [Figure SMS_144];
for the current [Figure SMS_147], if the nearest neighbor set of a minority class sample other than the core sample point contains the core sample point [Figure SMS_149], that minority class sample is a reverse [Figure SMS_146]-neighbor element of core sample point [Figure SMS_151], and the set of such elements is denoted [Figure SMS_148]; if core sample point [Figure SMS_150] has no reverse [Figure SMS_152]-neighbor, the number of nearest neighbor elements is redefined as [Figure SMS_145] and the two steps above are repeated; if the point still has no reverse neighbor, it is judged to be an outlier, the minority class sample is discarded, and a core sample point is reselected;
finding the intersection of the [Figure SMS_154]-neighbor set [Figure SMS_155] and the reverse [Figure SMS_156]-neighbor set [Figure SMS_157] of core sample point [Figure SMS_153], which is the natural nearest neighbor set [Figure SMS_158], i.e. [Figure SMS_159];
judging whether the reverse [Figure SMS_161]-neighbor set [Figure SMS_164] has grown; if neighbor elements have been added to the reverse [Figure SMS_166]-neighbor set [Figure SMS_162], or [Figure SMS_163], defining [Figure SMS_165] and repeating the three steps above; if not, the natural nearest neighbor corresponding to core sample point [Figure SMS_167] is [Figure SMS_160], and the corresponding natural neighborhood is the spatial region formed by the elements of the natural nearest neighbor set;
the unbalanced data set is searched repeatedly in this way to obtain the natural nearest neighbor set of each core sample point and the natural neighborhood corresponding to it.
Specifically, as shown in fig. 3, step 3 includes:
selecting different neighbor elements, and calculating, in the sample space of the whole unbalanced data set, the proportion [Figure SMS_169] of majority class samples in the natural nearest neighbor set of core sample point [Figure SMS_168]; the calculation formula is as follows:
Figure SMS_170
wherein [Figure SMS_171] denotes the proportion of majority class samples in the [Figure SMS_172] natural nearest neighbor set of the core sample point, [Figure SMS_173] is the number of majority class samples in the [Figure SMS_174] natural nearest neighbor set, and [Figure SMS_175] denotes the number of neighbor elements of the core sample point.
Specifically, step 4 includes:
according to the proportion of majority class samples in each natural nearest neighbor set, increasing the data generation weight of the core sample points whose natural nearest neighbor sets contain more majority class sample points, namely:
if [Figure SMS_176], [Figure SMS_177];
if [Figure SMS_178], [Figure SMS_179];
if [Figure SMS_180], [Figure SMS_181];
wherein [Figure SMS_182] is the sample generation control weight of the core sample point, and [Figure SMS_183] is a control parameter, [Figure SMS_184];
the spatial distribution of each core sample point in the unbalanced data set is determined from the sample generation control weight [Figure SMS_185].
Specifically, the number weight [Figure SMS_186] of the minority class samples generated in the natural neighborhood is:
Figure SMS_187
wherein [Figure SMS_188] is the sample generation control weight of the core sample point, and [Figure SMS_189] denotes the sum of the sample generation control weights of the core sample points in the [Figure SMS_190] natural neighborhoods.
Specifically, the position weight of the minority class sample points generated in the natural neighborhood is:
Figure SMS_191
wherein [Figure SMS_192] is the sample generation control weight of the core sample point, and [Figure SMS_193] denotes the sum of the sample generation control weights of the core sample points in the [Figure SMS_194] natural neighborhoods.
Specifically, as shown in FIG. 4, step 7 includes:
determining the number of new samples to be generated in the unbalanced data set, expressed as:
Figure SMS_195
where Figure SMS_196 is a balance parameter controlling the number of new samples, Figure SMS_197;
calculating the number of new samples to be generated in each natural nearest neighbor domain, expressed as:
Figure SMS_198
for each natural nearest neighbor domain, generating the sample features of Figure SMS_199 new samples according to the regional sample generation formula:
Figure SMS_200
where Figure SMS_201 denotes the Figure SMS_203-th sample feature of the new sample point generated for Figure SMS_202, Figure SMS_204 denotes the sample feature difference between the core sample point and the other sample points in the natural nearest neighbor domain, and Figure SMS_205 is a random number in the range [0, 1];
obtaining a new sample Figure SMS_206 from the sample features of the new samples generated in each natural nearest neighbor domain, the new sample Figure SMS_207 being composed of Figure SMS_208 sample features;
combining the Figure SMS_209 new samples to obtain the new sample set Figure SMS_210;
merging the new sample set with the unbalanced data set to obtain a balanced data set.
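The flow of step 7 can be sketched as follows. The formulas are images that are not reproduced in this text, so everything here is an assumption: the total count is taken as the balance parameter times the gap between the class sizes, each domain's count as its number weight times that total, and each new sample as the core sample point plus a random fraction of the difference to a neighbor. How the position weight enters the generation formula cannot be recovered from the text and is left out. All function and parameter names are hypothetical.

```python
import numpy as np


def generate_in_domain(core, neighbors, n_new, rng):
    """Create n_new points between a core sample point and randomly chosen neighbors."""
    core = np.asarray(core, dtype=float)
    neighbors = np.asarray(neighbors, dtype=float)
    picks = rng.integers(0, len(neighbors), size=n_new)  # one neighbor per new sample
    r = rng.random((n_new, 1))                           # random numbers in [0, 1)
    return core + r * (neighbors[picks] - core)          # feature-wise interpolation


def oversample(n_majority, n_minority, weights, cores, domains, beta=1.0, seed=0):
    """Allocate new samples to each domain by weight, then generate them."""
    rng = np.random.default_rng(seed)
    total = int(round(beta * (n_majority - n_minority)))  # assumed total count
    batches = []
    for w, core, neighbors in zip(weights, cores, domains):
        n_new = int(round(w * total))                     # assumed per-domain count
        if n_new > 0:
            batches.append(generate_in_domain(core, neighbors, n_new, rng))
    return np.vstack(batches) if batches else np.empty((0, 0))
```

The returned array would then be stacked with the original data to form the balanced data set.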
Specifically, regarding the identification and discarding of outliers, as shown in FIGS. 5 and 6, when the core sample point is the outlier Figure SMS_213, with Figure SMS_214, the nearest neighbor element of point Figure SMS_217 is sample Figure SMS_212, and the nearest neighbor element of sample Figure SMS_215 is sample Figure SMS_216, so the core sample point Figure SMS_218 has no reverse Figure SMS_211 neighbor element;
Figure SMS_219 is redefined and the loop continues;
when Figure SMS_223, as shown in FIG. 7, the nearest neighbor elements of the core sample point Figure SMS_226 are sample Figure SMS_228 and sample Figure SMS_222, the nearest neighbor elements of sample point Figure SMS_225 are sample Figure SMS_229 and sample Figure SMS_231, and the nearest neighbor elements of sample Figure SMS_220 are sample Figure SMS_224 and sample Figure SMS_227; the reverse neighbor set of the core sample point Figure SMS_230 is therefore still an empty set, so the core sample point Figure SMS_221 is identified as an outlier.
As shown in FIG. 8, when the core sample point is Figure SMS_234, with Figure SMS_237, the nearest neighbor element of the core sample point is sample Figure SMS_241, and the nearest neighbor element of sample Figure SMS_235 is sample Figure SMS_238; sample Figure SMS_240 is therefore a reverse Figure SMS_232 neighbor element of the core sample point Figure SMS_243 and is also in the nearest neighbor set of the core sample point Figure SMS_236, so sample Figure SMS_239 is a natural nearest neighbor element of the core sample point Figure SMS_242; Figure SMS_233 is defined and the next round is carried out;
when Figure SMS_246, as shown in FIG. 9, the nearest neighbor elements of the core sample point Figure SMS_250 are sample Figure SMS_255 and sample Figure SMS_245; the nearest neighbor elements of sample Figure SMS_248 are the core sample point Figure SMS_252 and sample Figure SMS_256; the nearest neighbor elements of sample Figure SMS_244 are the core sample point Figure SMS_249 and sample Figure SMS_253; sample Figure SMS_257 and sample Figure SMS_247 are therefore natural nearest neighbor elements of the core sample point Figure SMS_251; Figure SMS_254 is defined and the next round is carried out;
when Figure SMS_267, as shown in FIG. 10, the nearest neighbor elements of the core sample point Figure SMS_265 are sample Figure SMS_277, sample Figure SMS_262 and sample Figure SMS_275; the nearest neighbor elements of sample Figure SMS_268 are the core sample point Figure SMS_279, sample Figure SMS_264 and sample Figure SMS_274; the nearest neighbor elements of sample Figure SMS_258 are the core sample point Figure SMS_270, sample Figure SMS_259 and sample Figure SMS_271; the nearest neighbor elements of sample Figure SMS_269 are sample Figure SMS_278, sample Figure SMS_266 and sample Figure SMS_273; the natural reverse Figure SMS_276 neighbor set of the core sample point Figure SMS_263 is unchanged, so the natural nearest neighbor elements of the core sample point Figure SMS_260 are Figure SMS_272 and Figure SMS_261, and the natural nearest neighbor domain is shown in FIG. 11;
determining the natural nearest neighbor sets and natural nearest neighbor domains of the remaining core sample points, computing the generation number weights and sample generation weights of the sample points in their respective natural nearest neighbor domains, and generating the sample features of Figure SMS_280 new samples according to the number weight, the position weight and the regional sample generation formula, thereby generating new minority-class samples, as shown in FIG. 12.
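The search walked through in FIGS. 5 to 11 can be sketched as below. This is an illustrative reading of the text, not the patented implementation: the neighbor count starts at 1 and grows by 1 per round, a sample is a natural nearest neighbor when it is both a neighbor and a reverse neighbor of the core sample point, the search stops when the reverse neighbor set stops changing, and a core sample point whose intersection is still empty at the second round is treated as an outlier (mirroring FIG. 7). Distances use a standardized Euclidean form with the population standard deviation, which is an assumption. The names `natural_neighbors` and `outlier_k` are hypothetical.

```python
import numpy as np


def natural_neighbors(X, core, outlier_k=2):
    """Return the natural nearest neighbor indices of X[core], or None for an outlier."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0                      # avoid division by zero
    Z = X / scale                                # standardized features (assumption)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    order = np.argsort(dist, axis=1)             # neighbors of every point, nearest first
    prev_reverse = None
    natural = set()
    for k in range(1, n):
        knn = set(order[core, :k].tolist())
        # reverse neighbors: samples whose own k nearest neighbors include the core point
        reverse = {j for j in range(n) if j != core and core in order[j, :k]}
        natural = knn & reverse
        if not natural:
            if k >= outlier_k:
                return None                      # still empty: treat as outlier
            continue
        if reverse == prev_reverse:
            return natural                       # reverse set stable: stop
        prev_reverse = reverse
    return natural or None
```

For example, with `X = [[0, 0], [1, 0], [0, 2], [3, 3], [50, 50]]`, the far point at index 4 is never among another sample's nearest neighbors and is reported as an outlier, while index 0 obtains its two closest samples as natural nearest neighbors.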
In the embodiment of the invention, as an example, the unbalanced data set obtained is a credit card abnormal transaction data set with a class ratio of 12:1;
Step 2: a core sample point Figure SMS_281 = [1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787] is randomly selected, where the transaction features are [regional economy information, social status information, transaction time, transaction amount period, geographical position, time difference of geographical position, transaction amount]; because financial data is private, the embodiment of the invention uses desensitized data;
first, the distances between the core sample point Figure SMS_284 and the other sample points are calculated, and Figure SMS_287 is selected; the nearest neighbor element of Figure SMS_290 is sample Figure SMS_283 = [1.2498, -0.7183, -5.3903, 6.4542, -8.4853, 0.6353, -7.0199], and the nearest neighbor element of sample Figure SMS_286 is the core sample point Figure SMS_288, so sample Figure SMS_291 is a natural reverse Figure SMS_285 neighbor element of the core sample point Figure SMS_282; Figure SMS_289 is defined and the loop continues;
for Figure SMS_293, the nearest neighbor elements of the core sample point Figure SMS_297 are sample Figure SMS_300 and sample Figure SMS_294, where sample Figure SMS_298 = [1.7035, -1.3053, -6.7167, 6.3536, -8.6016, 0.4499, -7.5062]; the nearest neighbor elements of sample Figure SMS_301 are sample Figure SMS_303 and sample Figure SMS_292, so sample Figure SMS_296 is a natural reverse Figure SMS_302 neighbor element of the core sample point Figure SMS_299; Figure SMS_295 is defined and the loop continues;
for Figure SMS_306, the nearest neighbor elements of the core sample point Figure SMS_309 are sample Figure SMS_314, sample Figure SMS_307 and sample Figure SMS_311, where sample Figure SMS_315 = [1.7017, -1.4394, -6.9999, 6.3162, -8.6708, 0.316, -7.4177]; the nearest neighbor elements of sample Figure SMS_317 are sample Figure SMS_304, sample Figure SMS_308 and sample Figure SMS_312, so sample Figure SMS_316 is a natural reverse Figure SMS_310 neighbor element of the core sample point Figure SMS_305; Figure SMS_313 is defined and the loop continues;
for Figure SMS_323, the nearest neighbor elements of the core sample point Figure SMS_322 are sample Figure SMS_336, sample Figure SMS_321, sample Figure SMS_332 and sample Figure SMS_328, where sample Figure SMS_339 = [1.5156, -1.2072, -6.2346, 5.4507, -7.3337, 1.3612, -6.6081]; the nearest neighbor elements of sample Figure SMS_320 are sample Figure SMS_335, sample Figure SMS_318, sample Figure SMS_330 and sample Figure SMS_325, so sample Figure SMS_337 is not a reverse Figure SMS_334 neighbor element of the core sample point Figure SMS_324; the natural nearest neighbor set of the core sample point Figure SMS_326 is therefore {Figure SMS_338, Figure SMS_327, Figure SMS_333}, and the natural nearest neighbor domain is Figure SMS_319: the area formed by the connecting lines between the points Figure SMS_331, Figure SMS_329.
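The neighbor order in this worked example can be checked numerically from the vectors quoted above. The sketch ranks the four quoted samples by their distance to the core sample point. It uses the plain Euclidean distance as a stand-in, because the per-feature standard deviations needed for the standardized distance are not given in the text; the variable names are hypothetical.

```python
import numpy as np

# Core sample point and the four samples quoted in the example, in order of appearance.
core = np.array([1.2023, -0.6947, -5.5263, 6.6624, -8.5255, 0.7427, -7.6787])
candidates = np.array([
    [1.2498, -0.7183, -5.3903, 6.4542, -8.4853, 0.6353, -7.0199],
    [1.7035, -1.3053, -6.7167, 6.3536, -8.6016, 0.4499, -7.5062],
    [1.7017, -1.4394, -6.9999, 6.3162, -8.6708, 0.3160, -7.4177],
    [1.5156, -1.2072, -6.2346, 5.4507, -7.3337, 1.3612, -6.6081],
])

# Plain Euclidean distance (assumption; the method itself standardizes each feature).
distances = np.linalg.norm(candidates - core, axis=1)
ranking = np.argsort(distances)  # candidate indices, nearest first
print(ranking.tolist(), distances.round(4).tolist())
```

Under this stand-in distance, the samples rank in the same order in which the example introduces them as the neighbor count grows from 1 to 4.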
Step 3: first, the proportion of majority-class samples in the natural nearest neighbor set of each core sample point is calculated; for the core sample point Figure SMS_340, the proportion of majority-class samples in its natural nearest neighbor set is Figure SMS_341 Figure SMS_342, so the sample generation control weight is Figure SMS_343;
Step 4: based on the weights of the other core sample points, the formula Figure SMS_344 gives the number weight of the minority-class samples generated in the natural nearest neighbor domain, Figure SMS_345;
Step 5: the proportions of majority-class samples in the natural nearest neighbor sets of the natural nearest neighbor elements Figure SMS_352, Figure SMS_355, Figure SMS_348 of the core sample point Figure SMS_349 are calculated, where Figure SMS_350, Figure SMS_354, Figure SMS_356; therefore Figure SMS_346, Figure SMS_351, and the formula Figure SMS_353 gives Figure SMS_357, Figure SMS_347;
Step 6: first, the number of samples to be generated is determined according to the formula Figure SMS_358, where Figure SMS_359 defaults to 1, giving Figure SMS_360; from the formula Figure SMS_361, the number of samples to be generated for the core sample point Figure SMS_362 is Figure SMS_363; from the formula Figure SMS_364 Figure SMS_365, a new sample is obtained as [1.0732, -0.504, -5.1509, 6.7533, -8.4891, 0.8524, -7.7515];
the new sample set is Figure SMS_366, and the specific data of Figure SMS_367 are as follows:
{1.0732,-0.504,-5.1509,6.7533,-8.4891,0.8524,-7.7515
1.1313,-0.5899,-5.3199,6.7124,-8.5055,0.803,-7.7187
1.1397,-0.6022,-5.3443,6.7065,-8.5078,0.7959,-7.714
……
1.1074,-0.5546,-5.2505,6.7292,-8.4988,0.8233,-7.7322}。
The embodiment of the invention takes as the unbalanced data set a credit card abnormal transaction data set comprising a minority-class sample set consisting of a plurality of minority-class samples and a majority-class sample set consisting of a plurality of majority-class samples. Part of the minority-class samples in the minority-class sample set are randomly selected as core sample points, and the natural nearest neighbor set of each core sample point and the natural nearest neighbor domain corresponding to each natural nearest neighbor set are determined. The proportion of majority-class samples in each natural nearest neighbor set is calculated from the spatial distribution of the samples in the unbalanced data set. From that proportion, the spatial distribution of each core sample point in the unbalanced data set, the number weight of the new samples generated in each natural nearest neighbor domain and the position weight of the new sample points generated in each natural nearest neighbor domain are determined. According to the number weight and the position weight, the sample features of the new samples generated in each natural nearest neighbor domain are obtained, a new sample set is obtained from those sample features, and the new sample set is merged with the unbalanced data set to obtain a balanced data set for predicting financial fraud. Compared with the prior art, introducing the natural nearest neighbor method removes the need of traditional oversampling methods to repeatedly determine the neighbor value, enables adaptive selection of the neighbor points of a sample, and eliminates the interference of outliers with the sample features in the balanced data set. Within each natural nearest neighbor domain formed, the number of samples to be generated is allocated adaptively according to the distribution of the data around the minority-class sample points, which improves the quality of the generated samples, enlarges the range of the generated samples, and improves the accuracy of predicting financial fraud.
The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the unbalanced data oversampling method described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the flow of the method embodiments of the present invention may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the foregoing method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to an apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.
The embodiment of the invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the unbalanced data oversampling method described above when executing the computer program.
It should be noted that the terminal device may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like; the terminal device may also be a station (ST) in a WLAN, for example a cellular phone, a cordless phone, a session initiation protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device having a wireless communication function, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, or the like. The embodiment of the invention does not limit the specific type of the terminal device.
The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.
The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may in other embodiments also be an external storage device of the terminal device, such as a plug-in hard disk provided on the terminal device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, etc., such as program code for the computer program, etc. The memory may also be used to temporarily store data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be found in the method embodiment section, and will not be described herein.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (10)

1. A method for oversampling unbalanced data, comprising:
step 1, acquiring a credit card abnormal transaction data set to be processed, wherein the credit card abnormal transaction data set is used as an unbalanced data set, and the unbalanced data set comprises a minority sample set consisting of a plurality of minority samples and a majority sample set consisting of a plurality of majority samples;
step 2, randomly selecting part of minority class samples in the minority class sample set as core sample points, and determining a natural nearest neighbor set of each core sample point and a natural nearest neighbor corresponding to each natural nearest neighbor set; each of the natural nearest neighbor sets includes a plurality of nearest neighbor elements of the core sample point;
step 3, calculating the proportion of the majority sample in each natural nearest neighbor set according to the space distribution condition of each sample in the unbalanced data set;
step 4, determining the spatial distribution condition of each core sample point in the unbalanced data set according to the proportion of the majority sample in each natural nearest neighbor set;
step 5, determining the number weight of the new samples generated in the natural nearest neighbor domain according to the spatial distribution condition of each core sample point in the unbalanced data set;
step 6, determining the position weight of a new sample point generated in each natural nearest neighbor according to the spatial distribution condition of each core sample point in the unbalanced data set;
step 7, obtaining the sample features of the new samples generated in each natural nearest neighbor domain according to the number weight and the position weight, obtaining a new sample set based on the sample features, and merging the new sample set with the unbalanced data set to obtain a balanced data set for predicting financial fraud.
2. The unbalanced data oversampling method according to claim 1, comprising, before step 2:
calculating the standardized Euclidean distance between two minority-class samples by the formula:
Figure QLYQS_1
where Figure QLYQS_3 denotes the distance between the Figure QLYQS_10-th minority-class sample Figure QLYQS_17 and the Figure QLYQS_9-th minority-class sample Figure QLYQS_16; Figure QLYQS_5 and Figure QLYQS_14 respectively denote the values of the Figure QLYQS_6-th minority-class sample Figure QLYQS_13 and the Figure QLYQS_2-th minority-class sample Figure QLYQS_11 in the Figure QLYQS_4-th sample feature dimension; Figure QLYQS_12 denotes the standard deviation of the minority-class sample point set Figure QLYQS_8 in the Figure QLYQS_15-th sample feature dimension; and Figure QLYQS_7 is the number of sample features.
3. The unbalanced data oversampling method according to claim 2, wherein step 2 comprises:
randomly selecting a plurality of minority-class samples in the minority-class sample set as core sample points;
for each core sample point, selecting Figure QLYQS_18 neighbor elements of the core sample point;
the selected Figure QLYQS_19 neighbor elements of the core sample point constituting the Figure QLYQS_20 neighbor set Figure QLYQS_21;
for the minority-class samples in the minority-class sample set other than the core sample point, if the nearest neighbor set of a minority-class sample contains the core sample point, regarding that minority-class sample as an inverse Figure QLYQS_22 neighbor element of the core sample point, the inverse Figure QLYQS_23 neighbor elements constituting the inverse Figure QLYQS_24 neighbor set Figure QLYQS_25;
for the minority-class samples in the minority-class sample set other than the core sample point, if the nearest neighbor set of a minority-class sample does not contain the core sample point, regarding that minority-class sample as an outlier and discarding it;
computing the intersection of the Figure QLYQS_26 neighbor set Figure QLYQS_27 and the inverse Figure QLYQS_28 neighbor set Figure QLYQS_29;
if the intersection is an empty set, redefining Figure QLYQS_30 and repeating the selection of the Figure QLYQS_31 neighbor set and the inverse Figure QLYQS_32 neighbor set;
if the intersection is a non-empty set, the natural nearest neighbor set is Figure QLYQS_33; redefining Figure QLYQS_34 and repeatedly computing the natural nearest neighbor set Figure QLYQS_35 until the inverse Figure QLYQS_36 neighbor set of the core sample point no longer changes, thereby obtaining the natural nearest neighbor set of each core sample point and the natural nearest neighbor domain corresponding to each natural nearest neighbor set.
4. The unbalanced data oversampling method according to claim 3, wherein the proportion of the majority-class samples in each natural nearest neighbor set is calculated by:
Figure QLYQS_37
where Figure QLYQS_38 denotes the proportion of the majority-class samples in the Figure QLYQS_39 natural nearest neighbor set, Figure QLYQS_40 is the number of majority-class samples in the Figure QLYQS_41 natural nearest neighbor set, and Figure QLYQS_42 denotes the number of neighbor elements of the core sample point.
5. The unbalanced data oversampling method according to claim 4, wherein step 4 comprises:
based on the proportion of the majority-class samples in each natural nearest neighbor set:
if Figure QLYQS_43, then Figure QLYQS_44;
if Figure QLYQS_45, then Figure QLYQS_46;
if Figure QLYQS_47, then Figure QLYQS_48;
where Figure QLYQS_49 is the sample generation control weight of the core sample point and Figure QLYQS_50 is a control parameter, Figure QLYQS_51;
determining the spatial distribution of each core sample point in the unbalanced data set from the sample generation control weight Figure QLYQS_52.
6. The unbalanced data oversampling method according to claim 5, wherein the number weight Figure QLYQS_53 of the new samples generated in the natural nearest neighbor domain is:
Figure QLYQS_54
where Figure QLYQS_55 is the sample generation control weight of the core sample point, and Figure QLYQS_56 denotes the sum of the sample generation control weights of the core sample points in the Figure QLYQS_57 natural nearest neighbor domain.
7. The unbalanced data oversampling method according to claim 6, wherein the position weight of the new samples generated in the natural nearest neighbor domain is:
Figure QLYQS_58
where Figure QLYQS_59 is the sample generation control weight of the core sample point, and Figure QLYQS_60 denotes the sum of the sample generation control weights of the core sample points in the Figure QLYQS_61 natural nearest neighbor domain.
8. The unbalanced data oversampling method according to claim 7, wherein step 7 comprises:
determining the number of new samples to be generated in the unbalanced data set, expressed as:
Figure QLYQS_62
where Figure QLYQS_63 is a balance parameter controlling the number of new samples, Figure QLYQS_64;
calculating the number of new samples to be generated in each natural nearest neighbor domain, expressed as:
Figure QLYQS_65
for each natural nearest neighbor domain, generating the sample features of Figure QLYQS_66 new samples according to the regional sample generation formula:
Figure QLYQS_67
where Figure QLYQS_68 denotes the Figure QLYQS_70-th sample feature of the new sample point generated for Figure QLYQS_69, Figure QLYQS_71 denotes the sample feature difference between the core sample point and the other sample points in the natural nearest neighbor domain, and Figure QLYQS_72 is a random number in the range [0, 1];
obtaining a new sample Figure QLYQS_73 from the sample features of the new samples generated in each natural nearest neighbor domain, the new sample Figure QLYQS_74 being composed of Figure QLYQS_75 sample features;
combining the Figure QLYQS_76 new samples to obtain the new sample set Figure QLYQS_77;
merging the new sample set with the unbalanced data set to obtain a balanced data set.
9. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the unbalanced data oversampling method according to any one of claims 1 to 7.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the unbalanced data oversampling method according to any one of claims 1 to 7 when executing the computer program.
CN202310397766.7A 2023-04-14 2023-04-14 Unbalanced data oversampling method and related equipment Active CN116108387B (en)

Publications (2)

Publication Number Publication Date
CN116108387A true CN116108387A (en) 2023-05-12
CN116108387B CN116108387B (en) 2023-07-04


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868775A (en) * 2016-03-23 2016-08-17 深圳市颐通科技有限公司 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN110275910A (en) * 2019-06-20 2019-09-24 东北大学 A kind of oversampler method of unbalanced dataset
CN112633426A (en) * 2021-03-11 2021-04-09 腾讯科技(深圳)有限公司 Method and device for processing data class imbalance, electronic equipment and storage medium
KR20220007470A (en) * 2020-07-10 2022-01-18 박수환 A Design of a Location-based Fraud Detection System in Mobile Payment Service Device and Operation Method using Machine Learning Technique
CN114862404A (en) * 2022-05-05 2022-08-05 湖北工业大学 Credit card fraud detection method and device based on cluster samples and limit gradients
US20220383322A1 (en) * 2021-05-30 2022-12-01 Actimize Ltd. Clustering-based data selection for optimization of risk predictive machine learning models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant