CN112001425A - Data processing method and device and computer readable storage medium - Google Patents

Data processing method and device and computer readable storage medium

Info

Publication number
CN112001425A
CN112001425A CN202010743665.7A
Authority
CN
China
Prior art keywords
majority
samples
sample
dimension
minority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010743665.7A
Other languages
Chinese (zh)
Other versions
CN112001425B (en)
Inventor
马振伟
邹勇
林芃
孙浩然
肖鹰东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202010743665.7A priority Critical patent/CN112001425B/en
Publication of CN112001425A publication Critical patent/CN112001425A/en
Application granted granted Critical
Publication of CN112001425B publication Critical patent/CN112001425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a data processing method, a device, a system and a computer readable storage medium, wherein the method comprises the following steps: acquiring a training sample set, wherein the training sample set comprises M majority class samples and N minority class samples, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model. With this method, all information of the minority class samples is retained, and the down-sampled majority class samples are discretely distributed around the minority class samples, so that discriminative features are better preserved and a classification model with a more accurate classification effect can be trained.

Description

Data processing method and device and computer readable storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method and device and a computer readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In machine learning modeling there are many unbalanced data sets, that is, data sets in which the proportions of samples in different categories differ greatly. For example, in classified information recommendation, image processing, and transaction data analysis models, the proportion of abnormal samples may be only one in ten thousand, or even one in several hundred thousand.
Two methods are most common for dealing with unbalanced data: oversampling and undersampling. The former keeps all the majority class samples and randomly samples the minority class samples with replacement; the latter keeps all the minority class samples and randomly samples part of the majority class samples without replacement. Both aim to make the final class proportions less unbalanced. However, such random sampling may lose sample information and prevent the model from learning discriminative features, thereby degrading the model and, in turn, the accuracy of data processing.
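For orientation, the two baseline strategies can be sketched in a few lines (a minimal illustration using NumPy; the array shapes and names are ours, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(majority, minority):
    # keep all majority samples; resample the minority WITH replacement
    idx = rng.integers(0, len(minority), size=len(majority))
    return majority, minority[idx]

def random_undersample(majority, minority):
    # keep all minority samples; sample the majority WITHOUT replacement
    idx = rng.choice(len(majority), size=len(minority), replace=False)
    return majority[idx], minority

majority = rng.normal(0, 1, size=(1000, 4))  # M = 1000 "normal" samples
minority = rng.normal(3, 1, size=(10, 4))    # N = 10 "abnormal" samples
maj_u, min_u = random_undersample(majority, minority)
```

Both strategies balance the classes, but as the passage notes, the samples to discard (or duplicate) are chosen blindly, which is what motivates the patent's dispersion-aware down-sampling.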
Disclosure of Invention
In view of the above problems in the prior art, a data processing method, an apparatus and a computer-readable storage medium are provided, by which the above problems can be solved.
The present invention provides the following.
In a first aspect, a data processing method is provided, including: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model.
According to one possible embodiment, determining the m majority class samples discretely distributed around each minority class sample further comprises: for any one minority class sample, sampling a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determining a dispersion difference degree L between the m majority class samples contained in each majority class sample group; determining a distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determining a difference degree S_m = L / D for each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determining one of the candidate majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
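The claimed selection can be sketched end to end as follows. The dispersion function here is a stand-in (mean per-dimension standard deviation) for the patent's L; only the structure (score every candidate group by S_m = L / D and keep the best) follows the text:

```python
import numpy as np
from itertools import combinations

def dispersion(group):
    # stand-in for the dispersion difference degree L of a group
    return float(np.mean(np.std(group, axis=0)))

def select_group(minority_sample, majority, m):
    """Return indices of the m majority samples maximizing S_m = L / D."""
    best_score, best_idx = -np.inf, None
    for idx in combinations(range(len(majority)), m):
        group = majority[list(idx)]
        L = dispersion(group)                                      # dispersion L
        D = np.linalg.norm(group - minority_sample, axis=1).sum()  # distance D
        if L / D > best_score:
            best_score, best_idx = L / D, idx
    return best_idx

maj = np.array([[0., 0.], [0., 10.], [10., 0.], [10., 10.], [5., 5.]])
idx = select_group(np.array([5., 5.]), maj, m=3)
```

Enumerating every combination is the globally optimal variant described later; the locally optimal variant restricts the candidate groups.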
According to one possible embodiment, determining the dispersion difference degree L between the m majority class samples included in each majority class sample group comprises: the dimensional features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determining a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determining a dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determining the dispersion difference degree L between the m majority class samples according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible embodiment, determining the dispersion difference degree L_s for the numerical features of the n_s dimensions further comprises: according to the numerical features of the n_s dimensions of the M majority class samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining the distribution of the m majority class samples contained in each majority class sample group among the cells; determining, according to the distribution, the dispersion degree of the m majority class samples in each of the n_s dimensions; and synthesizing the dispersion degrees over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_s of the m majority class samples by a formula (rendered as an image in the original publication) in which n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining term denotes the number of the m majority class samples falling in each divided cell.
According to one possible embodiment, determining the dispersion difference degree L_f for the descriptive features of the n_f dimensions further comprises: for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining, according to the number of distinct elements, the dispersion degree of the m majority class samples in each of the n_f dimensions; and synthesizing the dispersion degrees over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_f of the m majority class samples by a formula (rendered as an image in the original publication) in which n_f is the dimension of the descriptive features and the remaining term denotes the number of distinct elements in the set formed by the descriptive features of the m majority class samples in the same dimension.
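The L_f formula is likewise an image in the source; under the stated description (count distinct elements per descriptive dimension), one assumed normalization is:

```python
def dispersion_descriptive(group_rows):
    """Assumed form of L_f: average, over the n_f descriptive dimensions,
    of the number of distinct elements divided by the group size m.

    group_rows: list of m tuples of descriptive values, e.g.
    [("weekday", "food"), ("weekend", "travel"), ...].
    """
    m = len(group_rows)
    n_f = len(group_rows[0])
    distinct = [len({row[t] for row in group_rows}) for t in range(n_f)]
    return sum(c / m for c in distinct) / n_f  # all identical -> 1/m, all distinct -> 1
```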
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the method further comprises: for each minority class sample, selecting any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determining the difference degree S_m of the C(M, m) majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the method further comprises: for any one minority class sample, calculating the distances between the M majority class samples and the minority class sample according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence; selecting q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence; determining the difference degree S_m of the q majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
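The locally optimal variant avoids enumerating all C(M, m) combinations: sort the majority samples by distance to the minority sample and score only q windows of m consecutive samples. A sketch (the dispersion function is again a stand-in):

```python
import numpy as np

def select_group_local(minority_sample, majority, m, q):
    """Score q groups of m distance-adjacent majority samples; keep the best."""
    d = np.linalg.norm(majority - minority_sample, axis=1)
    order = np.argsort(d)                    # the majority sample sequence
    best_score, best = -np.inf, None
    for start in range(min(q, len(majority) - m + 1)):
        idx = order[start:start + m]         # m mutually adjacent samples
        L = float(np.mean(np.std(majority[idx], axis=0)))  # stand-in for L
        score = L / d[idx].sum()             # S_m = L / D
        if score > best_score:
            best_score, best = score, idx
    return best

maj = np.arange(20, dtype=float).reshape(10, 2)
best = select_group_local(np.array([0., 0.]), maj, m=3, q=4)
```

This trades the exhaustive search's optimality for O(q) scored groups per minority sample.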
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
In a second aspect, a data processing apparatus is provided, including: an acquisition module, configured to acquire a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; a down-sampling module, configured to determine, according to the dimensional features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; a training module, configured to train a classification model according to the minority class samples and the down-sampled majority class samples; and a processing module, configured to process data according to the classification model.
According to one possible implementation, the down-sampling module is further configured to: for any one minority class sample, sample a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determine a dispersion difference degree L between the m majority class samples contained in each majority class sample group; determine a distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determine a difference degree S_m = L / D for each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determine one of the candidate majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
According to one possible implementation, the down-sampling module is further configured to: the dimensional features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determine a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determine a dispersion difference degree L_f; and determine the dispersion difference degree L according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible implementation, the down-sampling module is further configured to: according to the numerical features of the n_s dimensions of the M majority class samples, divide the value interval of each of the n_s dimensions into a plurality of cells; determine the distribution of the m majority class samples contained in each majority class sample group among the cells; determine, according to the distribution, the dispersion degree of the m majority class samples in each of the n_s dimensions; and synthesize the dispersion degrees over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_s of the m majority class samples by a formula (rendered as an image in the original publication) in which n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining term denotes the number of the m majority class samples falling in each divided cell.
According to one possible implementation, the down-sampling module is further configured to: for each of the n_f dimensions of the descriptive features, determine the number of distinct elements among the m majority class samples contained in each majority class sample group; determine, according to the number of distinct elements, the dispersion degree of the m majority class samples in each of the n_f dimensions; and synthesize the dispersion degrees over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_f of the m majority class samples by a formula (rendered as an image in the original publication) in which n_f is the dimension of the descriptive features and the remaining term denotes the number of distinct elements in the set formed by the descriptive features of the m majority class samples in the same dimension.
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the down-sampling module is further configured to: for each minority class sample, select any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determine the difference degree S_m of the C(M, m) majority class sample groups; and select the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the down-sampling module is further configured to: for any one minority class sample, calculate the distances between the M majority class samples and the minority class sample according to the numerical features; sort the M majority class samples by distance to obtain a majority class sample sequence; select q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence; determine the difference degree S_m of the q majority class sample groups; and select the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the apparatus further calculates the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
In a third aspect, a data processing apparatus is provided, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform: the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed by a multicore processor, causes the multicore processor to perform the method of the first aspect.
The embodiments of the application adopt at least one technical scheme that can achieve the following beneficial effects: all information of the minority class samples acquired from abnormal data is retained; down-sampling is performed on the majority class samples acquired from normal data; and, by using the information of each dimension of the samples, the down-sampled majority class samples are discretely distributed around the minority class samples, so that discriminative features can be better learned when training a classification data processing model (such as an information recommendation or image processing model).
It should be understood that the above description is only an overview of the technical solutions of the present invention, so as to clearly understand the technical means of the present invention, and thus can be implemented according to the content of the description. In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will be apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like elements throughout. In the drawings:
FIG. 1 is a flow chart illustrating a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 3 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the present invention, it is to be understood that terms such as "including" or "having," or the like, are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility of the presence of one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Those skilled in the art will appreciate that the described application scenario is only one example in which an embodiment of the present invention may be implemented. The scope of applicability of the embodiments of the present invention is not limited in any way. Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic flow chart diagram of a data processing method 100 for implementing optimized sampling of training samples according to an embodiment of the present application, in which, from a device perspective, an execution subject may be one or more electronic devices; from the program perspective, the execution main body may accordingly be a program loaded on these electronic devices.
As shown in fig. 1, the method 100 may include:
Step 101: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N;
Step 102: according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M;
Step 103: training the classification model according to the minority class samples and the down-sampled majority class samples.
Step 104: processing the data according to the classification model.
The classification model can be any of various classification models, such as a classified information recommendation model, an image classification model, or a transaction data analysis model. In an unbalanced training sample set, the numbers of training samples in different classes differ greatly, with the number of majority class samples far exceeding the number of minority class samples.
For example, in a training sample of an image classification model for lesion recognition, the number of samples of a majority class collected from normal data (such as image data of a healthy organ) is much larger than the number of samples of a minority class collected from abnormal data (such as image data of a diseased organ). For another example, in the training samples of the transaction data analysis model, the number of most types of samples obtained from normal data (such as normal transaction data) is much larger than the number of few types of samples obtained from abnormal data (such as fraudulent transaction data). Each sample contains a plurality of dimensional features such as numerical data, date data, category data, textual description data, and the like.
The present application classifies sample features into two categories: (1) numerical features, which have specific numerical values, can be used to quantify distances between samples after standardization, and can be used directly for model training after sampling; and (2) descriptive features, which are difficult to process numerically but can distinguish categories, such as date data, category data, and text description data.
In this embodiment, all minority class samples are retained, each dimensional feature of the samples is fully utilized, and for each minority class sample the m majority class samples that discretely surround it are selected, so that discriminative features can be better learned during model training. Specifically, the m majority class samples picked for each minority class sample should have the following characteristics: (1) they are as close as possible to the minority class sample, maintaining a certain similarity with it; (2) at the same distance from the minority class sample, they are as dispersed as possible, so as to retain more sample information; and (3) sample dispersion and sample distance are considered jointly to ensure stable sampling.
In a possible implementation, the step 102 may further include:
Step 201: for any one minority class sample, sampling a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples, and m is a positive integer smaller than M.
In a possible implementation manner, based on the globally optimal idea, step 201 may further include: for each minority class sample, selecting any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determining the difference degree S_m of the C(M, m) majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
In fact, m majority class samples are sampled for each minority class sample, yielding at most N x m majority class samples; since the same majority class sample may be selected by different minority class samples, the number of majority class samples finally obtained lies in the interval [m, N x m].
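The [m, N x m] bound follows because the final down-sampled set is the union of the N per-minority-sample selections; a toy illustration (the index values are hypothetical):

```python
def union_selected(selections):
    """selections: N index tuples, each of length m; return the deduplicated union."""
    return sorted({i for sel in selections for i in sel})

# N = 3 minority samples, m = 3 picks each; overlapping picks shrink the union
picked = union_selected([(0, 2, 5), (2, 5, 7), (0, 7, 9)])
```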
In one possible embodiment, the majority class sample X_i includes the numerical features (x_i^(1), ..., x_i^(n_s)) of n_s dimensions, and the minority class sample Y_j includes the numerical features (y_j^(1), ..., y_j^(n_s)) of n_s dimensions.
Based on the locally optimal idea, step 201 may further include: for any one minority class sample Y_j, calculating the distances between the M majority class samples and Y_j according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence (X_j1, X_j2, ..., X_jM); and selecting q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence. For example, the first group is (X_j1, X_j2, ..., X_jm), the second group is (X_j2, X_j3, ..., X_j(m+1)), and so on. Further, the difference degree S_m of the q majority class sample groups can be determined, and the group with the maximum difference degree can be selected as the m majority class samples discretely distributed around the minority class sample.
In one possible embodiment, the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j may be calculated by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
Step 202: determining the dispersion difference degree L between the m majority class samples included in each majority class sample group.
In one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions, and step 202 may further include: for the numerical features of the n_s dimensions, determining a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determining a dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determining the dispersion difference degree L according to L_s and/or L_f, for example L = alpha_1 * L_s + alpha_2 * L_f, where the weights alpha_1 and alpha_2 can be obtained from historical data.
In this embodiment, the discrete difference degree L between the m majority samples contained in each group of majority samples may be obtained from the numerical features, the descriptive features, or a combination of the two, and the choice may be adjusted according to the actual situation.
In one possible embodiment, determining the discrete difference degree L_s for the n_s-dimensional numerical features may further include: according to the n_s-dimensional numerical features of the M majority samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining how the m majority samples contained in each group of majority samples are distributed among the cells; determining, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions; and combining the degrees of dispersion over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
For example, for a minority sample Y_j, let the m majority samples of the group whose discrete difference degree is currently being calculated be (X_j1, X_j2, ..., X_jm), with numerical features:

(x_j1^(1), x_j1^(2), ..., x_j1^(n_s));
(x_j2^(1), x_j2^(2), ..., x_j2^(n_s));
…;
(x_jm^(1), x_jm^(2), ..., x_jm^(n_s)).
In fact, each dimension of the numerical features is a numerical value, and after normalization each dimension has a value interval. Considering the numerical features of all M majority samples, the value interval of each dimension is divided into a number of cells, and the distribution of the m sampled majority samples over these cells is observed to judge the degree of dispersion of the samples. Taking the first dimension of the numerical features as an example, its value interval is divided into k_1 cells. The first-dimension numerical features of the m majority samples, x_j1^(1), x_j2^(1), ..., x_jm^(1), fall into the k_1 cells with counts m_1^(1), m_2^(1), ..., m_k1^(1) respectively, satisfying:

m_1^(1) + m_2^(1) + ... + m_k1^(1) = m.

The degree of dispersion of (m_1^(1), ..., m_k1^(1)) over the k_1 cells can then be measured, for example, as the fraction of occupied cells:

l_s^(1) = |{ i : m_i^(1) > 0 }| / k_1.

The value of l_s^(1) lies in (0, 1]; when the samples are uniformly distributed over the k_1 cells, i.e. m_i^(1) = m/k_1 for every i, l_s^(1) takes its maximum value 1.
For the minority sample Y_j, the degree of dispersion of the numerical features of the m majority samples is then defined by averaging over the n_s dimensions:

L_s = (1/n_s) · ( l_s^(1) + l_s^(2) + ... + l_s^(n_s) ).

It can be seen that L_s takes values in (0, 1]; the more discrete the numerical features of the majority samples, the closer L_s is to 1.
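The exact per-dimension formula appears only as an equation image in the original publication; the sketch below assumes the fraction-of-occupied-cells measure described in the surrounding text (values in (0, 1], maximal when the m samples spread over all k_t cells, e.g. under a uniform distribution). The function names and the equal-width cell split are illustrative assumptions.

```python
def dimension_dispersion(values, lo, hi, k):
    """Per-dimension dispersion l_s^(t): split the value interval [lo, hi]
    into k equal cells and return the fraction of occupied cells.
    The result lies in (0, 1] and reaches 1 when every cell is occupied."""
    counts = [0] * k
    width = (hi - lo) / k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp v == hi into the last cell
        counts[idx] += 1
    return sum(1 for c in counts if c > 0) / k

def numerical_dispersion(samples, bounds, cells):
    """L_s: average of the per-dimension dispersions over the n_s dimensions."""
    n_s = len(bounds)
    return sum(
        dimension_dispersion([s[t] for s in samples], bounds[t][0], bounds[t][1], cells[t])
        for t in range(n_s)
    ) / n_s

# m = 4 one-dimensional samples spread over all 4 cells of [0, 1] -> L_s = 1.0
print(numerical_dispersion([[0.1], [0.3], [0.6], [0.9]], [(0.0, 1.0)], [4]))
# the same 4 samples bunched into one cell -> L_s = 0.25
print(numerical_dispersion([[0.01], [0.05], [0.12], [0.2]], [(0.0, 1.0)], [4]))
```

Any dispersion measure with the stated properties (range (0, 1], maximal for a uniform spread), such as a normalized entropy of the cell counts, could be substituted without changing the rest of the procedure.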
In one possible embodiment, determining the discrete difference degree L_f for the n_f-dimensional descriptive features may further include: for each of the n_f descriptive-feature dimensions, determining the number of distinct elements among the m majority samples contained in each group of majority samples; determining, according to the number of distinct elements, the degree of dispersion of the m majority samples in each of the n_f dimensions; and combining the degrees of dispersion over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
For example, for a minority sample Y_j, let the m majority samples of the group whose discrete difference degree is currently being calculated be (X_j1, X_j2, ..., X_jm), with descriptive features:

(c_j1^(1), c_j1^(2), ..., c_j1^(n_f));
(c_j2^(1), c_j2^(2), ..., c_j2^(n_f));
…;
(c_jm^(1), c_jm^(2), ..., c_jm^(n_f)).
In fact, each dimension of the descriptive information reflects an attribute of the sample, and for the descriptive information of a given dimension, the more discrete the attribute distribution of the samples, the more information the samples contain.
For example, let U{x_1, x_2, ..., x_n} denote the number of distinct elements in the set {x_1, x_2, ..., x_n}. The degree of dispersion of the m majority samples in the first dimension of the descriptive features can then be determined, for example, using the following formula:

l_f^(1) = U{ c_j1^(1), c_j2^(1), ..., c_jm^(1) } / m.

The value of l_f^(1) lies in (0, 1]; the more discrete the descriptive features of the first dimension, the closer l_f^(1) is to 1.
For the minority sample Y_j, the degree of dispersion of the descriptive features of the m majority samples is then defined by averaging over the n_f dimensions:

L_f = (1/n_f) · ( l_f^(1) + l_f^(2) + ... + l_f^(n_f) ).

As can be seen, L_f takes values in (0, 1]; the more discrete the descriptive information of the majority samples, the closer L_f is to 1.
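The distinct-element measure above translates directly into a short sketch. Tuples of strings stand in for the n_f-dimensional descriptive features (card number, merchant number, etc.); the function name is an illustrative assumption.

```python
def descriptive_dispersion(samples):
    """L_f: for each of the n_f descriptive dimensions, count the distinct
    elements U{c_j1^(t), ..., c_jm^(t)} among the m samples, divide by m,
    and average over the dimensions.  Values lie in (0, 1]; a column whose
    elements are all distinct contributes 1."""
    m = len(samples)
    n_f = len(samples[0])
    return sum(len({s[t] for s in samples}) / m for t in range(n_f)) / n_f

# m = 4 samples with (card number, merchant number) descriptive features:
# 4 distinct cards (4/4 = 1.0) and 2 distinct merchants (2/4 = 0.5)
rows = [("c1", "m1"), ("c2", "m1"), ("c3", "m2"), ("c4", "m2")]
print(descriptive_dispersion(rows))  # 0.75
```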
Step 203, determining the distance D between any one minority sample and the m majority samples included in each group of majority samples.
For example, consider the m majority samples (X_j(i+1), X_j(i+2), ..., X_j(i+m)) selected from the majority sample sequence (X_j1, X_j2, ..., X_jM). The distance between the minority sample Y_j and all m selected majority samples is determined as the weighted sum:

D = β · [ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ]

where β is a weight parameter.
Next, step 204 is executed to determine the difference degree of each group of majority samples according to the distance D and the discrete difference degree L: S_m = L/D.
In a possible implementation, step 204 may further include: determining, according to the numerical features, the sum of the Euclidean distances between any one minority sample and the m majority samples in each group of majority samples; and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of Euclidean distances.
Next, step 205 is executed: according to the difference degree S_m, one of the multiple groups of majority samples is determined as the m majority samples discretely distributed around the minority sample. For example, the group with the largest difference degree S_m is selected for down-sampling.
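Steps 201 through 205 for a single minority sample can be sketched end to end. This is a minimal illustration, assuming list-of-list feature vectors and a caller-supplied dispersion function; the patent combines L_s and L_f into L, while here a toy spread measure stands in for L.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pick_discrete_group(minority, majority, m, dispersion, beta=1.0):
    """Sort the majority samples by distance to the minority sample, slide
    a window of m adjacent samples over the sequence, score each window by
    S_m = L / D (L from the caller's dispersion function, D = beta times
    the sum of distances), and return the window with the largest S_m."""
    seq = sorted(majority, key=lambda x: euclid(x, minority))
    best_group, best_score = None, -1.0
    for start in range(len(seq) - m + 1):
        group = seq[start:start + m]
        L = dispersion(group)
        D = beta * sum(euclid(minority, x) for x in group)
        if L / D > best_score:
            best_group, best_score = group, L / D
    return best_group

# Toy run with 1-D samples; the stand-in dispersion is the spread
# (max - min) of the window, which favours widely scattered groups.
majority = [[1.0], [1.1], [2.0], [5.0], [9.0]]
chosen = pick_discrete_group([0.0], majority, m=3,
                             dispersion=lambda g: g[-1][0] - g[0][0])
print(chosen)
```

In a full implementation this function would be called once per minority sample, with L computed as α_1·L_s + α_2·L_f from the numerical and descriptive dispersion measures of the embodiment.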
In one exemplary embodiment, among transactions, normal transactions form the majority and may be recorded as majority samples, while fraudulent transactions are very rare and may be recorded as minority samples. All samples mainly contain two types of information: (1) descriptive features: card number, merchant number, date, terminal number, etc.; (2) numerical features: feature data obtained from the transaction message and historical transaction information. The number of majority samples is recorded as M and the number of minority samples as N, where M is far greater than N.
Based on this, the majority samples are recorded as

X_i = ( x_i^(1), ..., x_i^(n_s), c_i^(1), ..., c_i^(n_f) ), i = 1, 2, ..., M,

and the minority samples as

Y_j = ( y_j^(1), ..., y_j^(n_s), c_j^(1), ..., c_j^(n_f) ), j = 1, 2, ..., N,

where (x_i^(1), ..., x_i^(n_s)) and (y_j^(1), ..., y_j^(n_s)) are the normalized n_s-dimensional numerical feature vectors, and (c_i^(1), ..., c_i^(n_f)) and (c_j^(1), ..., c_j^(n_f)) are the n_f-dimensional descriptive feature vectors. Each dimension of a sample's numerical features is a numerical value computed from the transaction message, including features such as the amount, the average transaction amount, the transaction period interval, and the number of transactions, as well as some combined numerical features. The Euclidean distance between any one majority sample and any one minority sample can then be calculated using the following formula:

d_ij = sqrt( Σ_{t=1..n_s} ( x_i^(t) − y_j^(t) )² ).
further, a majority of sample diversity functions may be constructed using the numerical and descriptive characteristics of the samples.
For the numerical features of the samples, take the amount, the average transaction amount, the transaction period interval, and the number of transactions as examples. For a minority sample Y_j, the degree of dispersion of the numerical features of the m majority samples taken from the majority sample sequence X_j1, X_j2, X_j3, ..., X_jM is calculated as follows. (1) Let amt_ji denote the transaction amount of a sample; its value interval is divided into k_1 cells, with m_1^(1), ..., m_k1^(1) samples falling into the cells. (2) Let avg_amt_ji denote the average transaction amount of a sample; its value interval is divided into k_2 cells, with counts m_1^(2), ..., m_k2^(2). (3) Let t_ji denote the transaction period interval of a sample; its value interval is divided into k_3 cells, with counts m_1^(3), ..., m_k3^(3). (4) Let nt_ji denote the number of transactions of a sample; its value interval is divided into k_4 cells, with counts m_1^(4), ..., m_k4^(4). That is, the numerical features of sample X_ji are expressed as:

( amt_ji, avg_amt_ji, t_ji, nt_ji ).
Then, for the numerical features, the degree of dispersion of the majority samples can be expressed as:

L_s = (1/4) · ( l_s^(1) + l_s^(2) + l_s^(3) + l_s^(4) )

where l_s^(t) denotes the degree of dispersion of the m samples over the k_t cells of the t-th numerical dimension.
For the descriptive features of the samples, take the card number, the merchant number, and the transaction date as examples. For a minority sample Y_j, the degree of dispersion of the descriptive features of the m majority samples in X_j1, X_j2, X_j3, ..., X_jM is calculated. Let C_ji denote the card number information of sample X_ji, M_ji the merchant information of sample X_ji, and T_ji the time information of sample X_ji; that is, the descriptive features of sample X_ji are expressed as:

( C_ji, M_ji, T_ji ).
The degree of dispersion of the descriptive features of the majority samples can then be expressed as:

L_f = (1/3) · ( U{C_j1, ..., C_jm} + U{M_j1, ..., M_jm} + U{T_j1, ..., T_jm} ) / m.
Combining the numerical features and the descriptive features, a degree-of-dispersion function of the majority samples is constructed:

L = α_1·L_s + α_2·L_f.
Further, consider the distance between the m majority samples selected from X_j1, X_j2, X_j3, ..., X_jM and the minority sample Y_j: D = β·[ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ], i.e. β times the sum of the distances between all m selected majority samples and Y_j.
In summary, based on the weight combination, the difference degree function may be:

S_m = ( α_1·L_s + α_2·L_f ) / ( β·[ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ] ).

Optionally, the weight parameters may be taken as α_1 = α_2 = β = 1 to simplify the calculation, i.e. the difference degree function becomes:

S_m = ( L_s + L_f ) / [ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ].
Optionally, for a minority sample Y_j, the M majority samples can be sorted from near to far by Euclidean distance as X_j1, X_j2, X_j3, ..., X_jM; further, groups of m majority samples are selected from this sequence: the first group is X_j1, X_j2, ..., X_jm; the second group is X_j2, X_j3, ..., X_j(m+1); and so on, yielding multiple groups of majority samples.
The difference degree of the m majority samples in each group can then be calculated separately, and the group with the largest difference degree S_m is selected as the final m majority samples.
In fact, for all N minority samples, the total number of majority samples sampled from the M majority samples lies between m and N×m. According to experience in counterfeit-card model training, a majority-to-minority sample ratio of 1000:1 is generally selected, that is, 1000 majority samples are sampled for each minority sample. Finally, model training is performed using the sampled majority samples and all minority samples.
In another exemplary embodiment, the technical effect of the data processing method of the present application is described by example. For a financial institution, 90 days of data from the magnetic-stripe-card online environment in April through July 2019 yield an average of 175,538 transactions per day and 422 fraudulent transactions in total, a majority/minority sample ratio of 37,437:1. The majority samples are down-sampled and the model trained according to the scheme of the above embodiment. All majority samples may first be down-sampled at a 50:1 sampling ratio, sampling 3,510 transactions per day. Specifically, assuming each sample has 500-dimensional numerical features including 21 dimensions of descriptive information, 21 numerical feature dimensions (e.g. 2 merchant statistical features, 3 card-dimension statistical features, 6 real-time statistical features, and 10 historical statistical features) are selected in consideration of factors such as the saturation and interval distribution of each dimension, and 4 combined dimensions are constructed from 2 dimensions of descriptive information (e.g. merchant number and total transaction amount), to perform the down-sampling process of the above embodiment.
(1) Sample information saturation: referring to Table 1, of the 21 dimensional features obtained by the down-sampling scheme of the present application, 19 have higher saturation than under the random sampling scheme, at most 40.7% higher and with 7 of them more than 30% higher, while two have saturation slightly lower than under the random sampling scheme. Of all 500 dimensions, 436 have higher saturation than random sampling.
Table 1: (the table is reproduced only as an image in the original publication)
(2) Sample feature segment distribution: referring to Table 2, of the 21 dimensional features obtained by the down-sampling scheme of the present application, 20 have segment-distribution differences smaller than those of the random sampling scheme, reduced by at most 37.5%, with 4 of them reduced by more than 30%; 1 has a segment-distribution difference slightly larger than that of the random sampling scheme.
Table 2: (the table is reproduced only as an image in the original publication)
(3) Descriptive information combination ratio: the assumed descriptive information combinations are: merchant number + total transaction amount within 5 minutes, merchant number + total transaction amount within 15 minutes, merchant number + total transaction amount within 120 minutes, and merchant number + total transaction amount within 1 day. Referring to Table 3, the degree of dispersion of the descriptive information sampled by the down-sampling scheme of the present application is greater than that of the random sampling scheme.
Table 3:
index (I) Down sampling scheme Random sampling scheme Difference of difference
Merchant number +5min total transaction amount 76.5% 74.8% 1.7%
Merchant number +15min total transaction amount 78.7% 76.1% 2.6%
Merchant number +120min total transaction amount 81.8% 79.0% 2.8%
(4) Model training effect: the data of April, May, and June 2019 (383 minority samples) are fed into an xgboost model with the following parameters: max_depth: 3; eta: 0.01; min_child_weight: 6; gamma: 0.1; lambda: 10; subsample: 0.8.
With the same parameters as above, see Table 4; the performance of the models on the training set is as follows:
Table 4:

Scheme | train-auc | val-auc | trees
Down-sampling scheme | 0.941778 | 0.891961 | 568
Random sampling scheme | 0.936219 | 0.891842 | 515
Referring to Table 5, the extrapolated validation effect on the July data (38 minority samples) is:
Table 5:

Recall rate | Accuracy rate (down-sampling scheme) | Accuracy rate (random sampling scheme)
5.1% | 28.6% | 33.3%
10.3% | 40.0% | 19.0%
15.4% | 10.0% | 14.3%
Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus, configured to execute the data processing method provided in any of the above embodiments. Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 includes:
an obtaining module 301, configured to obtain a training sample set, where the training sample set includes M majority samples obtained from normal data and N minority samples obtained from abnormal data, M and N are positive integers, and M is greater than N;
a down-sampling module 302, configured to determine m majority samples discretely distributed around each minority sample according to the dimensional features of the minority samples and the majority samples, so as to down-sample the majority samples, where m is smaller than M and is a positive integer;
and the training module 303 is configured to train the classification model according to the minority class samples and the down-sampled majority class samples.
And the processing module 304 is used for processing the data according to the classification model.
According to one possible implementation, the down-sampling module 302 is further configured to: sample multiple groups of majority samples from the M majority samples for any one minority sample, wherein each group of majority samples contains m majority samples; determine the discrete difference degree L between the m majority samples contained in each group of majority samples; determine the distance D between the minority sample and the m majority samples contained in each group of majority samples; determine the difference degree S_m = L/D of each group of majority samples according to the distance D and the discrete difference degree L; and determine, according to the difference degree S_m, one of the multiple groups of majority samples as the m majority samples discretely distributed around the minority sample.
According to one possible implementation, where the majority samples and the minority samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions, the down-sampling module 302 is further configured to: determine, for the n_s-dimensional numerical features, the discrete difference degree L_s between the m majority samples contained in each group of majority samples; and/or determine, for the n_f-dimensional descriptive features, the discrete difference degree L_f between the m majority samples contained in each group of majority samples; and determine the discrete difference degree L between the m majority samples contained in each group of majority samples according to L_s and/or L_f.
According to one possible implementation, the down-sampling module 302 is further configured to: according to the n_s-dimensional numerical features of the M majority samples, divide the value interval of each of the n_s dimensions into a plurality of cells; determine how the m majority samples contained in each group of majority samples are distributed among the cells; determine, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions; and combine the degrees of dispersion over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
According to one possible implementation, the down-sampling module 302 is further configured to determine the discrete difference degree L_s of the m majority samples using the following formula:

L_s = (1/n_s) · Σ_{t=1..n_s} |{ i : m_i^(t) > 0 }| / k_t

where n_s is the dimension of the numerical features, k_t is the number of cells into which dimension t of the n_s dimensions is divided, and m_i^(t) is the number of the m majority samples falling in the i-th divided cell of dimension t.
According to one possible implementation, the down-sampling module 302 is further configured to: for each of the n_f descriptive-feature dimensions, determine the number of distinct elements among the m majority samples contained in each group of majority samples; determine, according to the number of distinct elements, the degree of dispersion of the m majority samples in each of the n_f dimensions; and combine the degrees of dispersion over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
According to one possible implementation, the down-sampling module 302 is further configured to determine the discrete difference degree L_f of the m majority samples using the following formula:

L_f = (1/n_f) · Σ_{t=1..n_f} U{ c_j1^(t), c_j2^(t), ..., c_jm^(t) } / m

where n_f is the dimension of the descriptive features, and U{ c_j1^(t), ..., c_jm^(t) } denotes the number of distinct elements in the set { c_j1^(t), ..., c_jm^(t) }, i.e. the set of t-th-dimension descriptive features of the m majority samples.
According to one possible implementation, determining the difference degree of each group of majority samples according to the distance and the discrete difference degree includes: determining, according to the numerical features, the sum of the Euclidean distances between any one minority sample and the m majority samples in each group of majority samples; and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for each minority sample, select any m majority samples from all M majority samples as a group of majority samples, obtaining C(M, m) groups of majority samples; determine the difference degree S_m of each of the C(M, m) groups; and select the group with the largest difference degree as the sampling result corresponding to the minority sample.
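The exhaustive variant above, which scores all C(M, m) candidate groups, can be sketched with `itertools.combinations`. The difference-degree function is supplied by the caller; the toy function below is an illustrative stand-in for S_m = L/D, not the patent's exact formula.

```python
from itertools import combinations

def pick_group_exhaustive(minority, majority, m, difference):
    """Enumerate all C(M, m) groups of m majority samples and return the
    group with the largest difference degree S_m, as computed by the
    caller-supplied difference(minority, group) function."""
    return max(combinations(majority, m),
               key=lambda group: difference(minority, group))

# Toy difference degree for 1-D samples: spread of the group divided by
# its total distance to the minority sample (a stand-in for S_m = L / D).
def toy_difference(y, group):
    vals = sorted(g[0] for g in group)
    return (vals[-1] - vals[0]) / sum(abs(g[0] - y[0]) for g in group)

majority = [[1.0], [1.2], [4.0], [8.0]]
best = pick_group_exhaustive([0.0], majority, 2, toy_difference)
print(sorted(best))  # [[1.0], [8.0]]
```

Because C(M, m) grows combinatorially, this exhaustive search is practical only for small M; the sorted-sequence embodiment, which restricts the candidates to q windows of adjacent samples, is the scalable alternative.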
According to one possible embodiment, the majority samples and the minority samples comprise n_s-dimensional numerical features, and the apparatus is further configured to: for any one minority sample, calculate the distance between each of the M majority samples and the minority sample according to the numerical features; sort the M majority samples by distance to obtain a majority sample sequence; select q groups of majority samples from the majority sample sequence, where each group contains m majority samples that are adjacent to each other in the sequence; determine the difference degree S_m of each of the q groups; and select the group with the largest difference degree as the sampling result corresponding to the minority sample.
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j using the following formula:

d_ij = sqrt( Σ_{t=1..n_s} ( x_i^(t) − y_j^(t) )² )

where the majority sample X_i includes the n_s-dimensional numerical features (x_i^(1), ..., x_i^(n_s)), the minority sample Y_j includes the n_s-dimensional numerical features (y_j^(1), ..., y_j^(n_s)), i = 1, 2, ..., M, and j = 1, 2, ..., N.
It should be noted that the data processing apparatus in the embodiment of the present application may implement each process of the foregoing embodiment of the data processing method, and achieve the same effect and function, which is not described herein again.
Fig. 4 is a data processing apparatus according to an embodiment of the present application, configured to execute the data processing method shown in fig. 1. The apparatus includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the data processing method shown in the above embodiments.
According to some embodiments of the present application, there is provided a non-volatile computer storage medium for a data processing method, having stored thereon computer-executable instructions configured to, when executed by a processor, perform the data processing method illustrated in the above embodiments.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device, and computer-readable storage medium embodiments, the description is simplified because they are substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for their relevance.
The apparatus, the device, and the computer-readable storage medium provided in the embodiment of the present application correspond to the method one to one, and therefore, the apparatus, the device, and the computer-readable storage medium also have advantageous technical effects similar to those of the corresponding method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (24)

1. A data processing method, comprising:
acquiring a training sample set, wherein the training sample set comprises M majority samples acquired according to normal data and N minority samples acquired according to abnormal data, M and N are positive integers, and M is greater than N;
determining, according to the dimensional features of the minority samples and the majority samples, m majority samples discretely distributed around each minority sample, so as to down-sample the majority samples, wherein m is smaller than M and is a positive integer;
training a classification model according to the minority samples and the majority samples after down-sampling;
and processing the data according to the classification model.
2. The method of claim 1, wherein determining m majority samples discretely distributed around each minority sample further comprises:
sampling multiple groups of majority samples from the M majority samples for any one of the minority samples, wherein each group of majority samples comprises m majority samples;

determining a discrete difference degree L between the m majority samples contained in each group of majority samples;

determining a distance D between the any one minority sample and the m majority samples contained in each group of majority samples;

determining a difference degree S_m = L/D of each group of majority samples according to the distance D and the discrete difference degree L;

determining, according to the difference degree S_m, one of the multiple groups of majority samples as the m majority samples discretely distributed around the any one minority sample.
3. The method of claim 2, wherein determining the degree of discrete difference L between the m majority samples included in each set of majority samples comprises:
each dimensional feature of the majority class samples and the minority class samples comprises numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, determining a degree of discrete difference L_s between the m majority samples contained in each group of majority samples; and/or,
for the descriptive features of the n_f dimensions, determining a degree of discrete difference L_f between the m majority samples contained in each group of majority samples; and,
determining the degree of discrete difference L between the m majority samples contained in each group of majority samples according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
4. The method of claim 3, wherein, for the numerical features of the n_s dimensions, determining the degree of discrete difference L_s between the m majority samples contained in each group of majority samples further comprises:
dividing, on each of the n_s dimensions, the value interval into a plurality of cells according to the numerical features of the n_s dimensions of the M majority samples;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, a degree of dispersion of the m majority samples in each of the n_s dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority samples.
5. The method of claim 4, further comprising:
determining the degree of discrete difference L_s of the m majority samples by using the following formula:
[formula image FDA0002607573490000021 in the original document]
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the quantity denoted by image FDA0002607573490000022 is the number of the m majority samples falling in each divided cell.
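A hedged sketch of the cell-based dispersion of claims 4 and 5 (the patent's exact formula appears only as an image in the source, so the cell-occupancy ratio below is an illustrative stand-in, not the claimed formula):

```python
def cell_dispersion(group, all_majority, k):
    # For each of the n_s numeric dimensions, split the value range observed
    # over all M majority samples into k equal cells (claim 4), score the
    # group's dispersion in that dimension as the fraction of distinct cells
    # its members occupy, and average over the dimensions (claim 5).
    n_s = len(all_majority[0])
    score = 0.0
    for t in range(n_s):
        lo = min(x[t] for x in all_majority)
        hi = max(x[t] for x in all_majority)
        width = (hi - lo) / k or 1.0  # guard against a degenerate dimension
        cells = {min(int((x[t] - lo) / width), k - 1) for x in group}
        score += len(cells) / k
    return score / n_s
```

A group whose members fall into many different cells scores higher, matching the claims' preference for majority samples that are discretely distributed.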
6. The method of claim 3, wherein, for the descriptive features of the n_f dimensions, determining the degree of discrete difference L_f between the m majority samples contained in each group of majority samples further comprises:
for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority samples contained in each group of majority samples;
determining, according to the number of distinct elements, a degree of dispersion of the m majority samples in each of the n_f dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority samples.
7. The method of claim 6, further comprising:
determining the degree of discrete difference L_f of the m majority samples by using the following formula:
[formula image FDA0002607573490000031 in the original document]
wherein n_f is the dimension of the descriptive features; the term denoted by image FDA0002607573490000032 represents the number of distinct elements in the set denoted by image FDA0002607573490000033; and the set denoted by image FDA0002607573490000034 represents the set of descriptive features of the m majority samples in the same dimension.
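The descriptive-feature dispersion of claims 6 and 7 counts distinct values per categorical dimension; since the claimed combining formula is only an image in the source, the sketch below averages the distinct-value counts as an assumed combination rule:

```python
def descriptive_dispersion(group, n_f):
    # For each of the n_f descriptive (categorical) dimensions, count how
    # many distinct values the m samples in the group take, then average
    # the counts over the dimensions.
    total = 0
    for t in range(n_f):
        total += len({sample[t] for sample in group})
    return total / n_f
```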
8. The method of claim 5, wherein determining the degree of difference of each group of majority samples according to the distance and the degree of discrete difference comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples; and
determining the degree of difference of each group of majority samples as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
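Claim 8's difference degree is a weighted ratio of the dispersion L to the summed Euclidean distances; a minimal sketch (the weight `w` is an assumption, as the claim does not fix its value):

```python
def weighted_difference(L, distances, w=1.0):
    # Claim 8: S = w * L / (sum of Euclidean distances from the minority
    # sample to the m majority samples of the group).
    return w * L / sum(distances)
```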
9. The method of claim 2, further comprising:
for each minority sample, selecting any m majority samples from all the M majority samples as a group of majority samples, to obtain C(M, m) groups of majority samples (the binomial coefficient, shown as image FDA0002607573490000035 in the original document); and
determining the degrees of difference S_m of the C(M, m) groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to that minority sample.
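Claim 9's exhaustive sampling enumerates every way of choosing m of the M majority samples, i.e. C(M, m) candidate groups; a quick check of that count:

```python
import itertools
import math

def enumerate_groups(M, m):
    # Each combination of m indices out of M is one candidate majority
    # sample group; itertools.combinations yields all C(M, m) of them.
    return list(itertools.combinations(range(M), m))

groups = enumerate_groups(6, 3)
```

For realistic M this count grows combinatorially, which is why claim 10 offers a cheaper distance-sorted alternative.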
10. The method of claim 2, wherein the majority class samples and the minority class samples comprise numerical features of n_s dimensions, the method further comprising:
for any one minority sample, calculating the distances between the M majority samples and the any one minority sample according to the numerical features;
sorting the M majority samples according to the distances to obtain a majority sample sequence;
selecting q groups of majority samples from the majority sample sequence, wherein each group of majority samples comprises m majority samples that are adjacent to each other in the majority sample sequence; and
determining the degrees of difference S_m of the q groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to the any one minority sample.
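Claim 10's cheaper candidate generation can be sketched as follows (starting the q windows from the nearest samples is an assumption; the claim only requires the m samples of each group to be adjacent in the sorted sequence):

```python
import math

def window_candidates(majority, y, m, q):
    # Sort the M majority samples by Euclidean distance to the minority
    # sample y, then take q groups of m samples that are adjacent in the
    # sorted sequence; their S_m values are compared afterwards.
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    order = sorted(majority, key=dist)
    return [order[i:i + m] for i in range(q)]
```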
11. The method of claim 8 or 10, further comprising:
calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:
d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)^2 )
wherein the majority sample X_i comprises the n_s-dimensional numerical features (x_i^1, x_i^2, ..., x_i^{n_s}), and the minority sample Y_j comprises the n_s-dimensional numerical features (y_j^1, y_j^2, ..., y_j^{n_s}).
12. A data processing apparatus, comprising:
an obtaining module, configured to obtain a training sample set, where the training sample set includes M majority samples obtained according to normal data and N minority samples obtained according to abnormal data, where M, N is a positive integer, and M is greater than N;
a down-sampling module, configured to determine m majority samples discretely distributed around each minority sample according to each dimensional feature of the minority samples and the majority samples, so as to down-sample the majority samples, wherein m is smaller than M and is a positive integer;
the training module is used for training a classification model according to the minority class samples and the majority class samples after down-sampling;
and the processing module is used for processing the data according to the classification model.
13. The apparatus of claim 12, wherein the downsampling module is further configured to:
for any one minority sample, sampling a plurality of groups of majority samples from the M majority samples, wherein each group of majority samples comprises m majority samples;
determining a degree of discrete difference L between the m majority samples contained in each group of majority samples;
determining a distance D between the any one minority sample and the m majority samples contained in each group of majority samples;
determining a degree of difference S_m = L/D for each group of majority samples according to the distance D and the degree of discrete difference L; and
according to the degree of difference S_m, determining one of the plurality of groups of majority samples as the m majority samples discretely distributed around the any one minority sample.
14. The apparatus of claim 13, wherein the downsampling module is further configured to:
each dimensional feature of the majority class samples and the minority class samples comprises numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, determining a degree of discrete difference L_s between the m majority samples contained in each group of majority samples; and/or,
for the descriptive features of the n_f dimensions, determining a degree of discrete difference L_f between the m majority samples contained in each group of majority samples; and,
determining the degree of discrete difference L between the m majority samples contained in each group of majority samples according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
15. The apparatus of claim 14, wherein the downsampling module is further configured to:
dividing, on each of the n_s dimensions, the value interval into a plurality of cells according to the numerical features of the n_s dimensions of the M majority samples;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, a degree of dispersion of the m majority samples in each of the n_s dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority samples.
16. The apparatus of claim 15, wherein the downsampling module is further configured to:
determining the degree of discrete difference L_s of the m majority samples by using the following formula:
[formula image FDA0002607573490000061 in the original document]
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the quantity denoted by image FDA0002607573490000062 is the number of the m majority samples falling in each divided cell.
17. The apparatus of claim 14, wherein the downsampling module is further configured to:
for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority samples contained in each group of majority samples;
determining, according to the number of distinct elements, a degree of dispersion of the m majority samples in each of the n_f dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority samples.
18. The apparatus of claim 17, wherein the downsampling module is further configured to:
determining the degree of discrete difference L_f of the m majority samples by using the following formula:
[formula image FDA0002607573490000063 in the original document]
wherein n_f is the dimension of the descriptive features; the term denoted by image FDA0002607573490000064 represents the number of distinct elements in the set denoted by image FDA0002607573490000065; and the set denoted by image FDA0002607573490000066 represents the set of descriptive features of the m majority samples in the same dimension.
19. The apparatus of claim 13, wherein determining the degree of difference of each group of majority samples according to the distance and the degree of discrete difference comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples; and
determining the degree of difference of each group of majority samples as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
20. The apparatus of claim 13, further comprising:
for each minority sample, selecting any m majority samples from all the M majority samples as a group of majority samples, to obtain C(M, m) groups of majority samples (the binomial coefficient, shown as image FDA0002607573490000071 in the original document); and
determining the degrees of difference S_m of the C(M, m) groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to that minority sample.
21. The apparatus of claim 13, wherein the majority class samples and the minority class samples comprise numerical features of n_s dimensions, the apparatus further configured to:
for any one minority sample, calculate the distances between the M majority samples and the any one minority sample according to the numerical features;
sort the M majority samples according to the distances to obtain a majority sample sequence;
select q groups of majority samples from the majority sample sequence, wherein each group of majority samples comprises m majority samples that are adjacent to each other in the majority sample sequence; and
determine the degrees of difference S_m of the q groups of majority samples, and select the group of majority samples with the largest degree of difference as the sampling result corresponding to the any one minority sample.
22. The apparatus of claim 19 or 21, further configured to:
calculate the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:
d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)^2 )
wherein the majority sample X_i comprises the n_s-dimensional numerical features (x_i^1, x_i^2, ..., x_i^{n_s}), and the minority sample Y_j comprises the n_s-dimensional numerical features (y_j^1, y_j^2, ..., y_j^{n_s}).
23. A data processing apparatus, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed, cause the at least one processor to perform the method of any one of claims 1-11.
24. A computer-readable storage medium storing a program that, when executed by a multi-core processor, causes the multi-core processor to perform the method of any of claims 1-11.
CN202010743665.7A 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium Active CN112001425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010743665.7A CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112001425A true CN112001425A (en) 2020-11-27
CN112001425B CN112001425B (en) 2024-05-03

Family

ID=73464171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010743665.7A Active CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112001425B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066540A (en) * 2021-03-19 2021-07-02 新疆大学 Method for preprocessing non-equilibrium fault sample of oil-immersed transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032450A1 (en) * 2012-07-30 2014-01-30 Choudur Lakshminarayan Classifying unclassified samples
CN103645249A (en) * 2013-11-27 2014-03-19 国网黑龙江省电力有限公司 Online fault detection method for reduced set-based downsampling unbalance SVM (Support Vector Machine) transformer
US20150088791A1 (en) * 2013-09-24 2015-03-26 International Business Machines Corporation Generating data from imbalanced training data sets
US20170372222A1 (en) * 2016-06-24 2017-12-28 Varvara Kollia Technologies for detection of minority events
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109978009A (en) * 2019-02-27 2019-07-05 广州杰赛科技股份有限公司 Behavior classification method, device and storage medium based on wearable intelligent equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI-CHAO LIN et al.: "Clustering-based undersampling in class-imbalanced data", Information Sciences, 31 October 2017, pages 17-26 *
YU Yanli; JIANG Kaizhong; WANG Ke; SHENG Jingwen: "An undersampling algorithm for imbalanced data based on improved K-means clustering" (改进K均值聚类的不平衡数据欠采样算法), Software Guide (软件导刊), no. 06, 15 June 2020, pages 211-215 *


Also Published As

Publication number Publication date
CN112001425B (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant