CN112001425A - Data processing method and device and computer readable storage medium
- Publication number
- CN112001425A (application CN202010743665.7A)
- Authority
- CN
- China
- Prior art keywords
- majority
- samples
- sample
- dimension
- minority
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
Abstract
The invention provides a data processing method, a device, a system and a computer readable storage medium, wherein the method comprises the following steps: acquiring a training sample set, wherein the training sample set comprises M majority class samples and N minority class samples, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples which are discretely distributed around each minority class sample so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model. With this method, all information of the minority class samples is retained, and the down-sampled majority class samples are discretely distributed around the minority class samples, so that discriminative features are better preserved and a classification model with a more accurate classification effect can be trained.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method and device and a computer readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In machine learning modeling, unbalanced data sets are common, that is, the proportions of samples in different categories differ greatly; for example, in classified information recommendation, image processing, and transaction data analysis models, the proportion of abnormal samples may be only one in ten thousand, or even one in several hundred thousand.
There are two most common methods for dealing with unbalanced data: oversampling and undersampling. The former keeps all the majority class samples and randomly samples the minority class samples with replacement; the latter keeps all the minority class samples and randomly samples part of the majority class samples without replacement. Both aim to make the final sample classes less unbalanced. However, such random sampling may lose sample information and prevent the model from learning discriminative features, thereby degrading the model effect and, in turn, the accuracy of data processing.
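To make the two conventional strategies concrete, the following sketch (not part of the patent; the sample data, function names, and fixed seed are illustrative assumptions) shows random undersampling and oversampling:

```python
import random

def random_undersample(majority, minority, seed=0):
    # Keep all minority samples; draw len(minority) majority samples
    # without replacement, discarding the rest.
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return kept_majority + minority

def random_oversample(majority, minority, seed=0):
    # Keep all majority samples; draw len(majority) minority samples
    # with replacement, so some minority samples are repeated.
    rng = random.Random(seed)
    repeated_minority = [rng.choice(minority) for _ in range(len(majority))]
    return majority + repeated_minority

majority = [[x, 0] for x in range(100)]  # 100 "normal" samples
minority = [[x, 1] for x in range(5)]    # 5 "abnormal" samples

balanced_under = random_undersample(majority, minority)
balanced_over = random_oversample(majority, minority)
```

Undersampling discards 95 of the 100 majority samples (the information loss the patent targets), while oversampling repeats minority samples.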
Disclosure of Invention
In view of the above problems in the prior art, a data processing method, an apparatus and a computer-readable storage medium are provided, by which the above problems can be solved.
The present invention provides the following.
In a first aspect, a data processing method is provided, including: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples which are discretely distributed around each minority class sample so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model.
According to one possible embodiment, determining the m majority class samples discretely distributed around each minority class sample further comprises: for any one minority class sample, sampling a plurality of combined majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determining the dispersion difference degree L between the m majority class samples contained in each majority class sample group; determining the distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determining the difference degree S_m = L/D of each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determining one of the plurality of combined majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
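As an illustrative sketch (an assumption for explanation, not the patent's exact formulas), the selection of a group maximizing S_m = L/D can be enumerated as follows, with a simple sum of pairwise distances standing in for the dispersion difference degree L:

```python
from itertools import combinations
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_dispersion(group):
    # Simplified stand-in for the patent's dispersion difference degree L:
    # the sum of pairwise Euclidean distances within the group.
    return sum(euclidean(p, q) for p, q in combinations(group, 2))

def best_group(majority, minority_sample, m):
    # Enumerate all C(M, m) candidate groups and keep the one maximizing
    # S_m = L / D: dispersed among themselves, close to the minority sample.
    best, best_score = None, -1.0
    for group in combinations(majority, m):
        L = pairwise_dispersion(group)
        D = sum(euclidean(x, minority_sample) for x in group)
        score = L / D if D > 0 else float("inf")
        if score > best_score:
            best, best_score = list(group), score
    return best, best_score

majority = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (0.0, 2.0)]
minority_sample = (0.5, 0.5)
group, score = best_group(majority, minority_sample, m=3)
```

The ratio rewards groups that spread around the minority sample rather than bunching together on one side.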
According to one possible embodiment, determining the dispersion difference degree L between the m majority class samples contained in each majority class sample group comprises: the dimensional features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determining the dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determining the dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determining the dispersion difference degree L between the m majority class samples contained in each majority class sample group according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible embodiment, determining the dispersion difference degree L_s for the numerical features of the n_s dimensions further comprises: according to the numerical features of the n_s dimensions of the M majority class samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining the distribution of the m majority class samples contained in each majority class sample group among the plurality of cells; determining the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and synthesizing the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_s of the m majority class samples by using the following formula:
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining symbol denotes the number of the m majority class samples falling in each divided cell.
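The exact cell-based formula is presented only as an image in the source and is not reproduced above; the following is a hedged sketch of the idea it describes (the occupancy measure and helper below are assumptions, not the patent's formula):

```python
def cell_dispersion(all_majority, group, k=4):
    # Partition each numerical dimension's value range (taken over all M
    # majority samples) into k equal cells, then average the fraction of
    # cells occupied by the m group samples over the n_s dimensions.
    n_s = len(all_majority[0])
    total = 0.0
    for t in range(n_s):
        values = [x[t] for x in all_majority]
        lo, hi = min(values), max(values)
        width = (hi - lo) / k or 1.0  # guard against a constant dimension
        occupied = {min(int((x[t] - lo) / width), k - 1) for x in group}
        total += len(occupied) / k
    return total / n_s

majority = [(0.0, 0.0), (0.5, 0.2), (1.0, 3.0), (2.0, 1.0), (3.0, 2.0), (4.0, 4.0)]
clustered = [(0.0, 0.0), (0.5, 0.2), (1.0, 3.0)]   # occupies few cells
spread = [(0.0, 0.0), (2.0, 1.0), (4.0, 4.0)]      # spans more cells
```

A group spanning more cells per dimension scores higher, matching the intent that selected majority samples be discretely distributed.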
According to one possible embodiment, determining the dispersion difference degree L_f for the descriptive features of the n_f dimensions further comprises: for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and synthesizing the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_f of the m majority class samples by using the following formula:
wherein n_f is the dimension of the descriptive features, the counting operator in the formula represents the number of distinct elements in a set, and the set represents the descriptive features of the m majority class samples in the same dimension.
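The formula for L_f is likewise only an image in the source; as a hedged sketch of distinct-element counting per descriptive dimension (the per-group normalization used here is an assumption):

```python
def descriptive_dispersion(group):
    # For each of the n_f descriptive dimensions, count the distinct
    # elements among the m group samples, normalize by m, and average
    # over the dimensions.
    m = len(group)
    n_f = len(group[0])
    return sum(len({x[t] for x in group}) / m for t in range(n_f)) / n_f

# Descriptive features: (weekday, category); values are illustrative.
diverse = [("Mon", "A"), ("Tue", "B"), ("Wed", "C")]
uniform = [("Mon", "A"), ("Mon", "A"), ("Mon", "B")]
```

Groups whose date/category/text features repeat less score higher, i.e. they carry more distinct sample information.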
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: according to the numerical features, determining the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the method further comprises: for each minority class sample, selecting any m majority class samples from all the M majority class samples as a majority class sample group to obtain C(M, m) majority class sample groups; determining the difference degrees S_m of the C(M, m) majority class sample groups, and selecting the majority class sample group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the method further comprises: for any one minority class sample, calculating the distances between the M majority class samples and the minority class sample according to the numerical features; sorting the M majority class samples according to the distances to obtain a majority class sample sequence; selecting q majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples that are adjacent to each other in the majority class sample sequence; and determining the difference degrees S_m of the q majority class sample groups, and selecting the majority class sample group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by using the following formula: d_ij = sqrt( Σ_{t=1..n_s} (x_it − y_jt)^2 )
wherein the majority class sample X_i includes the numerical features (x_i1, x_i2, ..., x_in_s) of the n_s dimensions, the minority class sample Y_j includes the numerical features (y_j1, y_j2, ..., y_jn_s) of the n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
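The Euclidean distance itself is standard; a minimal self-contained sketch:

```python
import math

def euclidean_distance(x_i, y_j):
    # d_ij over the n_s shared numerical dimensions of majority sample X_i
    # and minority sample Y_j.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(x_i, y_j)))

d = euclidean_distance((3.0, 4.0), (0.0, 0.0))  # classic 3-4-5 triangle
```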
In a second aspect, a data processing apparatus is provided, including: an acquisition module, used for acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; a down-sampling module, used for determining m majority class samples which are discretely distributed around each minority class sample according to the dimensional features of the minority class samples and the majority class samples so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; a training module, used for training a classification model according to the minority class samples and the down-sampled majority class samples; and a processing module, used for processing data according to the classification model.
According to one possible implementation, the down-sampling module is further configured to: for any one minority class sample, sample a plurality of combined majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determine the dispersion difference degree L between the m majority class samples contained in each majority class sample group; determine the distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determine the difference degree S_m = L/D of each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determine one of the plurality of combined majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
According to one possible implementation, the down-sampling module is further configured to: the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determine the dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determine the dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determine the dispersion difference degree L between the m majority class samples contained in each majority class sample group according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible implementation, the down-sampling module is further configured to: according to the numerical features of the n_s dimensions of the M majority class samples, divide the value interval of each of the n_s dimensions into a plurality of cells; determine the distribution of the m majority class samples contained in each majority class sample group among the plurality of cells; determine the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and synthesize the degrees of dispersion over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_s of the m majority class samples by using the following formula:
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining symbol denotes the number of the m majority class samples falling in each divided cell.
According to one possible implementation, the down-sampling module is further configured to: for each of the n_f dimensions of the descriptive features, determine the number of distinct elements among the m majority class samples contained in each majority class sample group; determine the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and synthesize the degrees of dispersion over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_f of the m majority class samples by using the following formula:
wherein n_f is the dimension of the descriptive features, the counting operator in the formula represents the number of distinct elements in a set, and the set represents the descriptive features of the m majority class samples in the same dimension.
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: according to the numerical features, determining the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for each minority class sample, select any m majority class samples from all the M majority class samples as a majority class sample group to obtain C(M, m) majority class sample groups; determine the difference degrees S_m of the C(M, m) majority class sample groups; and select the majority class sample group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the apparatus is further configured to: for any one minority class sample, calculate the distances between the M majority class samples and the minority class sample according to the numerical features; sort the M majority class samples according to the distances to obtain a majority class sample sequence; select q majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples that are adjacent to each other in the majority class sample sequence; and determine the difference degrees S_m of the q majority class sample groups, and select the majority class sample group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the apparatus is further configured to: calculate the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by using the following formula: d_ij = sqrt( Σ_{t=1..n_s} (x_it − y_jt)^2 )
wherein the majority class sample X_i includes the numerical features (x_i1, x_i2, ..., x_in_s) of the n_s dimensions, the minority class sample Y_j includes the numerical features (y_j1, y_j2, ..., y_jn_s) of the n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
In a third aspect, a data processing apparatus is provided, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform: the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed by a multicore processor, causes the multicore processor to perform the method of the first aspect.
The embodiments of the application adopt at least one technical scheme that can achieve the following beneficial effects: in this embodiment, all information of the minority class samples acquired from abnormal data is retained, down-sampling is performed on the majority class samples acquired from normal data, and each dimension of sample information is used so that the down-sampled majority class samples are discretely distributed around the minority class samples; therefore, discriminative features can be better learned when training a classification data processing model (such as an information recommendation or image processing model).
It should be understood that the above description is only an overview of the technical solutions of the present invention, so as to clearly understand the technical means of the present invention, and thus can be implemented according to the content of the description. In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will be apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like elements throughout. In the drawings:
FIG. 1 is a flow chart illustrating a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 3 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the present invention, it is to be understood that terms such as "including" or "having," or the like, are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility of the presence of one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Those skilled in the art will appreciate that the described application scenario is only one example in which an embodiment of the present invention may be implemented. The scope of applicability of the embodiments of the present invention is not limited in any way. Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic flow chart diagram of a data processing method 100 for implementing optimized sampling of training samples according to an embodiment of the present application. From the device perspective, the execution subject may be one or more electronic devices; from the program perspective, the execution subject may accordingly be a program loaded on these electronic devices.
As shown in fig. 1, the method 100 may include:
Step 101, acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N;
Step 102, according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples which are discretely distributed around each minority class sample so as to down-sample the majority class samples, wherein m is a positive integer smaller than M;
Step 103, training the classification model according to the minority class samples and the down-sampled majority class samples;
Step 104, processing the data according to the classification model.
The classification model can be any of various classification models, such as a classified information recommendation model, an image classification model, or a transaction data analysis model. In an unbalanced training sample set, the numbers of training samples of different classes differ greatly, and the number of majority class samples far exceeds the number of minority class samples.
For example, in a training sample of an image classification model for lesion recognition, the number of samples of a majority class collected from normal data (such as image data of a healthy organ) is much larger than the number of samples of a minority class collected from abnormal data (such as image data of a diseased organ). For another example, in the training samples of the transaction data analysis model, the number of most types of samples obtained from normal data (such as normal transaction data) is much larger than the number of few types of samples obtained from abnormal data (such as fraudulent transaction data). Each sample contains a plurality of dimensional features such as numerical data, date data, category data, textual description data, and the like.
The present application classifies sample features into two categories: (1) numerical data, which has specific numerical values, can quantify the distance between samples after standardization, and can also be used for model training after sampling; (2) descriptive data, which is difficult to process numerically but serves to distinguish categories, such as date data, category data, and text description data.
In this embodiment, all the minority class samples are retained, each dimensional feature of the samples is fully utilized, and m majority class samples discretely surrounding each minority class sample are selected, so that discriminative features can be better learned during model training. Specifically, the m majority class samples picked for each minority class sample should have the following characteristics: (1) they are as close as possible to the minority class sample, keeping a certain similarity with it; (2) at the same distance from the minority class sample, they are as dispersed as possible, so as to keep more sample information; (3) sample dispersion and sample distance are combined to ensure stable sampling.
In a possible implementation, the step 102 may further include:
Step 201, for any one minority class sample, sampling a plurality of combined majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples, and m is a positive integer smaller than M.
In a possible implementation manner, based on the globally optimal idea, the step 201 may further include: for each minority class sample, selecting any m majority class samples from all the M majority class samples as a majority class sample group to obtain C(M, m) majority class sample groups; determining the difference degrees S_m of the C(M, m) majority class sample groups, and selecting the majority class sample group with the maximum difference degree as the sampling result corresponding to the minority class sample.
In fact, m majority class samples are sampled for each minority class sample, yielding N × m picks in total; considering that the same majority class sample may be picked for different minority class samples, the number of majority class samples finally obtained lies in the interval [m, N × m].
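The deduplicated union described above can be sketched as follows (an illustrative helper, not from the patent):

```python
def merge_sampled_groups(groups):
    # Union of the per-minority-sample sampling results: a majority sample
    # picked for several minority samples is kept only once, so for N
    # minority samples the final count lies in [m, N * m].
    merged = []
    for group in groups:
        for sample in group:
            if sample not in merged:
                merged.append(sample)
    return merged

# Two minority samples (N = 2), m = 2, sharing one majority sample:
groups = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 1.0), (2.0, 2.0)]]
merged = merge_sampled_groups(groups)
```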
In one possible embodiment, the majority class sample X_i includes the numerical features (x_i1, ..., x_in_s) of the n_s dimensions and the minority class sample Y_j includes the numerical features (y_j1, ..., y_jn_s) of the n_s dimensions. Based on a partially optimal idea, the step 201 may further include: for any one minority class sample Y_j, calculating the distance between each of the M majority class samples and Y_j according to the numerical features; sorting the M majority class samples by that distance to obtain a majority class sample sequence (X_j1, X_j2, ..., X_jM); and selecting q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are adjacent to each other in the sequence. For example: the first group is (X_j1, X_j2, ..., X_jm), the second group is (X_j2, X_j3, ..., X_jm+1), and so on. Further, the difference degrees S_m of the q groups can be determined, and the group with the maximum difference degree selected as the m majority class samples discretely distributed around the minority class sample Y_j.
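The partially optimal strategy can be sketched as follows (the data, q, and the pairwise-distance stand-in for the dispersion difference degree L are assumptions for illustration):

```python
from itertools import combinations
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def partial_optimal_group(majority, minority_sample, m, q):
    # Sort the majority samples by distance to the minority sample, form
    # q candidate groups from consecutive runs of m samples in that order,
    # and keep the run maximizing S_m = L / D (pairwise-distance sum as a
    # simplified stand-in for L).
    ordered = sorted(majority, key=lambda x: euclidean(x, minority_sample))
    best, best_score = None, -1.0
    for start in range(min(q, len(ordered) - m + 1)):
        group = ordered[start:start + m]
        L = sum(euclidean(p, r) for p, r in combinations(group, 2))
        D = sum(euclidean(x, minority_sample) for x in group)
        score = L / D if D > 0 else float("inf")
        if score > best_score:
            best, best_score = group, score
    return best, best_score

majority = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (2.0, 0.0), (5.0, 5.0)]
minority_sample = (0.0, 0.1)
group, score = partial_optimal_group(majority, minority_sample, m=3, q=2)
```

Compared with enumerating all C(M, m) combinations, only q windows over the sorted sequence are scored, which keeps the sampling tractable for large M.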
In one possible embodiment, the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j may be calculated using the following formula:

d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)² ),

where the majority class sample X_i includes the n_s-dimensional numerical feature (x_i^1, ..., x_i^{n_s}), the minority class sample Y_j includes the n_s-dimensional numerical feature (y_j^1, ..., y_j^{n_s}), i = 1, 2, ..., M, and j = 1, 2, ..., N.
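The distance above can be sketched in a few lines; a minimal illustration assuming the numerical features are already normalized vectors of equal length:

```python
import numpy as np

def euclidean_distance(x_num, y_num):
    """Distance d_ij between the n_s-dimensional numerical features of a
    majority class sample x and a minority class sample y."""
    x = np.asarray(x_num, dtype=float)
    y = np.asarray(y_num, dtype=float)
    return float(np.sqrt(((x - y) ** 2).sum()))
```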
Step 202: determining the degree of discrete difference L between the m majority class samples included in each majority class sample group.
In one possible embodiment, the majority class samples and the minority class samples comprise n_s-dimensional numerical features and/or n_f-dimensional descriptive features. Step 202 may further include:

for the n_s-dimensional numerical features, determining the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group; and/or, for the n_f-dimensional descriptive features, determining the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group; and determining the overall degree of discrete difference L between the m majority class samples from L_s and/or L_f, for example: L = α_1·L_s + α_2·L_f, where the weights α_1 and α_2 can be obtained from historical data.
In this embodiment, the degree of discrete difference L between the m majority class samples in each group may be obtained from the numerical features, the descriptive features, or a combination of both, and may be adjusted according to the actual situation.
In one possible embodiment, determining the degree of discrete difference L_s for the n_s-dimensional numerical features may further include: according to the n_s-dimensional numerical features of the M majority class samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining how the m majority class samples contained in each majority class sample group are distributed among the cells; determining, from this distribution, the degree of dispersion of the m majority class samples in each of the n_s dimensions; and synthesizing the per-dimension degrees of dispersion to obtain the degree of discrete difference L_s of the m majority class samples.
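The cell-division step can be sketched as follows. The patent's exact per-dimension dispersion formula is shown only as an image in the source, so the measure below is an assumption: the entropy of the per-cell counts normalized by log k, which, like the described measure, attains its maximum value 1 when the m samples spread uniformly over the k cells.

```python
import numpy as np

def dimension_dispersion(values, bin_edges):
    """Dispersion of m sampled values over the cells of one numerical
    dimension. Stand-in measure (assumed): normalized entropy of the
    per-cell counts, equal to 1 when samples are uniform over all cells."""
    counts, _ = np.histogram(values, bins=bin_edges)
    p = counts[counts > 0] / counts.sum()
    k = len(bin_edges) - 1
    if k <= 1:
        return 1.0
    return float(-(p * np.log(p)).sum() / np.log(k))

def numerical_dispersion(samples, edges_per_dim):
    """L_s: average of the per-dimension dispersions over the n_s
    numerical dimensions (averaging is an assumed aggregation)."""
    dims = [dimension_dispersion(samples[:, t], edges_per_dim[t])
            for t in range(samples.shape[1])]
    return float(np.mean(dims))
```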
For example, for a minority class sample Y_j, let the m majority class samples of the majority class sample group whose degree of discrete difference is currently being calculated be (X_j1, X_j2, ..., X_jm), with numerical features:
…;
In fact, each dimension of the numerical feature is a numerical value and, after normalization, has a fixed value interval. Considering the numerical features of all M majority class samples, the value interval of each dimension is divided into a number of cells, and the distribution of the m sampled majority class samples over these cells indicates how discrete the samples are. Taking the first dimension of the numerical feature as an example, its value interval over all M majority class samples is divided into k_1 cells. The first-dimension numerical features of the m majority class samples fall into the k_1 cells with counts c_1, c_2, ..., c_{k_1}, where c_1 + c_2 + ... + c_{k_1} = m. The degree of dispersion of these counts over the k_1 cells can then be determined, for example, by a measure L_s^1 whose value lies in (0, 1] and which takes its maximum value 1 when the m samples are uniformly distributed over the k_1 cells.
For the minority class sample Y_j, the degree of dispersion L_s of the numerical features of the m majority class samples is defined by synthesizing the per-dimension measures L_s^1, ..., L_s^{n_s}. It can be seen that the value of L_s lies in (0, 1]; the more discrete the numerical features of the majority class samples, the closer L_s is to 1.
In one possible embodiment, determining the degree of discrete difference L_f for the n_f-dimensional descriptive features may further include: for each of the n_f descriptive dimensions, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining, from the number of distinct elements, the degree of dispersion of the m majority class samples in each of the n_f dimensions; and synthesizing the per-dimension degrees of dispersion to obtain the degree of discrete difference L_f of the m majority class samples.
For example, for a minority class sample Y_j, let the m majority class samples of the majority class sample group whose degree of discrete difference is currently being calculated be (X_j1, X_j2, ..., X_jm), with descriptive features:
…;
In fact, each dimension of the descriptive feature reflects an attribute of the sample; for the descriptive feature of a given dimension, the more discrete the attribute distribution of the samples, the more information the samples contain.
For example, let U{x_1, x_2, ..., x_n} denote the number of distinct elements in the set {x_1, x_2, ..., x_n}. The degree of dispersion of the m majority class samples in the first dimension of the descriptive feature can then be determined using the following formula:

L_f^1 = U{x_j1^1, x_j2^1, ..., x_jm^1} / m,

where the value of L_f^1 lies in (0, 1]; the more discrete the descriptive features of the first dimension, the closer L_f^1 is to 1.
For the minority class sample Y_j, the degree of dispersion L_f of the descriptive features of the m sampled majority class samples is defined by synthesizing the per-dimension measures L_f^1, ..., L_f^{n_f}. As can be seen, the value of L_f lies in (0, 1]; the more discrete the descriptive features of the majority class samples, the closer L_f is to 1.
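The distinct-count dispersion for descriptive features can be sketched directly, assuming U{·} counts distinct elements and that the per-dimension values are averaged (the aggregation step appears only as an image in the source):

```python
def descriptive_dispersion(samples):
    """L_f: for each descriptive dimension, the number of distinct values
    among the m sampled majority class samples divided by m, averaged over
    the n_f dimensions (averaging is an assumed aggregation)."""
    m = len(samples)
    n_f = len(samples[0])
    per_dim = [len({row[t] for row in samples}) / m for t in range(n_f)]
    return sum(per_dim) / n_f
```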
Step 203, determining the distance D between any one minority sample and the m majority samples included in each group of majority samples.
For example, consider the m majority class samples (X_ji+1, X_ji+2, ..., X_ji+m) selected from the majority class sample sequence (X_j1, X_j2, ..., X_jM) and the minority class sample Y_j. The sum of distances between all selected m majority class samples and Y_j is determined as

D = β · [d(Y_j, X_ji+1) + … + d(Y_j, X_ji+m)],

where β is a weight parameter.
Next, step 204 is executed: determining the degree of difference of each majority class sample group as S_m = L / D, according to the distance D and the degree of discrete difference L.
In a possible implementation, step 204 may further include: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to that sum of Euclidean distances.
Next, step 205 is executed: according to the degree of difference S_m, one of the multiple majority class sample groups is determined as the m majority class samples discretely distributed around the minority class sample. For example, the majority class sample group with the largest degree of difference S_m is selected as the down-sampling result.
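Step 205 reduces to an argmax over the candidate groups; a minimal sketch with precomputed dispersion values L and distance sums D:

```python
def select_group(groups, L_values, D_values):
    """Return the majority class sample group maximizing S_m = L / D."""
    scores = [L / D for L, D in zip(L_values, D_values)]
    return groups[scores.index(max(scores))]
```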
In one exemplary embodiment, in transaction data, normal transactions are the overwhelming majority and may be taken as majority class samples, while fraudulent transactions are very few and may be taken as minority class samples. All samples mainly contain two types of information: (1) descriptive features: card number, merchant number, date, terminal number, etc.; (2) numerical features: feature data obtained from the transaction message and historical transaction information. The number of majority class samples is denoted M, the number of minority class samples is denoted N, and M is far greater than N.
On this basis, the majority class samples are denoted X_i = (x_i^s, x_i^f), where i = 1, 2, ..., M, and the minority class samples are denoted Y_j = (y_j^s, y_j^f), where j = 1, 2, ..., N; x_i^s and y_j^s are normalized n_s-dimensional numerical feature vectors, and x_i^f and y_j^f are n_f-dimensional descriptive feature vectors.
Each dimension of a sample's numerical feature is a value computed from the transaction message, including features such as the transaction amount, average transaction amount, transaction period interval, and number of transactions, as well as some combined structural numerical features. The Euclidean distance between any one majority class sample and any one minority class sample can then be calculated as d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)² ).
Further, a degree-of-difference function for the majority class samples may be constructed using the numerical and descriptive features of the samples.
For the numerical features of a sample, take the transaction amount, average transaction amount, transaction period interval, and number of transactions as examples. For a minority class sample Y_j, the degree of dispersion of the numerical features of the m majority class samples in the sequence X_j1, X_j2, X_j3, ..., X_jM is calculated as follows. (1) Let amt_ji denote the transaction amount of sample X_ji; its value interval is divided into k_1 cells, and the number of samples falling into each cell is counted. (2) Let avg_amt_ji denote the average transaction amount; its value interval is divided into k_2 cells, again with per-cell counts. (3) Let t_ji denote the transaction period interval; its value interval is divided into k_3 cells. (4) Let nt_ji denote the number of transactions; its value interval is divided into k_4 cells.
For the numerical features, the degree of dispersion L_s of the majority class samples can then be expressed as:
For the descriptive features of a sample, take the card number, merchant number, and transaction date as examples. For a minority class sample Y_j, the degree of dispersion of the descriptive features of the m majority class samples in X_j1, X_j2, X_j3, ..., X_jM is calculated as follows. Let C_ji denote the card number information of sample X_ji, M_ji its merchant information, and T_ji its time information; that is, the descriptive feature of sample X_ji is expressed as (C_ji, M_ji, T_ji).
The degree of dispersion L_f of the descriptive features of the majority class samples can be expressed as:
Combining the numerical features and the descriptive features yields the degree-of-dispersion function of the majority class samples:

L = α_1·L_s + α_2·L_f
Further, consider the distance between the m majority class samples selected from X_j1, X_j2, X_j3, ..., X_jM and the minority class sample Y_j: D = β · [d(Y_j, X_ji+1) + … + d(Y_j, X_ji+m)], i.e., the weighted sum of the distances between all selected m majority class samples and Y_j.
In summary, based on the weighted combination, the degree-of-difference function may be:

S_m = (α_1·L_s + α_2·L_f) / (β · [d(Y_j, X_ji+1) + … + d(Y_j, X_ji+m)]).

Optionally, the weight parameters may be taken as α_1 = α_2 = β = 1 to simplify the calculation, i.e., the degree-of-difference function becomes:

S_m = (L_s + L_f) / [d(Y_j, X_ji+1) + … + d(Y_j, X_ji+m)].
optionally, for a few classes of samples YjThe M majority samples can be sorted from near to far as X according to Euclidean distancej1,Xj2,Xj3,...,XjMFurther, from the majority class sample sequence Xj1,Xj2,Xj3,...,XjMThe first group is that m majority samples are selected: xj1,Xj2,...,Xjm(ii) a Second group: xj2,Xj3,...,Xjm+1(ii) a …, and the likeMultiple sets of majority sample sets can be obtained.
The degree of difference of the m majority class samples in each majority class sample group can then be calculated, and the group with the largest degree of difference S_m is selected as the m majority class samples of the final sampling result.
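The per-minority-sample procedure above (sort by distance, slide a window of m adjacent samples, keep the window maximizing S_m = L / D) can be sketched end to end. The dispersion function is passed in, and the weights α_1, α_2, β are taken as 1 as suggested above:

```python
import math

def downsample_for_minority(y, majority, m, dispersion_fn):
    """Sort the M majority class samples by Euclidean distance to the
    minority sample y, slide a window of m adjacent samples over the
    sorted sequence, and keep the window with the largest S_m = L / D."""
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    ordered = sorted(majority, key=dist)
    best, best_score = None, -1.0
    for i in range(len(ordered) - m + 1):
        window = ordered[i:i + m]
        L = dispersion_fn(window)      # dispersion of the m samples
        D = sum(dist(x) for x in window)  # summed distance to y
        if D > 0 and L / D > best_score:
            best, best_score = window, L / D
    return best
```

With a constant dispersion function the nearest window always wins; a non-trivial dispersion function lets a farther but more spread-out window overtake it.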
In fact, over all N minority class samples, the number of majority class samples sampled from the M majority class samples lies between m and N × m. According to experience in training pseudo-card models, a majority-to-minority sample ratio of 1000:1 is generally chosen, that is, 1000 majority class samples are sampled for each minority class sample. Finally, model training is performed using the sampled majority class samples and all minority class samples.
In another exemplary embodiment, the technical effect of the data processing method of the present application is described by example. For a financial institution's magnetic-stripe-card online environment over the 90 days spanning April, May, June, and July 2019, there were on average 175,538 transactions per day and 422 fraudulent transactions in total, a majority-to-minority sample ratio of about 37437:1; the majority class samples were down-sampled and the model trained according to the scheme of the above embodiment. All majority class samples can first be down-sampled at a sampling ratio of 50:1, sampling 3510 transactions per day. Specifically, assuming each sample has 500-dimensional numerical features and 21-dimensional descriptive information, 21 numerical features are selected (e.g., 2 merchant statistical features, 3 card-dimension statistical features, 6 real-time statistical features, and 10 historical statistical features) in consideration of factors such as the saturation and interval distribution of each dimension, and 4 combined features are formed from 2 of the descriptive dimensions (e.g., merchant number and total transaction amount), before performing the down-sampling process of the above embodiment.
(1) Sample information saturation: referring to Table 1, of the 21 selected dimensional features obtained by the down-sampling scheme of the present application, 19 have higher saturation than under the random sampling scheme, with a maximum improvement of 40.7% and 7 features improving by more than 30%, while two have slightly lower saturation. Of all 500 dimensions, 436 have higher saturation than under random sampling.
Table 1:
(2) Sample feature segment distribution: referring to Table 2, the segment distribution differences of 20 of the 21 dimensional features obtained by the down-sampling scheme of the present application are all smaller than those of the random sampling scheme, with a maximum reduction of 37.5% and 4 features reduced by more than 30%; 1 feature's segment distribution difference is slightly larger than under the random sampling scheme.
Table 2:
(3) Descriptive information combination ratio: the assumed descriptive information combinations are: merchant number + total transaction amount within 5 minutes, within 15 minutes, within 120 minutes, and within 1 day. Referring to Table 3, the degree of dispersion of the descriptive information sampled by the down-sampling scheme of the present application is greater than that of the random sampling scheme.
Table 3:
Index | Down-sampling scheme | Random sampling scheme | Difference
Merchant number + 5 min total transaction amount | 76.5% | 74.8% | 1.7%
Merchant number + 15 min total transaction amount | 78.7% | 76.1% | 2.6%
Merchant number + 120 min total transaction amount | 81.8% | 79.0% | 2.8%
(4) Model training effect: the data of April, May, and June 2019 (383 minority class samples) were fed into an xgboost model with the following parameters: max_depth: 3; eta: 0.01; min_child_weight: 6; gamma: 0.1; lambda: 10; subsample: 0.8.
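The listed hyper-parameters map directly onto an xgboost parameter dictionary; a sketch of the configuration, in which the objective and evaluation metric are assumptions (the patent reports train-auc/val-auc for a binary fraud model but does not name them):

```python
# Hyper-parameters as listed above; objective and eval_metric are assumed.
xgb_params = {
    "objective": "binary:logistic",  # assumed: fraud vs. normal transactions
    "eval_metric": "auc",            # assumed: train-auc / val-auc are reported
    "max_depth": 3,
    "eta": 0.01,
    "min_child_weight": 6,
    "gamma": 0.1,
    "lambda": 10,
    "subsample": 0.8,
}
# Usage would be along the lines of:
#   booster = xgboost.train(xgb_params, dtrain, num_boost_round=n_trees,
#                           evals=[(dtrain, "train"), (dval, "val")])
```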
With the same parameters, see Table 4; the performance of the model on the training set is as follows:
table 4:
Scheme | train-auc | val-auc | trees
Down-sampling scheme | 0.941778 | 0.891961 | 568
Random sampling scheme | 0.936219 | 0.891842 | 515
Referring to Table 5, the extrapolated validation effect on the July data (38 minority class samples) is:
table 5:
Recall rate | Accuracy rate (down-sampling scheme) | Accuracy rate (random sampling scheme)
5.1% | 28.6% | 33.3%
10.3% | 40.0% | 19.0%
15.4% | 10.0% | 14.3%
Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus, configured to execute the data processing method provided in any of the above embodiments. Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 includes:
an obtaining module 301, configured to obtain a training sample set, where the training sample set includes M majority class samples obtained from normal data and N minority class samples obtained from abnormal data, M and N are positive integers, and M is greater than N;
a down-sampling module 302, configured to determine m majority class samples discretely distributed around each minority class sample according to the dimensional features of the minority class samples and the majority class samples, so as to down-sample the majority class samples, where m is smaller than M and is a positive integer;
and the training module 303 is configured to train the classification model according to the minority class samples and the down-sampled majority class samples.
And the processing module 304 is used for processing the data according to the classification model.
According to one possible implementation, the down-sampling module 302 is further configured to: sampling multiple combined multiple sample groups from the M multiple samples according to any one of the multiple samples, wherein each multiple sample group comprises M multiple samples; determining discrete differences between m majority samples contained in each set of majority samplesA degree L; determining the distance D between any one minority sample and m majority samples contained in each group of majority samples; determining the difference degree S of each group of majority sample groups according to the distance D and the dispersion difference degree LmL/D; according to the degree of difference SmOne of the plurality of combined majority class sample sets is determined as m majority class samples discretely distributed around any one of the minority class samples.
According to one possible implementation, the down-sampling module 302 is further configured to: majority class samples and minority class samples comprise nsNumerical features of dimension and/or nfDescriptive data of the dimension; for nsThe numerical characteristics of the dimension determine the degree of discrete difference L between the m majority samples contained in each group of majority sampless(ii) a And/or, for nfThe descriptive characteristics of the dimension determine the discrete difference L between the m majority samples contained in each group of majority samplesf(ii) a And, according to the degree of variance LsAnd/or a degree of variance LfThe degree of discrete difference L between the m majority samples contained in each set of majority samples is determined.
According to one possible implementation, the down-sampling module 302 is further configured to: n from M majority samplessNumerical characteristics of the dimension, at nsDividing the numerical value interval into a plurality of small intervals on each dimension of the dimension; determining the distribution situation of m majority samples contained in each group of majority samples among a plurality of cells; determining m majority samples at n according to distribution conditionssDegree of dispersion in each dimension of a dimension; synthesizing m majority samples in nsDimension dispersion degree, obtaining dispersion difference degree L of m most sampless。
According to one possible implementation, the down-sampling module 302 is further configured to: determining the discrete difference L of m majority samples by using the following formulas:
Wherein n issIs a number ofDimension, k, of a value-type featuretTo aim at nsThe number of cells divided by each dimension in the dimension,the number of the majority samples of the m majority samples falling in each divided cell is shown.
According to one possible implementation, the down-sampling module 302 is further configured to: for nfDetermining the number of minority class elements of m majority class samples contained in each group of majority class sample groups for each dimension in the descriptive feature of the dimension; determining m majority samples at n according to the number of minority elementsfDegree of dispersion in each dimension of a dimension; synthesizing m majority samples in nfDegree of dimension dispersion, obtaining degree of dispersion L of m majority samplesf。
According to one possible implementation, the down-sampling module 302 is further configured to: determining the discrete difference L of m majority samples by using the following formulaf:
Wherein n isfIn order to describe the dimensions of a type feature,representation collectionNumber of different elements in, setThe descriptive features representing m majority class samples are directed to a feature set of the same dimension.
According to one possible implementation, determining the degree of difference of each majority class sample group according to the distance and the degree of discrete difference comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to that sum of Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for any one minority class sample, select any m majority class samples from all the M majority class samples as a majority class sample group, obtaining all possible majority class sample groups (of which there are C(M, m)); determine the degree of difference S_m of these sample groups; and select the majority class sample group with the largest degree of difference as the sampling result corresponding to that minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise n_s-dimensional numerical features, and the apparatus is further configured to: for any one minority class sample, calculate the distance between each of the M majority class samples and that minority class sample according to the numerical features; sort the M majority class samples by distance to obtain a majority class sample sequence; select q majority class sample groups from the sequence, where each group comprises m majority class samples that are adjacent in the sequence; determine the degree of difference S_m of the q majority class sample groups; and select the group with the largest degree of difference as the sampling result corresponding to that minority class sample.
According to one possible embodiment, the method further comprises calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)² ),

where the majority class sample X_i includes the n_s-dimensional numerical feature (x_i^1, ..., x_i^{n_s}), the minority class sample Y_j includes the n_s-dimensional numerical feature (y_j^1, ..., y_j^{n_s}), i = 1, 2, ..., M, and j = 1, 2, ..., N.
It should be noted that the data processing apparatus in the embodiment of the present application may implement each process of the foregoing embodiment of the data processing method, and achieve the same effect and function, which is not described herein again.
Fig. 4 shows a data processing device according to an embodiment of the present application, configured to execute the data processing method shown in Fig. 1. The device includes: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the data processing method shown in the above embodiments.

According to some embodiments of the present application, there is provided a non-volatile computer storage medium for the data processing method, having stored thereon computer-executable instructions configured to, when executed by a processor, perform the data processing method illustrated in the above embodiments.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device, and computer-readable storage medium embodiments, the description is simplified because they are substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for their relevance.
The apparatus, the device, and the computer-readable storage medium provided in the embodiment of the present application correspond to the method one to one, and therefore, the apparatus, the device, and the computer-readable storage medium also have advantageous technical effects similar to those of the corresponding method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments; nor does the division into aspects imply that features in those aspects cannot be combined to benefit, this division being for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (24)
1. A data processing method, comprising:
acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired according to normal data and N minority class samples acquired according to abnormal data, M and N are positive integers, and M is greater than N;
determining, according to the dimensional features of the minority class samples and the majority class samples, m majority class samples discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is smaller than M and is a positive integer;
training a classification model according to the minority samples and the majority samples after down-sampling;
and processing the data according to the classification model.
2. The method of claim 1, wherein determining m majority samples discretely distributed around each minority sample further comprises:
sampling multiple majority class sample groups from the M majority class samples for any one of the minority class samples, wherein each majority class sample group comprises m majority class samples;
determining a discrete difference degree L between m majority samples contained in each group of majority samples;
determining a distance D between the arbitrary one minority sample and the m majority samples included in each of the plurality of sets of majority samples;
determining the degree of difference S_m = L / D of each majority class sample group according to the distance D and the degree of discrete difference L;
according to the degree of difference S_m, determining one of the multiple majority class sample groups as the m majority class samples discretely distributed around the any one minority class sample.
3. The method of claim 2, wherein determining the degree of discrete difference L between the m majority samples included in each set of majority samples comprises:
the dimensional features of the majority class samples and the minority class samples comprise n_s-dimensional numerical features and/or n_f-dimensional descriptive features;
determining, for the n_s-dimensional numerical features, a discrete difference degree L_s between the m majority samples contained in each group of majority samples; and/or,
determining, for the n_f-dimensional descriptive features, a discrete difference degree L_f between the m majority samples contained in each group of majority samples; and,
determining, according to the discrete difference degree L_s and/or the discrete difference degree L_f, the discrete difference degree L between the m majority samples contained in each group of majority samples.
4. The method of claim 3, wherein determining, for the n_s-dimensional numerical features, the discrete difference degree L_s between the m majority samples contained in each group of majority samples further comprises:
dividing, according to the n_s-dimensional numerical features of the M majority samples, the value interval on each of the n_s dimensions into a plurality of cells;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions;
and synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
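A hedged Python sketch of these four steps for one group. The bin count and the occupied-cell-fraction measure are assumptions, since the patent's formula (claim 5) is given only as an image:

```python
def binned_dispersion(all_values, group_values, n_bins=4):
    """Assumed degree of dispersion along one numerical dimension: the
    fraction of cells (built from the value range of all M majority
    samples) that the group's values occupy."""
    lo, hi = min(all_values), max(all_values)
    width = (hi - lo) / n_bins or 1.0                 # guard a degenerate range
    occupied = {min(int((v - lo) / width), n_bins - 1) for v in group_values}
    return len(occupied) / n_bins

def L_s(all_samples, group, n_bins=4):
    """Synthesize the per-dimension dispersions over the n_s dimensions (mean)."""
    dims = range(len(all_samples[0]))
    per_dim = [binned_dispersion([s[d] for s in all_samples],
                                 [g[d] for g in group], n_bins) for d in dims]
    return sum(per_dim) / len(per_dim)

all_dim = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # one dimension of all M samples
print(binned_dispersion(all_dim, [0.5, 3.0, 5.0, 7.5]))  # spread out: 1.0
print(binned_dispersion(all_dim, [3.0, 3.1, 3.2, 3.3]))  # clustered: 0.25
```

A group whose members fall in many different cells scores high, matching the claim's intent that the selected majority samples cover the feature space around the minority sample.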
5. The method of claim 4, further comprising:
determining the discrete difference degree L_s of the m majority samples by using the following formula:
6. The method of claim 3, wherein determining, for the n_f-dimensional descriptive features, the discrete difference degree L_f between the m majority samples contained in each group of majority samples further comprises:
determining, for each dimension of the n_f-dimensional descriptive features, the number of minority-class elements of the m majority samples contained in each group of majority samples;
determining, according to the number of minority-class elements, the degree of dispersion of the m majority samples in each of the n_f dimensions;
and synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
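One plausible Python reading of this claim, counting distinct category values per dimension. The exact formula (claim 7) is an image, so the distinct-value ratio below is an assumption:

```python
def descriptive_dispersion(values):
    """Assumed degree of dispersion along one descriptive (categorical)
    dimension: distinct values relative to group size."""
    return len(set(values)) / len(values)

def L_f(group_rows):
    """Synthesize the per-dimension dispersions over the n_f dimensions (mean)."""
    dims = range(len(group_rows[0]))
    per_dim = [descriptive_dispersion([r[d] for r in group_rows]) for d in dims]
    return sum(per_dim) / len(per_dim)

# Four majority samples with n_f = 2 descriptive dimensions.
group = [("card", "online"), ("card", "pos"), ("cash", "online"), ("card", "online")]
print(L_f(group))  # 0.5: each dimension has 2 distinct values among 4 samples
```

A group drawn from many categories scores higher than one whose members all share the same category values.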
7. The method of claim 6, further comprising:
determining the discrete difference degree L_f of the m majority samples by using the following formula:
8. The method of claim 5, wherein determining the difference degree of each group of majority samples according to the distance and the discrete difference degree comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples;
and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of the Euclidean distances.
9. The method of claim 2, further comprising:
for each of the minority samples, selecting any m majority samples from all the M majority samples as one group of majority samples, to obtain a plurality of groups of majority samples.
10. The method of claim 2, wherein the majority class samples and the minority class samples comprise n_s-dimensional numerical features, the method further comprising:
for any one minority sample, calculating the distance between the M majority samples and the any one minority sample according to the numerical characteristic;
sorting the M majority samples according to the distance to obtain a majority sample sequence;
selecting q groups of majority sample groups from the majority sample sequence, wherein each group of majority sample groups comprises m majority samples, and the m majority samples are adjacent to each other in the majority sample sequence;
determining the difference degrees S_m of the q groups of majority samples, and selecting the group of majority samples with the maximum difference degree as the down-sampling result corresponding to the any one minority sample.
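A Python sketch of this cheaper candidate-generation scheme; the sort key is plain Euclidean distance, and q is capped so the windows fit inside the sequence:

```python
import math

def sliding_window_groups(majority, y, m, q):
    """Sort the M majority samples by distance to minority sample y, then
    take q candidate groups of m samples that are adjacent in that sorted
    sequence (far cheaper than enumerating all C(M, m) groups)."""
    order = sorted(majority, key=lambda x: math.dist(x, y))
    q = min(q, len(order) - m + 1)                    # a window cannot run past the end
    return [tuple(order[i:i + m]) for i in range(q)]

majority = [(9.0, 9.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
groups = sliding_window_groups(majority, y=(0.0, 0.0), m=2, q=3)
print(groups)  # q windows over the distance-sorted sequence
```

Each candidate group's difference degree S_m can then be scored as in claim 2 and the maximiser kept as the sampling result for that minority sample.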
11. The method of claim 8 or 10, further comprising:
calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:

d_ij = sqrt( Σ_{k=1}^{n_s} (x_ik − y_jk)² )
12. A data processing apparatus, comprising:
an obtaining module, configured to obtain a training sample set, wherein the training sample set comprises M majority samples obtained from normal data and N minority samples obtained from abnormal data, wherein M and N are positive integers and M is greater than N;
a down-sampling module, configured to determine, according to the dimensional features of the minority samples and the majority samples, m majority samples discretely distributed around each minority sample, so as to down-sample the majority samples, wherein m is smaller than M and is a positive integer;
the training module is used for training a classification model according to the minority class samples and the majority class samples after down-sampling;
and the processing module is used for processing the data according to the classification model.
13. The apparatus of claim 12, wherein the downsampling module is further configured to:
for any one minority sample, sampling a plurality of groups of majority samples from the M majority samples, wherein each group of majority samples comprises m majority samples;
determining a discrete difference degree L between m majority samples contained in each group of majority samples;
determining a distance D between the arbitrary one minority sample and the m majority samples included in each of the plurality of sets of majority samples;
determining a difference degree S_m = L/D of each group of majority samples according to the distance D and the discrete difference degree L;
and determining, according to the difference degree S_m, one of the plurality of groups of majority samples as the m majority samples discretely distributed around the any one minority sample.
14. The apparatus of claim 13, wherein the downsampling module is further configured to:
the majority class samples and the minority class samples comprise n_s-dimensional numerical features and/or n_f-dimensional descriptive features;
determining, for the n_s-dimensional numerical features, a discrete difference degree L_s between the m majority samples contained in each group of majority samples; and/or,
determining, for the n_f-dimensional descriptive features, a discrete difference degree L_f between the m majority samples contained in each group of majority samples; and,
determining, according to the discrete difference degree L_s and/or the discrete difference degree L_f, the discrete difference degree L between the m majority samples contained in each group of majority samples.
15. The apparatus of claim 14, wherein the downsampling module is further configured to:
dividing, according to the n_s-dimensional numerical features of the M majority samples, the value interval on each of the n_s dimensions into a plurality of cells;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions;
and synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
16. The apparatus of claim 15, wherein the downsampling module is further configured to:
determining the discrete difference degree L_s of the m majority samples by using the following formula:
17. The apparatus of claim 14, wherein the downsampling module is further configured to:
determining, for each dimension of the n_f-dimensional descriptive features, the number of minority-class elements of the m majority samples contained in each group of majority samples;
determining, according to the number of minority-class elements, the degree of dispersion of the m majority samples in each of the n_f dimensions;
and synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
18. The apparatus of claim 17, wherein the downsampling module is further configured to:
determining the discrete difference degree L_f of the m majority samples by using the following formula:
19. The apparatus of claim 13, wherein determining the difference degree of each group of majority samples according to the distance and the discrete difference degree comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples;
and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of the Euclidean distances.
20. The apparatus of claim 13, further comprising:
for each of the minority samples, selecting any m majority samples from all the M majority samples as one group of majority samples, to obtain a plurality of groups of majority samples.
21. The apparatus of claim 13, wherein the majority class samples and the minority class samples comprise n_s-dimensional numerical features, the apparatus further comprising:
for any one minority sample, calculating the distance between the M majority samples and the any one minority sample according to the numerical characteristic;
sorting the M majority samples according to the distance to obtain a majority sample sequence;
selecting q groups of majority sample groups from the majority sample sequence, wherein each group of majority sample groups comprises m majority samples, and the m majority samples are adjacent to each other in the majority sample sequence;
determining the difference degrees S_m of the q groups of majority samples, and selecting the group of majority samples with the maximum difference degree as the down-sampling result corresponding to the any one minority sample.
22. The apparatus of claim 19 or 21, further comprising:
calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:

d_ij = sqrt( Σ_{k=1}^{n_s} (x_ik − y_jk)² )
23. A data processing apparatus, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform: the method of any one of claims 1-11.
24. A computer-readable storage medium storing a program that, when executed by a multi-core processor, causes the multi-core processor to perform the method of any of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010743665.7A CN112001425B (en) | 2020-07-29 | 2020-07-29 | Data processing method, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112001425A true CN112001425A (en) | 2020-11-27 |
CN112001425B CN112001425B (en) | 2024-05-03 |
Family
ID=73464171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010743665.7A Active CN112001425B (en) | 2020-07-29 | 2020-07-29 | Data processing method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112001425B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066540A (en) * | 2021-03-19 | 2021-07-02 | 新疆大学 | Method for preprocessing non-equilibrium fault sample of oil-immersed transformer |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140032450A1 (en) * | 2012-07-30 | 2014-01-30 | Choudur Lakshminarayan | Classifying unclassified samples |
CN103645249A (en) * | 2013-11-27 | 2014-03-19 | 国网黑龙江省电力有限公司 | Online fault detection method for reduced set-based downsampling unbalance SVM (Support Vector Machine) transformer |
US20150088791A1 (en) * | 2013-09-24 | 2015-03-26 | International Business Machines Corporation | Generating data from imbalanced training data sets |
US20170372222A1 (en) * | 2016-06-24 | 2017-12-28 | Varvara Kollia | Technologies for detection of minority events |
CN108628971A (en) * | 2018-04-24 | 2018-10-09 | 深圳前海微众银行股份有限公司 | File classification method, text classifier and the storage medium of imbalanced data sets |
CN109978009A (en) * | 2019-02-27 | 2019-07-05 | 广州杰赛科技股份有限公司 | Behavior classification method, device and storage medium based on wearable intelligent equipment |
Non-Patent Citations (2)
Title |
---|
WEI-CHAO LIN et al.: "Clustering-based undersampling in class-imbalanced data", Information Sciences, 31 October 2017 (2017-10-31), pages 17 - 26 * |
YU Yanli; JIANG Kaizhong; WANG Ke; SHENG Jingwen: "An under-sampling algorithm for imbalanced data based on improved K-means clustering", Software Guide (软件导刊), no. 06, 15 June 2020 (2020-06-15), pages 211 - 215 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066540A (en) * | 2021-03-19 | 2021-07-02 | 新疆大学 | Method for preprocessing non-equilibrium fault sample of oil-immersed transformer |
Also Published As
Publication number | Publication date |
---|---|
CN112001425B (en) | 2024-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10572885B1 (en) | Training method, apparatus for loan fraud detection model and computer device | |
CN111652710B (en) | Personal credit risk assessment method based on integrated tree feature extraction and Logistic regression | |
CN108475393A (en) | The system and method that decision tree is predicted are promoted by composite character and gradient | |
CN107194803A (en) | A kind of P2P nets borrow the device of borrower's assessing credit risks | |
Alp et al. | CMARS and GAM & CQP—modern optimization methods applied to international credit default prediction | |
JP7472496B2 (en) | Model generation device, model generation method, and recording medium | |
AU2018101523A4 (en) | A personal credit scoring model based on machine learning method | |
CN110796539A (en) | Credit investigation evaluation method and device | |
CN114328461A (en) | Big data analysis-based enterprise innovation and growth capacity evaluation method and system | |
CN112270596A (en) | Risk control system and method based on user portrait construction | |
CN116433081A (en) | Enterprise scientific potential evaluation method, system and computer readable storage medium | |
CN112508684B (en) | Collecting-accelerating risk rating method and system based on joint convolutional neural network | |
CN112001425A (en) | Data processing method and device and computer readable storage medium | |
CN112365352A (en) | Anti-cash-out method and device based on graph neural network | |
CN116800831A (en) | Service data pushing method, device, storage medium and processor | |
Kašćelan et al. | Hybrid support vector machine rule extraction method for discovering the preferences of stock market investors: Evidence from Montenegro | |
CN113177733B (en) | Middle and small micro enterprise data modeling method and system based on convolutional neural network | |
Caplescu et al. | Will they repay their debt? Identification of borrowers likely to be charged off | |
Wu | Real-time predictive analysis of loan risk with intelligent monitoring and machine learning technique | |
Terzi et al. | Comparison of financial distress prediction models: Evidence from turkey | |
Thripuranthakam et al. | Stock Market Prediction Using Machine Learning and Twitter Sentiment Analysis: A Survey | |
Si et al. | Credit Risk Assessment by a Comparison Application of Two Boosting Algorithms | |
Andersson et al. | Probability of Default Machine Learning Modeling: A Stress Testing Evaluation | |
Pradnyana et al. | Loan Default Prediction in Microfinance Group Lending with Machine Learning | |
Pang | Big Data Analysis Method based on Statistical Machine Learning: A Case Study of Financial Data Modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||