CN112001425A - Data processing method and device and computer readable storage medium - Google Patents

Data processing method and device and computer readable storage medium

Info

Publication number
CN112001425A
CN112001425A CN202010743665.7A
Authority
CN
China
Prior art keywords
majority
samples
sample
dimension
minority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010743665.7A
Other languages
Chinese (zh)
Other versions
CN112001425B (en)
Inventor
马振伟
邹勇
林芃
孙浩然
肖鹰东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202010743665.7A priority Critical patent/CN112001425B/en
Publication of CN112001425A publication Critical patent/CN112001425A/en
Application granted granted Critical
Publication of CN112001425B publication Critical patent/CN112001425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a data processing method, a device, a system and a computer readable storage medium, wherein the method comprises the following steps: acquiring a training sample set, wherein the training sample set comprises M majority class samples and N minority class samples, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model. With this method, all information of the minority class samples is retained, and the down-sampled majority class samples are discretely distributed around the minority class samples, so that discriminative features are better preserved and a classification model with a more accurate classification effect can be trained.

Description

Data processing method and device and computer readable storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method and device and a computer readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In machine learning modeling there are many unbalanced data sets, that is, data sets in which the proportions of samples in different categories differ greatly. For example, in classified information recommendation, image processing, and transaction data analysis models, the proportion of abnormal samples may be only one in ten thousand, or even one in several hundred thousand.
Two methods are most common for dealing with unbalanced data: oversampling and undersampling. The former keeps all the majority class samples and randomly samples the minority class samples with replacement; the latter keeps all the minority class samples and randomly samples part of the majority class samples without replacement. Both aim to make the final class proportions less unbalanced. However, such random sampling may lose sample information and prevent the model from learning discriminative features, thereby degrading the model and, in turn, the accuracy of data processing.
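For orientation, the two baseline strategies can be sketched in a few lines (a minimal illustration using NumPy; the array shapes and names are ours, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(majority, minority):
    # keep all majority samples; resample the minority WITH replacement
    idx = rng.integers(0, len(minority), size=len(majority))
    return majority, minority[idx]

def random_undersample(majority, minority):
    # keep all minority samples; sample the majority WITHOUT replacement
    idx = rng.choice(len(majority), size=len(minority), replace=False)
    return majority[idx], minority

majority = rng.normal(0, 1, size=(1000, 4))  # M = 1000 "normal" samples
minority = rng.normal(3, 1, size=(10, 4))    # N = 10 "abnormal" samples
maj_u, min_u = random_undersample(majority, minority)
```

Both strategies balance the classes, but as the passage notes, the samples to discard (or duplicate) are chosen blindly, which is what motivates the patent's dispersion-aware down-sampling.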
Disclosure of Invention
In view of the above problems in the prior art, a data processing method, an apparatus and a computer-readable storage medium are provided, by which the above problems can be solved.
The present invention provides the following.
In a first aspect, a data processing method is provided, including: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the down-sampled majority class samples; and processing data according to the classification model.
According to one possible embodiment, determining the m majority class samples discretely distributed around each minority class sample further comprises: for any one minority class sample, sampling a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determining a dispersion difference degree L between the m majority class samples contained in each majority class sample group; determining a distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determining a difference degree S_m = L / D for each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determining one of the candidate majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
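The claimed selection can be sketched end to end as follows. The dispersion function here is a stand-in (mean per-dimension standard deviation) for the patent's L; only the structure (score every candidate group by S_m = L / D and keep the best) follows the text:

```python
import numpy as np
from itertools import combinations

def dispersion(group):
    # stand-in for the dispersion difference degree L of a group
    return float(np.mean(np.std(group, axis=0)))

def select_group(minority_sample, majority, m):
    """Return indices of the m majority samples maximizing S_m = L / D."""
    best_score, best_idx = -np.inf, None
    for idx in combinations(range(len(majority)), m):
        group = majority[list(idx)]
        L = dispersion(group)                                      # dispersion L
        D = np.linalg.norm(group - minority_sample, axis=1).sum()  # distance D
        if L / D > best_score:
            best_score, best_idx = L / D, idx
    return best_idx

maj = np.array([[0., 0.], [0., 10.], [10., 0.], [10., 10.], [5., 5.]])
idx = select_group(np.array([5., 5.]), maj, m=3)
```

Enumerating every combination is the globally optimal variant described later; the locally optimal variant restricts the candidate groups.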
According to one possible embodiment, determining the dispersion difference degree L between the m majority class samples included in each majority class sample group comprises: the dimensional features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determining a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determining a dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determining the dispersion difference degree L between the m majority class samples according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible embodiment, determining the dispersion difference degree L_s for the numerical features of the n_s dimensions further comprises: according to the numerical features of the n_s dimensions of the M majority class samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining the distribution of the m majority class samples contained in each majority class sample group among the cells; determining, according to the distribution, the dispersion degree of the m majority class samples in each of the n_s dimensions; and synthesizing the dispersion degrees over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_s of the m majority class samples by a formula (rendered as an image in the original publication) in which n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining term denotes the number of the m majority class samples falling in each divided cell.
According to one possible embodiment, determining the dispersion difference degree L_f for the descriptive features of the n_f dimensions further comprises: for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining, according to the number of distinct elements, the dispersion degree of the m majority class samples in each of the n_f dimensions; and synthesizing the dispersion degrees over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the dispersion difference degree L_f of the m majority class samples by a formula (rendered as an image in the original publication) in which n_f is the dimension of the descriptive features and the remaining term denotes the number of distinct elements in the set formed by the descriptive features of the m majority class samples in the same dimension.
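The L_f formula is likewise an image in the source; under the stated description (count distinct elements per descriptive dimension), one assumed normalization is:

```python
def dispersion_descriptive(group_rows):
    """Assumed form of L_f: average, over the n_f descriptive dimensions,
    of the number of distinct elements divided by the group size m.

    group_rows: list of m tuples of descriptive values, e.g.
    [("weekday", "food"), ("weekend", "travel"), ...].
    """
    m = len(group_rows)
    n_f = len(group_rows[0])
    distinct = [len({row[t] for row in group_rows}) for t in range(n_f)]
    return sum(c / m for c in distinct) / n_f  # all identical -> 1/m, all distinct -> 1
```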
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the method further comprises: for each minority class sample, selecting any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determining the difference degree S_m of the C(M, m) majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the method further comprises: for any one minority class sample, calculating the distances between the M majority class samples and the minority class sample according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence; selecting q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence; determining the difference degree S_m of the q majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
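The locally optimal variant avoids enumerating all C(M, m) combinations: sort the majority samples by distance to the minority sample and score only q windows of m consecutive samples. A sketch (the dispersion function is again a stand-in):

```python
import numpy as np

def select_group_local(minority_sample, majority, m, q):
    """Score q groups of m distance-adjacent majority samples; keep the best."""
    d = np.linalg.norm(majority - minority_sample, axis=1)
    order = np.argsort(d)                    # the majority sample sequence
    best_score, best = -np.inf, None
    for start in range(min(q, len(majority) - m + 1)):
        idx = order[start:start + m]         # m mutually adjacent samples
        L = float(np.mean(np.std(majority[idx], axis=0)))  # stand-in for L
        score = L / d[idx].sum()             # S_m = L / D
        if score > best_score:
            best_score, best = score, idx
    return best

maj = np.arange(20, dtype=float).reshape(10, 2)
best = select_group_local(np.array([0., 0.]), maj, m=3, q=4)
```

This trades the exhaustive search's optimality for O(q) scored groups per minority sample.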
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
In a second aspect, a data processing apparatus is provided, including: an acquisition module, configured to acquire a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N; a down-sampling module, configured to determine, according to the dimensional features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M; a training module, configured to train a classification model according to the minority class samples and the down-sampled majority class samples; and a processing module, configured to process data according to the classification model.
According to one possible implementation, the down-sampling module is further configured to: for any one minority class sample, sample a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determine a dispersion difference degree L between the m majority class samples contained in each majority class sample group; determine a distance D between the minority class sample and the m majority class samples contained in each majority class sample group; determine a difference degree S_m = L / D for each majority class sample group according to the distance D and the dispersion difference degree L; and, according to the difference degree S_m, determine one of the candidate majority class sample groups as the m majority class samples discretely distributed around the minority class sample.
According to one possible implementation, the down-sampling module is further configured to: the dimensional features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of the n_s dimensions, determine a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determine a dispersion difference degree L_f; and determine the dispersion difference degree L according to the dispersion difference degree L_s and/or the dispersion difference degree L_f.
According to one possible implementation, the down-sampling module is further configured to: according to the numerical features of the n_s dimensions of the M majority class samples, divide the value interval of each of the n_s dimensions into a plurality of cells; determine the distribution of the m majority class samples contained in each majority class sample group among the cells; determine, according to the distribution, the dispersion degree of the m majority class samples in each of the n_s dimensions; and synthesize the dispersion degrees over the n_s dimensions to obtain the dispersion difference degree L_s of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_s of the m majority class samples by a formula (rendered as an image in the original publication) in which n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the remaining term denotes the number of the m majority class samples falling in each divided cell.
According to one possible implementation, the down-sampling module is further configured to: for each of the n_f dimensions of the descriptive features, determine the number of distinct elements among the m majority class samples contained in each majority class sample group; determine, according to the number of distinct elements, the dispersion degree of the m majority class samples in each of the n_f dimensions; and synthesize the dispersion degrees over the n_f dimensions to obtain the dispersion difference degree L_f of the m majority class samples.
According to one possible implementation, the down-sampling module is further configured to: determine the dispersion difference degree L_f of the m majority class samples by a formula (rendered as an image in the original publication) in which n_f is the dimension of the descriptive features and the remaining term denotes the number of distinct elements in the set formed by the descriptive features of the m majority class samples in the same dimension.
According to one possible implementation, determining the difference degree of each majority class sample group according to the distance and the dispersion difference degree comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the difference degree of each majority class sample group as a weighted ratio of the dispersion difference degree L to the sum of the Euclidean distances.
According to one possible embodiment, the down-sampling module is further configured to: for each minority class sample, select any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determine the difference degree S_m of the C(M, m) majority class sample groups; and select the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the down-sampling module is further configured to: for any one minority class sample, calculate the distances between the M majority class samples and the minority class sample according to the numerical features; sort the M majority class samples by distance to obtain a majority class sample sequence; select q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence; determine the difference degree S_m of the q majority class sample groups; and select the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the apparatus further calculates the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
In a third aspect, a data processing apparatus is provided, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform: the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed by a multicore processor, causes the multicore processor to perform the method of the first aspect.
The embodiments of the application adopt at least one technical scheme that can achieve the following beneficial effects: all information of the minority class samples acquired from abnormal data is retained; down-sampling is performed on the majority class samples acquired from normal data; and, by using the information of each dimension of the samples, the down-sampled majority class samples are discretely distributed around the minority class samples, so that discriminative features can be better learned when training a classification data processing model (such as an information recommendation or image processing model).
It should be understood that the above description is only an overview of the technical solutions of the present invention, so as to clearly understand the technical means of the present invention, and thus can be implemented according to the content of the description. In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will be apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like elements throughout. In the drawings:
FIG. 1 is a flow chart illustrating a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 3 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the present invention, it is to be understood that terms such as "including" or "having," or the like, are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility of the presence of one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Those skilled in the art will appreciate that the described application scenario is only one example in which an embodiment of the present invention may be implemented. The scope of applicability of the embodiments of the present invention is not limited in any way. Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic flow chart diagram of a data processing method 100 for implementing optimized sampling of training samples according to an embodiment of the present application, in which, from a device perspective, an execution subject may be one or more electronic devices; from the program perspective, the execution main body may accordingly be a program loaded on these electronic devices.
As shown in fig. 1, the method 100 may include:
Step 101: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is greater than N;
Step 102: according to the dimensional features of the minority class samples and the majority class samples, determining m majority class samples that are discretely distributed around each minority class sample, so as to down-sample the majority class samples, wherein m is a positive integer smaller than M;
Step 103: training the classification model according to the minority class samples and the down-sampled majority class samples.
Step 104: processing the data according to the classification model.
The classification model can be any of various classification models, such as a classified information recommendation model, an image classification model, or a transaction data analysis model. In an unbalanced training sample set, the numbers of training samples in different classes differ greatly, with the number of majority class samples far exceeding the number of minority class samples.
For example, in a training sample of an image classification model for lesion recognition, the number of samples of a majority class collected from normal data (such as image data of a healthy organ) is much larger than the number of samples of a minority class collected from abnormal data (such as image data of a diseased organ). For another example, in the training samples of the transaction data analysis model, the number of most types of samples obtained from normal data (such as normal transaction data) is much larger than the number of few types of samples obtained from abnormal data (such as fraudulent transaction data). Each sample contains a plurality of dimensional features such as numerical data, date data, category data, textual description data, and the like.
The present application classifies sample features into two categories: (1) numerical features, which have specific numerical values, can be used to quantify distances between samples after standardization, and can be used directly for model training after sampling; and (2) descriptive features, which are difficult to process numerically but can distinguish categories, such as date data, category data, and text description data.
In this embodiment, all minority class samples are retained, each dimensional feature of the samples is fully utilized, and for each minority class sample the m majority class samples that discretely surround it are selected, so that discriminative features can be better learned during model training. Specifically, the m majority class samples picked for each minority class sample should have the following characteristics: (1) they are as close as possible to the minority class sample, maintaining a certain similarity with it; (2) at the same distance from the minority class sample, they are as dispersed as possible, so as to retain more sample information; and (3) sample dispersion and sample distance are considered jointly to ensure stable sampling.
In a possible implementation, the step 102 may further include:
Step 201: for any one minority class sample, sampling a plurality of candidate majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples, and m is a positive integer smaller than M.
In a possible implementation manner, based on the globally optimal idea, step 201 may further include: for each minority class sample, selecting any m majority class samples from all M majority class samples as a majority class sample group, to obtain C(M, m) majority class sample groups; determining the difference degree S_m of the C(M, m) majority class sample groups; and selecting the group with the maximum difference degree as the sampling result corresponding to the minority class sample.
In fact, m majority class samples are sampled for each minority class sample, yielding at most N x m majority class samples; since the same majority class sample may be selected by different minority class samples, the number of majority class samples finally obtained lies in the interval [m, N x m].
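The [m, N x m] bound follows because the final down-sampled set is the union of the N per-minority-sample selections; a toy illustration (the index values are hypothetical):

```python
def union_selected(selections):
    """selections: N index tuples, each of length m; return the deduplicated union."""
    return sorted({i for sel in selections for i in sel})

# N = 3 minority samples, m = 3 picks each; overlapping picks shrink the union
picked = union_selected([(0, 2, 5), (2, 5, 7), (0, 7, 9)])
```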
In one possible embodiment, the majority class sample X_i includes the numerical features (x_i^(1), ..., x_i^(n_s)) of n_s dimensions, and the minority class sample Y_j includes the numerical features (y_j^(1), ..., y_j^(n_s)) of n_s dimensions.
Based on the locally optimal idea, step 201 may further include: for any one minority class sample Y_j, calculating the distances between the M majority class samples and Y_j according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence (X_j1, X_j2, ..., X_jM); and selecting q majority class sample groups from the sequence, wherein each group comprises m majority class samples that are mutually adjacent in the sequence. For example, the first group is (X_j1, X_j2, ..., X_jm), the second group is (X_j2, X_j3, ..., X_j(m+1)), and so on. Further, the difference degree S_m of the q majority class sample groups can be determined, and the group with the maximum difference degree can be selected as the m majority class samples discretely distributed around the minority class sample.
In one possible embodiment, the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j may be calculated by the following formula:

d_ij = sqrt( sum_{t=1}^{n_s} ( x_i^(t) - y_j^(t) )^2 )

wherein the majority class sample X_i includes the numerical features (x_i^(1), x_i^(2), ..., x_i^(n_s)) of n_s dimensions, the minority class sample Y_j includes the numerical features (y_j^(1), y_j^(2), ..., y_j^(n_s)) of n_s dimensions, i = 1, 2, ..., M, and j = 1, 2, ..., N.
Step 202: determining the dispersion difference degree L between the m majority class samples included in each majority class sample group.
In one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions, and step 202 may further include: for the numerical features of the n_s dimensions, determining a dispersion difference degree L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of the n_f dimensions, determining a dispersion difference degree L_f between the m majority class samples contained in each majority class sample group; and determining the dispersion difference degree L according to L_s and/or L_f, for example L = alpha_1 * L_s + alpha_2 * L_f, where the weights alpha_1 and alpha_2 can be obtained from historical data.
In this embodiment, the discrete difference degree L between the m majority samples contained in each group of majority samples may be obtained from the numerical features, the descriptive features, or a combination of the two, and the choice may be adjusted according to the actual situation.
In one possible embodiment, determining the discrete difference degree L_s for the n_s-dimensional numerical features may further include: according to the n_s-dimensional numerical features of the M majority samples, dividing the value interval of each of the n_s dimensions into a plurality of cells; determining how the m majority samples contained in each group of majority samples are distributed among the cells; determining, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions; and combining the degrees of dispersion over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
For example, for a minority sample Y_j, let the m majority samples of the group whose discrete difference degree is currently being calculated be (X_j1, X_j2, ..., X_jm), with numerical features:

(x_j1^(1), x_j1^(2), ..., x_j1^(n_s));
(x_j2^(1), x_j2^(2), ..., x_j2^(n_s));
…;
(x_jm^(1), x_jm^(2), ..., x_jm^(n_s)).
In fact, each dimension of the numerical features is a numerical value, and after normalization each dimension has a value interval. Considering the numerical features of all M majority samples, the value interval of each dimension is divided into a number of cells, and the distribution of the m sampled majority samples over these cells is observed to judge the degree of dispersion of the samples. Taking the first dimension of the numerical features as an example, its value interval is divided into k_1 cells. The first-dimension numerical features of the m majority samples, x_j1^(1), x_j2^(1), ..., x_jm^(1), fall into the k_1 cells with counts m_1^(1), m_2^(1), ..., m_k1^(1) respectively, satisfying:

m_1^(1) + m_2^(1) + ... + m_k1^(1) = m.

The degree of dispersion of (m_1^(1), ..., m_k1^(1)) over the k_1 cells can then be measured, for example, as the fraction of occupied cells:

l_s^(1) = |{ i : m_i^(1) > 0 }| / k_1.

The value of l_s^(1) lies in (0, 1]; when the samples are uniformly distributed over the k_1 cells, i.e. m_i^(1) = m/k_1 for every i, l_s^(1) takes its maximum value 1.
For the minority sample Y_j, the degree of dispersion of the numerical features of the m majority samples is then defined by averaging over the n_s dimensions:

L_s = (1/n_s) · ( l_s^(1) + l_s^(2) + ... + l_s^(n_s) ).

It can be seen that L_s takes values in (0, 1]; the more discrete the numerical features of the majority samples, the closer L_s is to 1.
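The exact per-dimension formula appears only as an equation image in the original publication; the sketch below assumes the fraction-of-occupied-cells measure described in the surrounding text (values in (0, 1], maximal when the m samples spread over all k_t cells, e.g. under a uniform distribution). The function names and the equal-width cell split are illustrative assumptions.

```python
def dimension_dispersion(values, lo, hi, k):
    """Per-dimension dispersion l_s^(t): split the value interval [lo, hi]
    into k equal cells and return the fraction of occupied cells.
    The result lies in (0, 1] and reaches 1 when every cell is occupied."""
    counts = [0] * k
    width = (hi - lo) / k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp v == hi into the last cell
        counts[idx] += 1
    return sum(1 for c in counts if c > 0) / k

def numerical_dispersion(samples, bounds, cells):
    """L_s: average of the per-dimension dispersions over the n_s dimensions."""
    n_s = len(bounds)
    return sum(
        dimension_dispersion([s[t] for s in samples], bounds[t][0], bounds[t][1], cells[t])
        for t in range(n_s)
    ) / n_s

# m = 4 one-dimensional samples spread over all 4 cells of [0, 1] -> L_s = 1.0
print(numerical_dispersion([[0.1], [0.3], [0.6], [0.9]], [(0.0, 1.0)], [4]))
# the same 4 samples bunched into one cell -> L_s = 0.25
print(numerical_dispersion([[0.01], [0.05], [0.12], [0.2]], [(0.0, 1.0)], [4]))
```

Any dispersion measure with the stated properties (range (0, 1], maximal for a uniform spread), such as a normalized entropy of the cell counts, could be substituted without changing the rest of the procedure.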
In one possible embodiment, determining the discrete difference degree L_f for the n_f-dimensional descriptive features may further include: for each of the n_f descriptive-feature dimensions, determining the number of distinct elements among the m majority samples contained in each group of majority samples; determining, according to the number of distinct elements, the degree of dispersion of the m majority samples in each of the n_f dimensions; and combining the degrees of dispersion over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
For example, for a minority sample Y_j, let the m majority samples of the group whose discrete difference degree is currently being calculated be (X_j1, X_j2, ..., X_jm), with descriptive features:

(c_j1^(1), c_j1^(2), ..., c_j1^(n_f));
(c_j2^(1), c_j2^(2), ..., c_j2^(n_f));
…;
(c_jm^(1), c_jm^(2), ..., c_jm^(n_f)).
In fact, each dimension of the descriptive information reflects an attribute of the sample, and for the descriptive information of a given dimension, the more discrete the attribute distribution of the samples, the more information the samples contain.
For example, let U{x_1, x_2, ..., x_n} denote the number of distinct elements in the set {x_1, x_2, ..., x_n}. The degree of dispersion of the m majority samples in the first dimension of the descriptive features can then be determined, for example, using the following formula:

l_f^(1) = U{ c_j1^(1), c_j2^(1), ..., c_jm^(1) } / m.

The value of l_f^(1) lies in (0, 1]; the more discrete the descriptive features of the first dimension, the closer l_f^(1) is to 1.
For the minority sample Y_j, the degree of dispersion of the descriptive features of the m majority samples is then defined by averaging over the n_f dimensions:

L_f = (1/n_f) · ( l_f^(1) + l_f^(2) + ... + l_f^(n_f) ).

As can be seen, L_f takes values in (0, 1]; the more discrete the descriptive information of the majority samples, the closer L_f is to 1.
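The distinct-element measure above translates directly into a short sketch. Tuples of strings stand in for the n_f-dimensional descriptive features (card number, merchant number, etc.); the function name is an illustrative assumption.

```python
def descriptive_dispersion(samples):
    """L_f: for each of the n_f descriptive dimensions, count the distinct
    elements U{c_j1^(t), ..., c_jm^(t)} among the m samples, divide by m,
    and average over the dimensions.  Values lie in (0, 1]; a column whose
    elements are all distinct contributes 1."""
    m = len(samples)
    n_f = len(samples[0])
    return sum(len({s[t] for s in samples}) / m for t in range(n_f)) / n_f

# m = 4 samples with (card number, merchant number) descriptive features:
# 4 distinct cards (4/4 = 1.0) and 2 distinct merchants (2/4 = 0.5)
rows = [("c1", "m1"), ("c2", "m1"), ("c3", "m2"), ("c4", "m2")]
print(descriptive_dispersion(rows))  # 0.75
```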
Step 203, determining the distance D between any one minority sample and the m majority samples included in each group of majority samples.
For example, consider the m majority samples (X_j(i+1), X_j(i+2), ..., X_j(i+m)) selected from the majority sample sequence (X_j1, X_j2, ..., X_jM). The distance between the minority sample Y_j and all m selected majority samples is determined as the weighted sum:

D = β · [ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ]

where β is a weight parameter.
Next, step 204 is executed to determine the difference degree of each group of majority samples according to the distance D and the discrete difference degree L: S_m = L/D.
In a possible implementation, step 204 may further include: determining, according to the numerical features, the sum of the Euclidean distances between any one minority sample and the m majority samples in each group of majority samples; and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of Euclidean distances.
Next, step 205 is executed: according to the difference degree S_m, one of the multiple groups of majority samples is determined as the m majority samples discretely distributed around the minority sample. For example, the group with the largest difference degree S_m is selected for down-sampling.
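Steps 201 through 205 for a single minority sample can be sketched end to end. This is a minimal illustration, assuming list-of-list feature vectors and a caller-supplied dispersion function; the patent combines L_s and L_f into L, while here a toy spread measure stands in for L.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pick_discrete_group(minority, majority, m, dispersion, beta=1.0):
    """Sort the majority samples by distance to the minority sample, slide
    a window of m adjacent samples over the sequence, score each window by
    S_m = L / D (L from the caller's dispersion function, D = beta times
    the sum of distances), and return the window with the largest S_m."""
    seq = sorted(majority, key=lambda x: euclid(x, minority))
    best_group, best_score = None, -1.0
    for start in range(len(seq) - m + 1):
        group = seq[start:start + m]
        L = dispersion(group)
        D = beta * sum(euclid(minority, x) for x in group)
        if L / D > best_score:
            best_group, best_score = group, L / D
    return best_group

# Toy run with 1-D samples; the stand-in dispersion is the spread
# (max - min) of the window, which favours widely scattered groups.
majority = [[1.0], [1.1], [2.0], [5.0], [9.0]]
chosen = pick_discrete_group([0.0], majority, m=3,
                             dispersion=lambda g: g[-1][0] - g[0][0])
print(chosen)
```

In a full implementation this function would be called once per minority sample, with L computed as α_1·L_s + α_2·L_f from the numerical and descriptive dispersion measures of the embodiment.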
In one exemplary embodiment, among transactions, normal transactions form the majority and may be recorded as majority samples, while fraudulent transactions are very rare and may be recorded as minority samples. All samples mainly contain two types of information: (1) descriptive features: card number, merchant number, date, terminal number, etc.; (2) numerical features: feature data obtained from the transaction message and historical transaction information. The number of majority samples is recorded as M and the number of minority samples as N, where M is far greater than N.
Based on this, the majority samples are recorded as

X_i = ( x_i^(1), ..., x_i^(n_s), c_i^(1), ..., c_i^(n_f) ), i = 1, 2, ..., M,

and the minority samples as

Y_j = ( y_j^(1), ..., y_j^(n_s), c_j^(1), ..., c_j^(n_f) ), j = 1, 2, ..., N,

where (x_i^(1), ..., x_i^(n_s)) and (y_j^(1), ..., y_j^(n_s)) are the normalized n_s-dimensional numerical feature vectors, and (c_i^(1), ..., c_i^(n_f)) and (c_j^(1), ..., c_j^(n_f)) are the n_f-dimensional descriptive feature vectors. Each dimension of a sample's numerical features is a numerical value computed from the transaction message, including features such as the amount, the average transaction amount, the transaction period interval, and the number of transactions, as well as some combined numerical features. The Euclidean distance between any one majority sample and any one minority sample can then be calculated using the following formula:

d_ij = sqrt( Σ_{t=1..n_s} ( x_i^(t) − y_j^(t) )² ).
further, a majority of sample diversity functions may be constructed using the numerical and descriptive characteristics of the samples.
For the numerical features of the samples, take the amount, the average transaction amount, the transaction period interval, and the number of transactions as examples. For a minority sample Y_j, the degree of dispersion of the numerical features of the m majority samples taken from the majority sample sequence X_j1, X_j2, X_j3, ..., X_jM is calculated as follows. (1) Let amt_ji denote the transaction amount of a sample; its value interval is divided into k_1 cells, with m_1^(1), ..., m_k1^(1) samples falling into the cells. (2) Let avg_amt_ji denote the average transaction amount of a sample; its value interval is divided into k_2 cells, with counts m_1^(2), ..., m_k2^(2). (3) Let t_ji denote the transaction period interval of a sample; its value interval is divided into k_3 cells, with counts m_1^(3), ..., m_k3^(3). (4) Let nt_ji denote the number of transactions of a sample; its value interval is divided into k_4 cells, with counts m_1^(4), ..., m_k4^(4). That is, the numerical features of sample X_ji are expressed as:

( amt_ji, avg_amt_ji, t_ji, nt_ji ).
Then, for the numerical features, the degree of dispersion of the majority samples can be expressed as:

L_s = (1/4) · ( l_s^(1) + l_s^(2) + l_s^(3) + l_s^(4) )

where l_s^(t) denotes the degree of dispersion of the m samples over the k_t cells of the t-th numerical dimension.
For the descriptive features of the samples, take the card number, the merchant number, and the transaction date as examples. For a minority sample Y_j, the degree of dispersion of the descriptive features of the m majority samples in X_j1, X_j2, X_j3, ..., X_jM is calculated. Let C_ji denote the card number information of sample X_ji, M_ji the merchant information of sample X_ji, and T_ji the time information of sample X_ji; that is, the descriptive features of sample X_ji are expressed as:

( C_ji, M_ji, T_ji ).
The degree of dispersion of the descriptive features of the majority samples can then be expressed as:

L_f = (1/3) · ( U{C_j1, ..., C_jm} + U{M_j1, ..., M_jm} + U{T_j1, ..., T_jm} ) / m.
Combining the numerical features and the descriptive features, a degree-of-dispersion function of the majority samples is constructed:

L = α_1·L_s + α_2·L_f.
Further, consider the distance between the m majority samples selected from X_j1, X_j2, X_j3, ..., X_jM and the minority sample Y_j: D = β·[ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ], i.e. β times the sum of the distances between all m selected majority samples and Y_j.
In summary, based on the weight combination, the difference degree function may be:

S_m = ( α_1·L_s + α_2·L_f ) / ( β·[ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ] ).

Optionally, the weight parameters may be taken as α_1 = α_2 = β = 1 to simplify the calculation, i.e. the difference degree function becomes:

S_m = ( L_s + L_f ) / [ d(Y_j, X_j(i+1)) + ... + d(Y_j, X_j(i+m)) ].
Optionally, for a minority sample Y_j, the M majority samples can be sorted from near to far by Euclidean distance as X_j1, X_j2, X_j3, ..., X_jM; further, groups of m majority samples are selected from this sequence: the first group is X_j1, X_j2, ..., X_jm; the second group is X_j2, X_j3, ..., X_j(m+1); and so on, yielding multiple groups of majority samples.
The difference degree of the m majority samples in each group can then be calculated separately, and the group with the largest difference degree S_m is selected as the final m majority samples.
In fact, for all N minority samples, the total number of majority samples sampled from the M majority samples lies between m and N×m. According to experience in counterfeit-card model training, a majority-to-minority sample ratio of 1000:1 is generally selected, that is, 1000 majority samples are sampled for each minority sample. Finally, model training is performed using the sampled majority samples and all minority samples.
In another exemplary embodiment, the technical effect of the data processing method of the present application is described by example. For a financial institution, 90 days of data from the magnetic-stripe-card online environment in April through July 2019 yield an average of 175,538 transactions per day and 422 fraudulent transactions in total, a majority/minority sample ratio of 37,437:1. The majority samples are down-sampled and the model trained according to the scheme of the above embodiment. All majority samples may first be down-sampled at a 50:1 sampling ratio, sampling 3,510 transactions per day. Specifically, assuming each sample has 500-dimensional numerical features including 21 dimensions of descriptive information, 21 numerical feature dimensions (e.g. 2 merchant statistical features, 3 card-dimension statistical features, 6 real-time statistical features, and 10 historical statistical features) are selected in consideration of factors such as the saturation and interval distribution of each dimension, and 4 combined dimensions are constructed from 2 dimensions of descriptive information (e.g. merchant number and total transaction amount), to perform the down-sampling process of the above embodiment.
(1) Sample information saturation: referring to Table 1, of the 21 dimensional features obtained by the down-sampling scheme of the present application, 19 have higher saturation than under the random sampling scheme, at most 40.7% higher and with 7 of them more than 30% higher, while two have saturation slightly lower than under the random sampling scheme. Of all 500 dimensions, 436 have higher saturation than random sampling.
Table 1: (the table is reproduced only as an image in the original publication)
(2) Sample feature segment distribution: referring to Table 2, of the 21 dimensional features obtained by the down-sampling scheme of the present application, 20 have segment-distribution differences smaller than those of the random sampling scheme, reduced by at most 37.5%, with 4 of them reduced by more than 30%; 1 has a segment-distribution difference slightly larger than that of the random sampling scheme.
Table 2: (the table is reproduced only as an image in the original publication)
(3) Descriptive information combination ratio: the assumed descriptive information combinations are: merchant number + total transaction amount within 5 minutes, merchant number + total transaction amount within 15 minutes, merchant number + total transaction amount within 120 minutes, and merchant number + total transaction amount within 1 day. Referring to Table 3, the degree of dispersion of the descriptive information sampled by the down-sampling scheme of the present application is greater than that of the random sampling scheme.
Table 3:
index (I) Down sampling scheme Random sampling scheme Difference of difference
Merchant number +5min total transaction amount 76.5% 74.8% 1.7%
Merchant number +15min total transaction amount 78.7% 76.1% 2.6%
Merchant number +120min total transaction amount 81.8% 79.0% 2.8%
(4) Model training effect: the data of April, May, and June 2019 (383 minority samples) are fed into an xgboost model with the following parameters: max_depth: 3; eta: 0.01; min_child_weight: 6; gamma: 0.1; lambda: 10; subsample: 0.8.
With the same parameters as above, see Table 4; the performance of the models on the training set is as follows:
Table 4:

Scheme | train-auc | val-auc | trees
Down-sampling scheme | 0.941778 | 0.891961 | 568
Random sampling scheme | 0.936219 | 0.891842 | 515
Referring to Table 5, the extrapolated validation effect on the July data (38 minority samples) is:
Table 5:

Recall rate | Accuracy rate (down-sampling scheme) | Accuracy rate (random sampling scheme)
5.1% | 28.6% | 33.3%
10.3% | 40.0% | 19.0%
15.4% | 10.0% | 14.3%
Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus, configured to execute the data processing method provided in any of the above embodiments. Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 includes:
an obtaining module 301, configured to obtain a training sample set, where the training sample set includes M majority samples obtained from normal data and N minority samples obtained from abnormal data, M and N are positive integers, and M is greater than N;
a down-sampling module 302, configured to determine m majority samples discretely distributed around each minority sample according to the dimensional features of the minority samples and the majority samples, so as to down-sample the majority samples, where m is smaller than M and is a positive integer;
and the training module 303 is configured to train the classification model according to the minority class samples and the down-sampled majority class samples.
And the processing module 304 is used for processing the data according to the classification model.
According to one possible implementation, the down-sampling module 302 is further configured to: sample multiple groups of majority samples from the M majority samples for any one minority sample, wherein each group of majority samples contains m majority samples; determine the discrete difference degree L between the m majority samples contained in each group of majority samples; determine the distance D between the minority sample and the m majority samples contained in each group of majority samples; determine the difference degree S_m = L/D of each group of majority samples according to the distance D and the discrete difference degree L; and determine, according to the difference degree S_m, one of the multiple groups of majority samples as the m majority samples discretely distributed around the minority sample.
According to one possible implementation, where the majority samples and the minority samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions, the down-sampling module 302 is further configured to: determine, for the n_s-dimensional numerical features, the discrete difference degree L_s between the m majority samples contained in each group of majority samples; and/or determine, for the n_f-dimensional descriptive features, the discrete difference degree L_f between the m majority samples contained in each group of majority samples; and determine the discrete difference degree L between the m majority samples contained in each group of majority samples according to L_s and/or L_f.
According to one possible implementation, the down-sampling module 302 is further configured to: according to the n_s-dimensional numerical features of the M majority samples, divide the value interval of each of the n_s dimensions into a plurality of cells; determine how the m majority samples contained in each group of majority samples are distributed among the cells; determine, according to the distribution, the degree of dispersion of the m majority samples in each of the n_s dimensions; and combine the degrees of dispersion over the n_s dimensions to obtain the discrete difference degree L_s of the m majority samples.
According to one possible implementation, the down-sampling module 302 is further configured to determine the discrete difference degree L_s of the m majority samples using the following formula:

L_s = (1/n_s) · Σ_{t=1..n_s} |{ i : m_i^(t) > 0 }| / k_t

where n_s is the dimension of the numerical features, k_t is the number of cells into which dimension t of the n_s dimensions is divided, and m_i^(t) is the number of the m majority samples falling in the i-th divided cell of dimension t.
According to one possible implementation, the down-sampling module 302 is further configured to: for each of the n_f descriptive-feature dimensions, determine the number of distinct elements among the m majority samples contained in each group of majority samples; determine, according to the number of distinct elements, the degree of dispersion of the m majority samples in each of the n_f dimensions; and combine the degrees of dispersion over the n_f dimensions to obtain the discrete difference degree L_f of the m majority samples.
According to one possible implementation, the down-sampling module 302 is further configured to determine the discrete difference degree L_f of the m majority samples using the following formula:

L_f = (1/n_f) · Σ_{t=1..n_f} U{ c_j1^(t), c_j2^(t), ..., c_jm^(t) } / m

where n_f is the dimension of the descriptive features, and U{ c_j1^(t), ..., c_jm^(t) } denotes the number of distinct elements in the set { c_j1^(t), ..., c_jm^(t) }, i.e. the set of t-th-dimension descriptive features of the m majority samples.
According to one possible implementation, determining the difference degree of each group of majority samples according to the distance and the discrete difference degree includes: determining, according to the numerical features, the sum of the Euclidean distances between any one minority sample and the m majority samples in each group of majority samples; and determining the difference degree of each group of majority samples as a weighted ratio of the discrete difference degree L to the sum of Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for each minority sample, select any m majority samples from all M majority samples as a group of majority samples, obtaining C(M, m) groups of majority samples; determine the difference degree S_m of each of the C(M, m) groups; and select the group with the largest difference degree as the sampling result corresponding to the minority sample.
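The exhaustive variant above, which scores all C(M, m) candidate groups, can be sketched with `itertools.combinations`. The difference-degree function is supplied by the caller; the toy function below is an illustrative stand-in for S_m = L/D, not the patent's exact formula.

```python
from itertools import combinations

def pick_group_exhaustive(minority, majority, m, difference):
    """Enumerate all C(M, m) groups of m majority samples and return the
    group with the largest difference degree S_m, as computed by the
    caller-supplied difference(minority, group) function."""
    return max(combinations(majority, m),
               key=lambda group: difference(minority, group))

# Toy difference degree for 1-D samples: spread of the group divided by
# its total distance to the minority sample (a stand-in for S_m = L / D).
def toy_difference(y, group):
    vals = sorted(g[0] for g in group)
    return (vals[-1] - vals[0]) / sum(abs(g[0] - y[0]) for g in group)

majority = [[1.0], [1.2], [4.0], [8.0]]
best = pick_group_exhaustive([0.0], majority, 2, toy_difference)
print(sorted(best))  # [[1.0], [8.0]]
```

Because C(M, m) grows combinatorially, this exhaustive search is practical only for small M; the sorted-sequence embodiment, which restricts the candidates to q windows of adjacent samples, is the scalable alternative.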
According to one possible embodiment, the majority samples and the minority samples comprise n_s-dimensional numerical features, and the apparatus is further configured to: for any one minority sample, calculate the distance between each of the M majority samples and the minority sample according to the numerical features; sort the M majority samples by distance to obtain a majority sample sequence; select q groups of majority samples from the majority sample sequence, where each group contains m majority samples that are adjacent to each other in the sequence; determine the difference degree S_m of each of the q groups; and select the group with the largest difference degree as the sampling result corresponding to the minority sample.
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j using the following formula:

d_ij = sqrt( Σ_{t=1..n_s} ( x_i^(t) − y_j^(t) )² )

where the majority sample X_i includes the n_s-dimensional numerical features (x_i^(1), ..., x_i^(n_s)), the minority sample Y_j includes the n_s-dimensional numerical features (y_j^(1), ..., y_j^(n_s)), i = 1, 2, ..., M, and j = 1, 2, ..., N.
It should be noted that the data processing apparatus in the embodiment of the present application may implement each process of the foregoing embodiment of the data processing method, and achieve the same effect and function, which is not described herein again.
Fig. 4 is a data processing apparatus according to an embodiment of the present application, configured to execute the data processing method shown in fig. 1. The apparatus includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the data processing method shown in the above embodiments.
According to some embodiments of the present application, there is provided a non-volatile computer storage medium for a data processing method, having stored thereon computer-executable instructions configured to, when executed by a processor, perform the data processing method illustrated in the above embodiments.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device, and computer-readable storage medium embodiments, the description is simplified because they are substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for their relevance.
The apparatus, the device, and the computer-readable storage medium provided in the embodiment of the present application correspond to the method one to one, and therefore, the apparatus, the device, and the computer-readable storage medium also have advantageous technical effects similar to those of the corresponding method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (24)

1. A data processing method, comprising:
acquiring a training sample set, wherein the training sample set comprises M majority samples acquired according to normal data and N minority samples acquired according to abnormal data, M and N are positive integers, and M is greater than N;
determining, according to the dimensional features of the minority samples and the majority samples, m majority samples discretely distributed around each minority sample, so as to down-sample the majority samples, wherein m is smaller than M and is a positive integer;
training a classification model according to the minority samples and the majority samples after down-sampling;
and processing the data according to the classification model.
2. The method of claim 1, wherein determining m majority samples discretely distributed around each minority sample further comprises:
sampling multiple groups of majority samples from the M majority samples for any one of the minority samples, wherein each group of majority samples comprises m majority samples;

determining a discrete difference degree L between the m majority samples contained in each group of majority samples;

determining a distance D between the any one minority sample and the m majority samples contained in each group of majority samples;

determining a difference degree S_m = L/D of each group of majority samples according to the distance D and the discrete difference degree L;

determining, according to the difference degree S_m, one of the multiple groups of majority samples as the m majority samples discretely distributed around the any one minority sample.
3. The method of claim 2, wherein determining the degree of discrete difference L between the m majority samples included in each set of majority samples comprises:
each dimensional feature of the majority class samples and the minority class samples comprises numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, determining a degree of discrete difference L_s between the m majority samples contained in each group of majority samples; and/or,
for the descriptive features of the n_f dimensions, determining a degree of discrete difference L_f between the m majority samples contained in each group of majority samples; and,
determining the degree of discrete difference L between the m majority samples contained in each group of majority samples according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
4. The method of claim 3, wherein, for the numerical features of the n_s dimensions, determining the degree of discrete difference L_s between the m majority samples contained in each group of majority samples further comprises:
dividing, on each of the n_s dimensions, the value interval into a plurality of cells according to the numerical features of the n_s dimensions of the M majority samples;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, a degree of dispersion of the m majority samples in each of the n_s dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority samples.
5. The method of claim 4, further comprising:
determining the degree of discrete difference L_s of the m majority samples by using the following formula:
[formula image FDA0002607573490000021 in the original document]
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the quantity denoted by image FDA0002607573490000022 is the number of the m majority samples falling in each divided cell.
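A hedged sketch of the cell-based dispersion of claims 4 and 5 (the patent's exact formula appears only as an image in the source, so the cell-occupancy ratio below is an illustrative stand-in, not the claimed formula):

```python
def cell_dispersion(group, all_majority, k):
    # For each of the n_s numeric dimensions, split the value range observed
    # over all M majority samples into k equal cells (claim 4), score the
    # group's dispersion in that dimension as the fraction of distinct cells
    # its members occupy, and average over the dimensions (claim 5).
    n_s = len(all_majority[0])
    score = 0.0
    for t in range(n_s):
        lo = min(x[t] for x in all_majority)
        hi = max(x[t] for x in all_majority)
        width = (hi - lo) / k or 1.0  # guard against a degenerate dimension
        cells = {min(int((x[t] - lo) / width), k - 1) for x in group}
        score += len(cells) / k
    return score / n_s
```

A group whose members fall into many different cells scores higher, matching the claims' preference for majority samples that are discretely distributed.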
6. The method of claim 3, wherein, for the descriptive features of the n_f dimensions, determining the degree of discrete difference L_f between the m majority samples contained in each group of majority samples further comprises:
for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority samples contained in each group of majority samples;
determining, according to the number of distinct elements, a degree of dispersion of the m majority samples in each of the n_f dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority samples.
7. The method of claim 6, further comprising:
determining the degree of discrete difference L_f of the m majority samples by using the following formula:
[formula image FDA0002607573490000031 in the original document]
wherein n_f is the dimension of the descriptive features; the term denoted by image FDA0002607573490000032 represents the number of distinct elements in the set denoted by image FDA0002607573490000033; and the set denoted by image FDA0002607573490000034 represents the set of descriptive features of the m majority samples in the same dimension.
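The descriptive-feature dispersion of claims 6 and 7 counts distinct values per categorical dimension; since the claimed combining formula is only an image in the source, the sketch below averages the distinct-value counts as an assumed combination rule:

```python
def descriptive_dispersion(group, n_f):
    # For each of the n_f descriptive (categorical) dimensions, count how
    # many distinct values the m samples in the group take, then average
    # the counts over the dimensions.
    total = 0
    for t in range(n_f):
        total += len({sample[t] for sample in group})
    return total / n_f
```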
8. The method of claim 5, wherein determining the degree of difference of each group of majority samples according to the distance and the degree of discrete difference comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples; and
determining the degree of difference of each group of majority samples as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
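Claim 8's difference degree is a weighted ratio of the dispersion L to the summed Euclidean distances; a minimal sketch (the weight `w` is an assumption, as the claim does not fix its value):

```python
def weighted_difference(L, distances, w=1.0):
    # Claim 8: S = w * L / (sum of Euclidean distances from the minority
    # sample to the m majority samples of the group).
    return w * L / sum(distances)
```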
9. The method of claim 2, further comprising:
for each minority sample, selecting any m majority samples from all the M majority samples as a group of majority samples, to obtain C(M, m) groups of majority samples (the binomial coefficient, shown as image FDA0002607573490000035 in the original document); and
determining the degrees of difference S_m of the C(M, m) groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to that minority sample.
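Claim 9's exhaustive sampling enumerates every way of choosing m of the M majority samples, i.e. C(M, m) candidate groups; a quick check of that count:

```python
import itertools
import math

def enumerate_groups(M, m):
    # Each combination of m indices out of M is one candidate majority
    # sample group; itertools.combinations yields all C(M, m) of them.
    return list(itertools.combinations(range(M), m))

groups = enumerate_groups(6, 3)
```

For realistic M this count grows combinatorially, which is why claim 10 offers a cheaper distance-sorted alternative.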
10. The method of claim 2, wherein the majority class samples and the minority class samples comprise numerical features of n_s dimensions, the method further comprising:
for any one minority sample, calculating the distances between the M majority samples and the any one minority sample according to the numerical features;
sorting the M majority samples according to the distances to obtain a majority sample sequence;
selecting q groups of majority samples from the majority sample sequence, wherein each group of majority samples comprises m majority samples that are adjacent to each other in the majority sample sequence; and
determining the degrees of difference S_m of the q groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to the any one minority sample.
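Claim 10's cheaper candidate generation can be sketched as follows (starting the q windows from the nearest samples is an assumption; the claim only requires the m samples of each group to be adjacent in the sorted sequence):

```python
import math

def window_candidates(majority, y, m, q):
    # Sort the M majority samples by Euclidean distance to the minority
    # sample y, then take q groups of m samples that are adjacent in the
    # sorted sequence; their S_m values are compared afterwards.
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    order = sorted(majority, key=dist)
    return [order[i:i + m] for i in range(q)]
```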
11. The method of claim 8 or 10, further comprising:
calculating the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:
d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)^2 )
wherein the majority sample X_i comprises the n_s-dimensional numerical features (x_i^1, x_i^2, ..., x_i^{n_s}), and the minority sample Y_j comprises the n_s-dimensional numerical features (y_j^1, y_j^2, ..., y_j^{n_s}).
12. A data processing apparatus, comprising:
an obtaining module, configured to obtain a training sample set, where the training sample set includes M majority samples obtained according to normal data and N minority samples obtained according to abnormal data, where M, N is a positive integer, and M is greater than N;
a down-sampling module, configured to determine m majority samples discretely distributed around each minority sample according to each dimensional feature of the minority samples and the majority samples, so as to down-sample the majority samples, wherein m is smaller than M and is a positive integer;
the training module is used for training a classification model according to the minority class samples and the majority class samples after down-sampling;
and the processing module is used for processing the data according to the classification model.
13. The apparatus of claim 12, wherein the downsampling module is further configured to:
for any one minority sample, sampling a plurality of groups of majority samples from the M majority samples, wherein each group of majority samples comprises m majority samples;
determining a degree of discrete difference L between the m majority samples contained in each group of majority samples;
determining a distance D between the any one minority sample and the m majority samples contained in each group of majority samples;
determining a degree of difference S_m = L/D for each group of majority samples according to the distance D and the degree of discrete difference L; and
according to the degree of difference S_m, determining one of the plurality of groups of majority samples as the m majority samples discretely distributed around the any one minority sample.
14. The apparatus of claim 13, wherein the downsampling module is further configured to:
each dimensional feature of the majority class samples and the minority class samples comprises numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, determining a degree of discrete difference L_s between the m majority samples contained in each group of majority samples; and/or,
for the descriptive features of the n_f dimensions, determining a degree of discrete difference L_f between the m majority samples contained in each group of majority samples; and,
determining the degree of discrete difference L between the m majority samples contained in each group of majority samples according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
15. The apparatus of claim 14, wherein the downsampling module is further configured to:
dividing, on each of the n_s dimensions, the value interval into a plurality of cells according to the numerical features of the n_s dimensions of the M majority samples;
determining the distribution of the m majority samples contained in each group of majority samples among the plurality of cells;
determining, according to the distribution, a degree of dispersion of the m majority samples in each of the n_s dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority samples.
16. The apparatus of claim 15, wherein the downsampling module is further configured to:
determining the degree of discrete difference L_s of the m majority samples by using the following formula:
[formula image FDA0002607573490000061 in the original document]
wherein n_s is the dimension of the numerical features, k_t is the number of cells into which the t-th of the n_s dimensions is divided, and the quantity denoted by image FDA0002607573490000062 is the number of the m majority samples falling in each divided cell.
17. The apparatus of claim 14, wherein the downsampling module is further configured to:
for each of the n_f dimensions of the descriptive features, determining the number of distinct elements among the m majority samples contained in each group of majority samples;
determining, according to the number of distinct elements, a degree of dispersion of the m majority samples in each of the n_f dimensions; and
synthesizing the degrees of dispersion of the m majority samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority samples.
18. The apparatus of claim 17, wherein the downsampling module is further configured to:
determining the degree of discrete difference L_f of the m majority samples by using the following formula:
[formula image FDA0002607573490000063 in the original document]
wherein n_f is the dimension of the descriptive features; the term denoted by image FDA0002607573490000064 represents the number of distinct elements in the set denoted by image FDA0002607573490000065; and the set denoted by image FDA0002607573490000066 represents the set of descriptive features of the m majority samples in the same dimension.
19. The apparatus of claim 13, wherein determining the degree of difference of each group of majority samples according to the distance and the degree of discrete difference comprises:
determining, according to the numerical features, the sum of the Euclidean distances between the any one minority sample and the m majority samples in each group of majority samples; and
determining the degree of difference of each group of majority samples as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
20. The apparatus of claim 13, further comprising:
for each minority sample, selecting any m majority samples from all the M majority samples as a group of majority samples, to obtain C(M, m) groups of majority samples (the binomial coefficient, shown as image FDA0002607573490000071 in the original document); and
determining the degrees of difference S_m of the C(M, m) groups of majority samples, and selecting the group of majority samples with the largest degree of difference as the sampling result corresponding to that minority sample.
21. The apparatus of claim 13, wherein the majority class samples and the minority class samples comprise numerical features of n_s dimensions, the apparatus further configured to:
for any one minority sample, calculate the distances between the M majority samples and the any one minority sample according to the numerical features;
sort the M majority samples according to the distances to obtain a majority sample sequence;
select q groups of majority samples from the majority sample sequence, wherein each group of majority samples comprises m majority samples that are adjacent to each other in the majority sample sequence; and
determine the degrees of difference S_m of the q groups of majority samples, and select the group of majority samples with the largest degree of difference as the sampling result corresponding to the any one minority sample.
22. The apparatus of claim 19 or 21, further configured to:
calculate the Euclidean distance d_ij between any one majority sample X_i and any one minority sample Y_j by using the following formula:
d_ij = sqrt( Σ_{t=1}^{n_s} (x_i^t − y_j^t)^2 )
wherein the majority sample X_i comprises the n_s-dimensional numerical features (x_i^1, x_i^2, ..., x_i^{n_s}), and the minority sample Y_j comprises the n_s-dimensional numerical features (y_j^1, y_j^2, ..., y_j^{n_s}).
23. A data processing apparatus, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed, cause the at least one processor to perform the method of any one of claims 1-11.
24. A computer-readable storage medium storing a program that, when executed by a multi-core processor, causes the multi-core processor to perform the method of any of claims 1-11.
CN202010743665.7A 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium Active CN112001425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010743665.7A CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112001425A true CN112001425A (en) 2020-11-27
CN112001425B CN112001425B (en) 2024-05-03

Family

ID=73464171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010743665.7A Active CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112001425B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066540A (en) * 2021-03-19 2021-07-02 新疆大学 Method for preprocessing non-equilibrium fault sample of oil-immersed transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032450A1 (en) * 2012-07-30 2014-01-30 Choudur Lakshminarayan Classifying unclassified samples
CN103645249A (en) * 2013-11-27 2014-03-19 国网黑龙江省电力有限公司 Online fault detection method for reduced set-based downsampling unbalance SVM (Support Vector Machine) transformer
US20150088791A1 (en) * 2013-09-24 2015-03-26 International Business Machines Corporation Generating data from imbalanced training data sets
US20170372222A1 (en) * 2016-06-24 2017-12-28 Varvara Kollia Technologies for detection of minority events
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109978009A (en) * 2019-02-27 2019-07-05 广州杰赛科技股份有限公司 Behavior classification method, device and storage medium based on wearable intelligent equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI-CHAO LIN et al.: "Clustering-based undersampling in class-imbalanced data", Information Sciences, 31 October 2017, pages 17-26 *
YU Yanli; JIANG Kaizhong; WANG Ke; SHENG Jingwen: "An undersampling algorithm for imbalanced data based on improved K-means clustering" (改进K均值聚类的不平衡数据欠采样算法), Software Guide (软件导刊), no. 06, 15 June 2020, pages 211-215 *


Also Published As

Publication number Publication date
CN112001425B (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant