CN112001425B - Data processing method, device and computer readable storage medium

Data processing method, device and computer readable storage medium

Info

Publication number
CN112001425B
CN112001425B (application CN202010743665.7A)
Authority
CN
China
Prior art keywords
majority
samples
sample
class
minority
Prior art date
Legal status
Active
Application number
CN202010743665.7A
Other languages
Chinese (zh)
Other versions
CN112001425A (en
Inventor
马振伟
邹勇
林芃
孙浩然
肖鹰东
Current Assignee
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202010743665.7A
Publication of CN112001425A
Application granted
Publication of CN112001425B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes

Abstract

The invention provides a data processing method, device, system and computer readable storage medium. The method comprises: acquiring a training sample set, wherein the training sample set comprises M majority class samples and N minority class samples, M and N are positive integers, and M is larger than N; determining, according to the dimension features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to downsample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the downsampled majority class samples; and processing data according to the classification model. With this method, all information of the minority class samples is retained, and the downsampled majority class samples are discretely distributed around the minority class samples, so that distinguishing features are better preserved and a classification model with a more accurate classification effect can be trained.

Description

Data processing method, device and computer readable storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method, a data processing device and a computer readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In machine learning modeling there are many unbalanced data sets, i.e., data sets in which the sample proportions of different classes differ greatly; for example, in classification information recommendation, image processing and transaction data analysis models, the proportion of abnormal samples may be only one in ten thousand, or even one in a hundred thousand.
Two methods are most common for processing unbalanced data: oversampling and undersampling. The former retains all majority class samples and randomly resamples the minority class samples with replacement; the latter retains all minority class samples and randomly samples part of the majority class samples without replacement; in both cases the aim is that the final sample classes are no longer unbalanced. Such random sampling, however, loses sample information, so the model learns features with little distinguishing power, which degrades the model and in turn the accuracy of the data processing.
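For contrast, the two random approaches may be sketched as follows (a minimal Python/NumPy illustration; the array layout and function names are illustrative only, not part of the invention):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(minority: np.ndarray, target: int) -> np.ndarray:
    """All majority samples are kept elsewhere; minority samples are
    re-drawn WITH replacement until there are `target` of them."""
    return minority[rng.integers(0, len(minority), size=target)]

def random_undersample(majority: np.ndarray, target: int) -> np.ndarray:
    """All minority samples are kept elsewhere; `target` majority samples
    are drawn WITHOUT replacement and the rest are discarded."""
    return majority[rng.choice(len(majority), size=target, replace=False)]
```

Neither function looks at the feature values at all, which is precisely the information loss described above.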
Disclosure of Invention
To solve the above problems in the prior art, a data processing method, a data processing apparatus and a computer readable storage medium are provided, by which the problems described above can be solved.
The present invention provides the following.
In a first aspect, a data processing method is provided, including: acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is larger than N; determining, according to the dimension features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to downsample the majority class samples, wherein m is a positive integer smaller than M; training a classification model according to the minority class samples and the downsampled majority class samples; and processing data according to the classification model.
According to one possible embodiment, determining m majority class samples that are discretely distributed around each minority class sample further comprises: for any one minority class sample, sampling multiple combinations of majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determining the degree of discrete difference L between the m majority class samples contained in each majority class sample group; determining the sum D of the distances between the minority class sample and the m majority class samples contained in each majority class sample group; determining the degree of difference S_m = L/D of each majority class sample group according to the distance sum D and the degree of discrete difference L; and determining one of the multiple majority class sample groups according to the degree of difference S_m as the m majority class samples discretely distributed around the minority class sample.
According to one possible embodiment, determining the degree of discrete difference L between the m majority class samples contained in each majority class sample group comprises: the dimension features of the majority class samples and the minority class samples comprise numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of n_s dimensions, determining a degree of discrete difference L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of n_f dimensions, determining a degree of discrete difference L_f between the m majority class samples contained in each majority class sample group; and determining the degree of discrete difference L between the m majority class samples contained in each majority class sample group according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
According to one possible embodiment, for the numerical features of n_s dimensions, determining the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group further includes: for each of the n_s dimensions, dividing the value interval into a plurality of cells according to the numerical features of the M majority class samples in that dimension; determining the distribution of the m majority class samples contained in each majority class sample group over the plurality of cells; determining the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and combining the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the degree of discrete difference L_s of the m majority class samples using, for example, a formula of the following form:

$$L_s=\frac{1}{n_s}\sum_{t=1}^{n_s}\left(-\sum_{i=1}^{k_t}\frac{m_i^{(t)}}{m}\log_{k_t}\frac{m_i^{(t)}}{m}\right)$$

where n_s is the dimension of the numerical features, k_t is the number of cells divided for the t-th of the n_s dimensions, and m_i^{(t)} is the number of the m majority class samples falling in the i-th divided cell of that dimension (cells containing no samples contribute zero).
According to one possible embodiment, for the descriptive features of n_f dimensions, determining the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group further includes: for each dimension of the n_f-dimensional descriptive features, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and combining the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority class samples.
According to one possible embodiment, the method further comprises: determining the degree of discrete difference L_f of the m majority class samples using, for example, a formula of the following form:

$$L_f=\frac{1}{n_f}\sum_{t=1}^{n_f}\frac{U_m\{x_{1t}^f,x_{2t}^f,\dots,x_{mt}^f\}}{m}$$

where n_f is the dimension of the descriptive features, U_m{·} denotes the number of distinct elements in a set, and the set {x_1t^f, x_2t^f, ..., x_mt^f} collects the descriptive feature of the m majority class samples in the same dimension t.
According to one possible embodiment, determining the degree of difference of each majority class sample group according to the distance sum and the degree of discrete difference comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
According to one possible embodiment, the method further comprises: for each minority class sample, selecting any m majority class samples from all M majority class samples as one majority class sample group, obtaining C_M^m majority class sample groups; determining the degree of difference S_m of the C_M^m majority class sample groups, and selecting the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the method further comprises: for any one minority class sample, calculating the distances between the M majority class samples and the minority class sample according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence; selecting q majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples that are adjacent to each other in the majority class sample sequence; and determining the degree of difference S_m of the q majority class sample groups, and selecting the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the method further comprises: calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

$$d_{ij}=\sqrt{\sum_{t=1}^{n_s}\left(x_{it}-y_{jt}\right)^2}$$

where the majority class sample X_i includes the n_s-dimensional numerical features X_i^s = (x_i1, x_i2, ..., x_i n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features Y_j^s = (y_j1, y_j2, ..., y_j n_s).
In a second aspect, a data processing apparatus is provided, comprising: an acquisition module for acquiring a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is larger than N; a downsampling module for determining, according to the dimension features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to downsample the majority class samples, wherein m is a positive integer smaller than M; a training module for training a classification model according to the minority class samples and the downsampled majority class samples; and a processing module for processing data according to the classification model.
According to one possible implementation, the downsampling module is further configured to: for any one minority class sample, sample multiple combinations of majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determine the degree of discrete difference L between the m majority class samples contained in each majority class sample group; determine the sum D of the distances between the minority class sample and the m majority class samples contained in each majority class sample group; determine the degree of difference S_m = L/D of each majority class sample group according to the distance sum D and the degree of discrete difference L; and determine one of the multiple majority class sample groups according to the degree of difference S_m as the m majority class samples discretely distributed around the minority class sample.
According to one possible implementation, the downsampling module is further configured to: the majority class samples and the minority class samples include numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of n_s dimensions, determine the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of n_f dimensions, determine the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group; and determine the degree of discrete difference L between the m majority class samples contained in each majority class sample group according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
According to one possible implementation, the downsampling module is further configured to: for each of the n_s dimensions, divide the value interval into a plurality of cells according to the numerical features of the M majority class samples in that dimension; determine the distribution of the m majority class samples contained in each majority class sample group over the plurality of cells; determine the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and combine the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority class samples.
According to one possible implementation, the downsampling module is further configured to determine the degree of discrete difference L_s of the m majority class samples using, for example, a formula of the following form:

$$L_s=\frac{1}{n_s}\sum_{t=1}^{n_s}\left(-\sum_{i=1}^{k_t}\frac{m_i^{(t)}}{m}\log_{k_t}\frac{m_i^{(t)}}{m}\right)$$

where n_s is the dimension of the numerical features, k_t is the number of cells divided for the t-th of the n_s dimensions, and m_i^{(t)} is the number of the m majority class samples falling in the i-th divided cell of that dimension (cells containing no samples contribute zero).
According to one possible implementation, the downsampling module is further configured to: for each dimension of the n_f-dimensional descriptive features, determine the number of distinct elements among the m majority class samples contained in each majority class sample group; determine the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and combine the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority class samples.
According to one possible implementation, the downsampling module is further configured to determine the degree of discrete difference L_f of the m majority class samples using, for example, a formula of the following form:

$$L_f=\frac{1}{n_f}\sum_{t=1}^{n_f}\frac{U_m\{x_{1t}^f,x_{2t}^f,\dots,x_{mt}^f\}}{m}$$

where n_f is the dimension of the descriptive features, U_m{·} denotes the number of distinct elements in a set, and the set {x_1t^f, x_2t^f, ..., x_mt^f} collects the descriptive feature of the m majority class samples in the same dimension t.
According to one possible embodiment, determining the degree of difference of each majority class sample group according to the distance sum and the degree of discrete difference comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for each minority class sample, select any m majority class samples from all M majority class samples as one majority class sample group, obtaining C_M^m majority class sample groups; determine the degree of difference S_m of the C_M^m majority class sample groups, and select the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the apparatus is further configured to: for any one minority class sample, calculate the distances between the M majority class samples and the minority class sample according to the numerical features; sort the M majority class samples by distance to obtain a majority class sample sequence; select q majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples that are adjacent to each other in the majority class sample sequence; and determine the degree of difference S_m of the q majority class sample groups, and select the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the apparatus is further configured to calculate the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

$$d_{ij}=\sqrt{\sum_{t=1}^{n_s}\left(x_{it}-y_{jt}\right)^2}$$

where the majority class sample X_i includes the n_s-dimensional numerical features X_i^s = (x_i1, x_i2, ..., x_i n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features Y_j^s = (y_j1, y_j2, ..., y_j n_s).
In a third aspect, a data processing apparatus is provided, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed by a multi-core processor, causes the multi-core processor to perform a method as in the first aspect.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects: in this embodiment, all information of the minority class samples acquired from abnormal data is retained, the majority class samples acquired from normal data are downsampled, and the dimension information of the samples is used so that the downsampled majority class samples are discretely distributed around the minority class samples; distinguishing features can therefore be better learned when training a model for classification data processing (such as information recommendation or image processing).
It should be understood that the foregoing description is only an overview of the technical solutions of the present invention, so that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the specification. The following specific embodiments of the present invention are described in order to make the above and other objects, features and advantages of the present invention more comprehensible.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will become apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of a data processing method according to another embodiment of the invention;
FIG. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to still another embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the present invention, it should be understood that terms such as "comprises" or "comprising," etc., are intended to indicate the presence of features, numbers, steps, acts, components, portions, or combinations thereof disclosed in the specification, and are not intended to exclude the possibility of the presence of one or more other features, numbers, steps, acts, components, portions, or combinations thereof.
In addition, it should be noted that, without conflict, the embodiments of the present invention and the features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Those skilled in the art will appreciate that the application scenario described is but one example in which embodiments of the present invention may be implemented. The application scope of the embodiments of the present invention is not limited in any way. Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
FIG. 1 is a flow chart of a data processing method 100 for optimized sampling of training samples according to an embodiment of the present application. From a device perspective, the execution subject may be one or more electronic devices; from a program perspective, the execution subject may accordingly be a program installed on these electronic devices.
As shown in fig. 1, the method 100 may include:
Step 101: acquire a training sample set, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is larger than N.

Step 102: determine, according to the dimension features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to downsample the majority class samples, wherein m is a positive integer smaller than M.

Step 103: train a classification model according to the minority class samples and the downsampled majority class samples.

Step 104: process data according to the classification model.
The classification model can be any of various classification models, such as a classification information recommendation model, an image classification model or a transaction data analysis model. In an unbalanced training sample set, the numbers of training samples of different categories differ greatly, the majority class samples far outnumbering the minority class samples.
For example, in the training samples of an image classification model for lesion recognition, the number of majority class samples acquired from normal data (such as image data of healthy organs) is much larger than the number of minority class samples acquired from abnormal data (such as image data of diseased organs). For another example, in the training samples of a transaction data analysis model, the number of majority class samples acquired from normal data (e.g., normal transaction data) is much greater than that of minority class samples acquired from abnormal data (e.g., fraudulent transaction data). Each sample contains features in multiple dimensions, such as numerical data, date data, category data and textual description data.
The application divides the sample features into two categories: (1) numerical features, which after normalization can be used to quantify the distance between samples and, after sampling, for model training; (2) descriptive features, data that is difficult to quantify numerically but has a class-distinguishing effect, such as date data, category data and text description data.
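For the illustrative sketches in the remainder of this description, a sample carrying both feature types may be represented, for example, by the following hypothetical structure (the invention does not prescribe any particular representation):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Sample:
    numeric: Tuple[float, ...]     # n_s normalized numerical features
    descriptive: Tuple[str, ...]   # n_f descriptive features (date, category, text, ...)
```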
In this embodiment, all minority class samples are retained, each dimension feature of the samples is fully used, and for each minority class sample, m majority class samples that are discretely distributed around it are selected, so that distinguishing features can be better learned during model training. Specifically, the m majority class samples selected for each minority class sample should have the following characteristics: (1) they are as close as possible to the minority class sample, keeping a certain similarity with it; (2) at the same distance from the minority class sample, they are as dispersed as possible, so as to retain more sample information; (3) sample dispersion and sample distance are combined to ensure sampling stability.
In one possible implementation, step 102 may further include:

Step 201: for any one minority class sample, sample multiple combinations of majority class sample groups from the M majority class samples, where each majority class sample group includes m majority class samples, m being a positive integer less than M.
In one possible implementation based on the idea of global optimization, step 201 may further include: for each minority class sample, selecting any m majority class samples from all M majority class samples as one majority class sample group, obtaining C_M^m majority class sample groups; determining the degree of difference S_m of the C_M^m majority class sample groups, and selecting the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample (see the sketch below).
In practice, m majority class samples are sampled for each of the N minority class samples, giving N×m selections in total; since the same majority class sample may be selected for different minority class samples, the number of distinct majority class samples finally obtained lies between [m, N×m].
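This globally optimal variant may be sketched as follows (Python; `score` stands for a function computing the difference degree S_m of a candidate group, one version of which is sketched after step 205 below):

```python
from itertools import combinations

def best_group_global(minority, majority, m, score):
    """Enumerate all C(M, m) majority class sample groups and keep the one
    with the largest difference degree S_m.  Only feasible for small M and
    m, since C(M, m) grows combinatorially."""
    return max(combinations(majority, m),
               key=lambda group: score(minority, group))
```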
In one possible embodiment, the majority class sample X_i includes the n_s-dimensional numerical features X_i^s = (x_i1, ..., x_i n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features Y_j^s = (y_j1, ..., y_j n_s). Based on the idea of local optimization, step 201 may further include: for any one minority class sample Y_j, calculating the distances between the M majority class samples and Y_j according to the numerical features; sorting the M majority class samples by distance to obtain a majority class sample sequence (X_j1, X_j2, ..., X_jM); and selecting q majority class sample groups from this sequence, where each group comprises m majority class samples that are adjacent in the sequence. For example, group 1 is selected as (X_j1, X_j2, ..., X_jm), group 2 as (X_j2, X_j3, ..., X_jm+1), and so on. The degree of difference S_m of the q majority class sample groups can then be determined, and the group with the largest degree of difference is selected as the m majority class samples discretely distributed around the minority class sample; this candidate generation is sketched after the distance formula below.
In one possible implementation, the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j can be calculated using the following formula:

$$d_{ij}=\sqrt{\sum_{t=1}^{n_s}\left(x_{it}-y_{jt}\right)^2}$$

where the majority class sample X_i includes the n_s-dimensional numerical features X_i^s = (x_i1, x_i2, ..., x_i n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features Y_j^s = (y_j1, y_j2, ..., y_j n_s).
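Building on this distance, the locally optimal candidate generation described above may be sketched as follows (using the `Sample` structure sketched earlier; function names are illustrative):

```python
import math

def euclidean(x, y):
    """d_ij over the n_s normalized numerical features of two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x.numeric, y.numeric)))

def candidate_windows(minority, majority, m, q):
    """Sort the majority class samples by distance to the minority sample,
    then take q groups of m samples adjacent in that order: group 1 is
    (X_j1..X_jm), group 2 is (X_j2..X_jm+1), and so on."""
    ordered = sorted(majority, key=lambda x: euclidean(x, minority))
    q = min(q, len(ordered) - m + 1)
    return [ordered[i:i + m] for i in range(q)]
```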
Step 202: determine the degree of discrete difference L between the m majority class samples contained in each majority class sample group.
In one possible implementation, the majority class samples and the minority class samples include numerical features of n_s dimensions and/or descriptive features of n_f dimensions, and step 202 may further include:

for the numerical features of n_s dimensions, determining the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of n_f dimensions, determining the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group; and determining the degree of discrete difference L between the m majority class samples contained in each majority class sample group from L_s and/or L_f, for example as L = α_1·L_s + α_2·L_f, where the weights α_1 and α_2 can be obtained from historical data.
In this embodiment, the degree of discrete difference L between the m majority class samples contained in each majority class sample group may be obtained from the numerical features, from the descriptive features, or from a combination of both, adjusted according to the actual situation.
In one possible implementation, determining the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group for the numerical features of n_s dimensions may further include: for each of the n_s dimensions, dividing the value interval into a plurality of cells according to the numerical features of the M majority class samples in that dimension; determining the distribution of the m majority class samples contained in each majority class sample group over the plurality of cells; determining the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and combining the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority class samples.
For example, for the minority class sample Y_j, let the m majority class samples of the majority class sample group whose degree of discrete difference is currently being calculated be (X_j1, X_j2, ..., X_jm), with numerical features X_j1^s, X_j2^s, ..., X_jm^s, where X_ji^s = (x_ji,1, x_ji,2, ..., x_ji,n_s).
In practice, each dimension of a numerical feature is a numerical value that, after normalization, lies in a value interval. Considering the numerical features of all M majority class samples, the value interval of each dimension is divided into a number of cells, and the distribution of the sampled m majority class samples over these cells is observed to judge their degree of dispersion. Taking the first dimension of the numerical features as an example, its value interval over all M majority class samples is divided into k_1 cells. The numbers of the m majority class samples whose first-dimension values fall into the k_1 cells are m_1, m_2, ..., m_{k_1} respectively, so that m_1 + m_2 + ... + m_{k_1} = m. Considering the dispersion of these counts over the k_1 cells, the degree of dispersion l_1 of the m majority class samples in the first dimension can, for example, be determined using the following equation:

$$l_1=-\sum_{i=1}^{k_1}\frac{m_i}{m}\log_{k_1}\frac{m_i}{m}$$

where cells containing no samples contribute zero. The value of l_1 lies in (0, 1]; when the m samples are uniformly distributed across the k_1 cells, i.e., m_1 = m_2 = ... = m_{k_1} = m/k_1, l_1 takes its maximum value 1.
For the minority class sample Y_j, the degree of dispersion of the numerical features of the sampled m majority class samples is then defined, with l_t the degree of dispersion in the t-th dimension, as:

$$L_s=\frac{1}{n_s}\sum_{t=1}^{n_s}l_t$$

It can be seen that the value of L_s lies in (0, 1]; the more discrete the numerical features of the majority class samples, the closer the value of L_s is to 1.
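Assuming the example normalized-entropy form of l_t given above, the computation of L_s may be sketched as follows (Python/NumPy; the per-dimension cell edges are assumed precomputed over all M majority class samples):

```python
import numpy as np

def dimension_dispersion(values, edges):
    """l_t for one dimension: normalized entropy of the counts of the m
    values over the k_t cells bounded by `edges` (k_t >= 2 assumed);
    equals 1 exactly when the counts are uniform."""
    counts, _ = np.histogram(values, bins=edges)
    p = counts[counts > 0] / len(values)
    k = len(edges) - 1
    return float(-(p * np.log(p) / np.log(k)).sum())

def dispersion_numeric(group, all_edges):
    """L_s: average of l_t over the n_s numerical dimensions."""
    n_s = len(group[0].numeric)
    return sum(
        dimension_dispersion([s.numeric[t] for s in group], all_edges[t])
        for t in range(n_s)
    ) / n_s
```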
In one possible implementation, determining the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group for the descriptive features of n_f dimensions may further include: for each dimension of the n_f-dimensional descriptive features, determining the number of distinct elements among the m majority class samples contained in each majority class sample group; determining the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and combining the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority class samples.
For example, for the minority class sample Y_j, let the m majority class samples of the majority class sample group whose degree of discrete difference is currently being calculated be (X_j1, X_j2, ..., X_jm), with descriptive features X_j1^f, X_j2^f, ..., X_jm^f, where X_ji^f = (x_ji,1^f, x_ji,2^f, ..., x_ji,n_f^f).
In fact, each dimension of the descriptive information reflects an attribute of the sample; for the descriptive information of a given dimension, the more discrete the attribute distribution of the samples, the more information the samples contain.
For example, using U_n{x_1, x_2, ..., x_n} to represent the number of distinct elements in the set {x_1, x_2, ..., x_n}, the degree of dispersion l_1^f of the m majority class samples in the first dimension of the descriptive features can be determined using the following equation:

$$l_1^f=\frac{U_m\{x_{j1,1}^f,x_{j2,1}^f,\dots,x_{jm,1}^f\}}{m}$$

The value of l_1^f lies in (0, 1]; the more discrete the descriptive features of the first dimension, the closer the value of l_1^f is to 1.
For the minority class sample Y_j, the degree of dispersion of the descriptive features of the sampled m majority class samples is then defined as:

$$L_f=\frac{1}{n_f}\sum_{t=1}^{n_f}l_t^f$$

It can be seen that the value of L_f lies in (0, 1]; the more discrete the descriptive information of the majority class samples, the closer the value of L_f is to 1.
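The computation of L_f may be sketched accordingly; the per-dimension ratio U_m{...}/m and the averaging over dimensions follow the form given above:

```python
def dispersion_descriptive(group):
    """L_f: for each of the n_f descriptive dimensions, the share of
    distinct elements among the m samples, averaged over the dimensions."""
    m = len(group)
    n_f = len(group[0].descriptive)
    return sum(
        len({s.descriptive[t] for s in group}) / m for t in range(n_f)
    ) / n_f
```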
Step 203: determine the sum D of the distances between the minority class sample and the m majority class samples contained in each majority class sample group.
For example, considering the distances between the m majority class samples (X_ji+1, X_ji+2, ..., X_ji+m) selected from the majority class sample sequence (X_j1, X_j2, ..., X_jM) and the minority class sample Y_j, the sum of the distances between all of the selected m majority class samples and Y_j is determined as:

D = β × [d(Y_j, X_ji+1) + ... + d(Y_j, X_ji+m)], where β is a weight parameter.
Next, step 204 is performed: the degree of difference S_m = L/D of each majority class sample group is determined according to the distance sum D and the degree of discrete difference L.
In one possible implementation, step 204 may further include: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
Next, step 205 is performed: one of the multiple majority class sample groups is determined according to the degree of difference S_m as the m majority class samples discretely distributed around the minority class sample. For example, the majority class sample group with the largest degree of difference S_m is selected as the downsampling result.
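Steps 202 to 205 may then be illustrated by combining the helpers sketched earlier (`euclidean`, `candidate_windows`, `dispersion_numeric`, `dispersion_descriptive`); the weights default to 1, matching the simplified difference function given later:

```python
def difference_degree(minority, group, all_edges, a1=1.0, a2=1.0, beta=1.0):
    """S_m = L / D, with L = a1*L_s + a2*L_f (step 202) and D = beta times
    the sum of Euclidean distances to the minority sample (step 203);
    D > 0 is assumed."""
    L = a1 * dispersion_numeric(group, all_edges) \
        + a2 * dispersion_descriptive(group)
    D = beta * sum(euclidean(x, minority) for x in group)
    return L / D

def best_group_local(minority, majority, m, q, all_edges):
    """Step 205: score the q sliding-window candidates and keep the group
    with the largest difference degree S_m."""
    return max(candidate_windows(minority, majority, m, q),
               key=lambda g: difference_degree(minority, g, all_edges))
```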
In one exemplary embodiment, in transaction data, normal transactions are the majority and may be taken as majority class samples, while fraudulent transactions are the minority and may be taken as minority class samples. All samples mainly contain two types of information: (1) descriptive features: card number, merchant number, date, terminal number, etc.; (2) numerical features: feature data derived from transaction messages and historical transaction information. The number of majority class samples is denoted M and the number of minority class samples N, where M is far greater than N.
On this basis, denote the majority class samples as X_i = (X_i^s, X_i^f), where i = 1, 2, ..., M, and the minority class samples as Y_j = (Y_j^s, Y_j^f), where j = 1, 2, ..., N. Here X_i^s and Y_j^s are the numerical features, n_s-dimensional numerical vectors that have all been normalized, while X_i^f and Y_j^f are the descriptive features, n_f-dimensional vectors.
Each dimension of a sample's numerical features is a numerical value computed from the transaction messages, including the amount, the average transaction amount, the transaction period interval, the number of transactions and the like, as well as some numerical features of combined structure. The Euclidean distance between any one majority class sample and any one minority class sample can then be calculated using the following formula:

$$d_{ij}=\sqrt{\sum_{t=1}^{n_s}\left(x_{it}-y_{jt}\right)^2}$$

Further, a majority class sample difference function can be constructed using both the numerical and the descriptive features of the samples.
For the numerical features of the samples, take the amount, the average transaction amount, the transaction period interval and the number of transactions as an example. For the minority class sample Y_j, the degree of dispersion of the numerical features of m majority class samples in the majority class sample sequence (X_j1, X_j2, X_j3, ..., X_jM) is calculated as follows: (1) using amt_ji to represent the transaction amount of a sample, the value interval is divided into k_1 cells and the number of the m samples falling into each cell is counted; (2) using avg_amt_ji to represent the average transaction amount of a sample, the value interval is divided into k_2 cells and the counts per cell are taken; (3) using t_ji to represent the transaction period interval of a sample, the value interval is divided into k_3 cells and the counts per cell are taken; (4) using nt_ji to represent the number of transactions of a sample, the value interval is divided into k_4 cells and the counts per cell are taken.

That is, the numerical features of a sample are expressed as X_ji^s = (amt_ji, avg_amt_ji, t_ji, nt_ji), and the degree of dispersion of the majority class samples with respect to the numerical features can be expressed, following the form of L_s given above, as:

$$L_s=\frac{1}{4}\left(l_{amt}+l_{avg\_amt}+l_{t}+l_{nt}\right)$$

where each l term is the per-dimension degree of dispersion over the corresponding cells.
For the descriptive features of the samples, take the card number, the merchant number and the transaction date as an example, and calculate, for the minority class sample Y_j, the degree of dispersion of the descriptive features of m majority class samples in (X_j1, X_j2, X_j3, ..., X_jM). Using C_ji to represent the card number information of sample X_ji, M_ji its merchant information and T_ji its time information, the descriptive features of sample X_ji are expressed as X_ji^f = (C_ji, M_ji, T_ji).

The degree of dispersion of the descriptive features of the majority class samples can then be expressed, following the form of L_f given above, as:

$$L_f=\frac{1}{3}\left(\frac{U_m\{C_{j1},\dots,C_{jm}\}}{m}+\frac{U_m\{M_{j1},\dots,M_{jm}\}}{m}+\frac{U_m\{T_{j1},\dots,T_{jm}\}}{m}\right)$$
Combining the numerical features and the descriptive features, the degree-of-dispersion function of the majority class samples is constructed as:

L = α_1·L_s + α_2·L_f
Further, the distances between the m majority class samples selected from (X_j1, X_j2, X_j3, ..., X_jM) and the minority class sample Y_j are considered: D = β × [d(Y_j, X_ji+1) + ... + d(Y_j, X_ji+m)], i.e., the sum of the distances between all of the selected m majority class samples and Y_j.
In summary, based on the weighted combination, the difference function may be:

$$S_m=\frac{L}{D}=\frac{\alpha_1 L_s+\alpha_2 L_f}{\beta\left[d\left(Y_j,X_{ji+1}\right)+\dots+d\left(Y_j,X_{ji+m}\right)\right]}$$

Alternatively, the weight parameters α_1, α_2 and β may all be set to 1 to simplify the calculation, i.e., the difference function becomes:

$$S_m=\frac{L_s+L_f}{d\left(Y_j,X_{ji+1}\right)+\dots+d\left(Y_j,X_{ji+m}\right)}$$
Optionally, for the minority class sample Y_j, the M majority class samples may be sorted from near to far by Euclidean distance into (X_j1, X_j2, X_j3, ..., X_jM). Further, m majority class samples are selected from this sequence at a time: the first group X_j1, X_j2, ..., X_jm; the second group X_j2, X_j3, ..., X_jm+1; and so on, giving multiple majority class sample groups.
The degree of difference of the m majority class samples contained in each majority class sample group is calculated separately, and the group with the largest degree of difference S_m is selected as the m majority class samples finally sampled.
In practice, for all N minority class samples, between [m, N×m] majority class samples are sampled from the M majority class samples. According to experience in training counterfeit-card models, a majority-to-minority sample ratio of 1000:1 is generally chosen, i.e., 1000 majority class samples are sampled for each minority class sample. Finally, the sampled majority class samples and all minority class samples are used for model training, as sketched below.
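The assembly of the final training set described in this paragraph may be sketched as follows (m would be set to 1000 here per the 1000:1 ratio; the label convention is an assumption):

```python
def build_training_set(minority, majority, m, q, all_edges):
    """Keep ALL N minority class samples; for each, sample its best m-group
    of majority class samples, then dedupe, so that between m and N*m
    majority class samples remain."""
    seen, picked = set(), []
    for y in minority:
        for x in best_group_local(y, majority, m, q, all_edges):
            if x not in seen:          # Sample is hashable (frozen dataclass)
                seen.add(x)
                picked.append(x)
    samples = list(minority) + picked
    labels = [1] * len(minority) + [0] * len(picked)   # 1 = abnormal/minority
    return samples, labels
```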
In another exemplary embodiment, the technical effects of the data processing method of the present application are illustrated. For a financial institution, some 90 days of offline magnetic stripe card transactions from April, May, June and July 2019 were obtained, with an average of 175,538 transactions per day and 422 fraudulent transactions, i.e., a majority-to-minority sample ratio of 37437:1; the majority class samples were downsampled according to the scheme of the above embodiment and the model was then trained. All majority class samples may first be downsampled at a 50:1 sampling ratio, i.e., 3,510 transactions sampled per day. Specifically, assume each sample has 500 dimensions of features, of which 21 dimensions are descriptive information. Considering factors such as the saturation and interval distribution of each dimension, 21 numerical features are selected (for example, 2 merchant statistics features, 3 card-dimension statistics features, 6 real-time statistics features and 10 historical statistics features), and 2 dimensions of descriptive information (for example, merchant number and total transaction amount) are combined into 4 dimensions of combined information, for the downsampling processing of the above embodiment.
(1) Sample information saturation: referring to Table 1, of the 21 dimensional features sampled by the downsampling scheme of the present application, 19 have a saturation greater than under the random sampling scheme, with the largest increase being 40.7% and 7 features more than 30% higher; two features have a slightly lower saturation than under random sampling. Across all 500 dimensions, 436 are more saturated than under random sampling.
Table 1 (per-dimension feature saturation, downsampling scheme vs. random sampling scheme; table data not reproduced).
(2) Sample feature segment distribution:
Referring to Table 2, of the 21 dimensional features sampled by the downsampling scheme of the present application, 20 have a segment-distribution difference smaller than under the random sampling scheme, with a maximum reduction of 37.5% and 4 features reduced by more than 30%; 1 feature has a segment-distribution difference slightly higher than under the random sampling scheme.
Table 2 (per-dimension feature segment distribution, downsampling scheme vs. random sampling scheme; table data not reproduced).
(3) Description information combination ratio:
Assume the descriptive-information combinations employed are: merchant number + total transaction amount within 5 minutes, merchant number + total transaction amount within 15 minutes, merchant number + total transaction amount within 120 minutes, and merchant number + total transaction amount within 1 day. Referring to Table 3, the degree of dispersion of the descriptive information sampled by the downsampling scheme of the present application is greater than that of the random sampling scheme.
Table 3:

Index | Downsampling scheme | Random sampling scheme | Difference
Merchant number + total transaction amount within 5 min | 76.5% | 74.8% | 1.7%
Merchant number + total transaction amount within 15 min | 78.7% | 76.1% | 2.6%
Merchant number + total transaction amount within 120 min | 81.8% | 79.0% | 2.8%
(4) Model training effect: the data of April, May and June 2019, containing 383 minority class samples, are substituted into an xgboost model with the following parameters: max_depth: 3; eta: 0.01; min_child_weight: 6; gamma: 0.1; lambda: 10; subsample: 0.8.
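A sketch of this training setup with xgboost's native API (the objective, boosting-round cap and early stopping are assumptions; the differing tree counts in Table 4 suggest the number of trees was not fixed):

```python
import xgboost as xgb

def train_fraud_model(X_train, y_train, X_val, y_val):
    params = {
        "objective": "binary:logistic",   # assumption: binary fraud label
        "eval_metric": "auc",             # matches the train-auc/val-auc columns
        "max_depth": 3, "eta": 0.01, "min_child_weight": 6,
        "gamma": 0.1, "lambda": 10, "subsample": 0.8,
    }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    return xgb.train(params, dtrain, num_boost_round=2000,
                     evals=[(dtrain, "train"), (dval, "val")],
                     early_stopping_rounds=50)
```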
With the same parameters as above, referring to Table 4, the model performs as follows on the training set:
Table 4:

Scheme | train-auc | val-auc | trees
Downsampling scheme | 0.941778 | 0.891961 | 568
Random sampling scheme | 0.936219 | 0.891842 | 515
Referring to Table 5, the extrapolation validation effect on the July 2019 data (38 minority class samples) is:
Table 5:

Recall | Precision (downsampling scheme) | Precision (random sampling scheme)
5.1% | 28.6% | 33.3%
10.3% | 40.0% | 19.0%
15.4% | 10.0% | 14.3%
Based on the same technical concept, the embodiment of the present invention further provides a data processing device, which is configured to execute the data processing method provided in any one of the above embodiments. Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 includes:
the obtaining module 301 is configured to obtain a training sample set, wherein the training sample set comprises M majority class samples obtained from normal data and N minority class samples obtained from abnormal data, M and N are positive integers, and M is greater than N;

the downsampling module 302 is configured to determine, according to the dimension features of the minority class samples and the majority class samples, m majority class samples that are discretely distributed around each minority class sample, so as to downsample the majority class samples, wherein m is a positive integer smaller than M;

the training module 303 is configured to train a classification model according to the minority class samples and the downsampled majority class samples;

and the processing module 304 is configured to process data according to the classification model.
According to one possible implementation, the downsampling module 302 is further configured to: for any one minority class sample, sample multiple combinations of majority class sample groups from the M majority class samples, wherein each majority class sample group comprises m majority class samples; determine the degree of discrete difference L between the m majority class samples contained in each majority class sample group; determine the sum D of the distances between the minority class sample and the m majority class samples contained in each majority class sample group; determine the degree of difference S_m = L/D of each majority class sample group according to the distance sum D and the degree of discrete difference L; and determine one of the multiple majority class sample groups according to the degree of difference S_m as the m majority class samples discretely distributed around the minority class sample.
According to one possible implementation, the downsampling module 302 is further configured to: the majority class samples and the minority class samples include numerical features of n_s dimensions and/or descriptive features of n_f dimensions; for the numerical features of n_s dimensions, determine the degree of discrete difference L_s between the m majority class samples contained in each majority class sample group; and/or, for the descriptive features of n_f dimensions, determine the degree of discrete difference L_f between the m majority class samples contained in each majority class sample group; and determine the degree of discrete difference L between the m majority class samples contained in each majority class sample group according to the degree of discrete difference L_s and/or the degree of discrete difference L_f.
According to one possible implementation, the downsampling module 302 is further configured to: for each of the n_s dimensions, divide the value interval into a plurality of cells according to the numerical features of the M majority class samples in that dimension; determine the distribution of the m majority class samples contained in each majority class sample group over the plurality of cells; determine the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution; and combine the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the degree of discrete difference L_s of the m majority class samples.
According to one possible implementation, the downsampling module 302 is further configured to determine the degree of discrete difference L_s of the m majority class samples using, for example, a formula of the following form:

$$L_s=\frac{1}{n_s}\sum_{t=1}^{n_s}\left(-\sum_{i=1}^{k_t}\frac{m_i^{(t)}}{m}\log_{k_t}\frac{m_i^{(t)}}{m}\right)$$

where n_s is the dimension of the numerical features, k_t is the number of cells divided for the t-th of the n_s dimensions, and m_i^{(t)} is the number of the m majority class samples falling in the i-th divided cell of that dimension (cells containing no samples contribute zero).
According to one possible implementation, the downsampling module 302 is further configured to: for each dimension of the n_f-dimensional descriptive features, determine the number of distinct elements among the m majority class samples contained in each majority class sample group; determine the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements; and combine the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the degree of discrete difference L_f of the m majority class samples.
According to one possible implementation, the downsampling module 302 is further configured to determine the degree of discrete difference L_f of the m majority class samples using, for example, a formula of the following form:

$$L_f=\frac{1}{n_f}\sum_{t=1}^{n_f}\frac{U_m\{x_{1t}^f,x_{2t}^f,\dots,x_{mt}^f\}}{m}$$

where n_f is the dimension of the descriptive features, U_m{·} denotes the number of distinct elements in a set, and the set {x_1t^f, x_2t^f, ..., x_mt^f} collects the descriptive feature of the m majority class samples in the same dimension t.
According to one possible embodiment, determining the degree of difference of each majority class sample group according to the distance sum and the degree of discrete difference comprises: determining, according to the numerical features, the sum of the Euclidean distances between the minority class sample and the m majority class samples in each majority class sample group; and determining the degree of difference of each majority class sample group as a weighted ratio of the degree of discrete difference L to the sum of the Euclidean distances.
According to one possible embodiment, the apparatus is further configured to: for each minority class sample, select any m majority class samples from all M majority class samples as one majority class sample group, obtaining C_M^m majority class sample groups; determine the degree of difference S_m of the C_M^m majority class sample groups, and select the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the majority class samples and the minority class samples comprise numerical features of n_s dimensions, and the apparatus is further configured to: for any one minority class sample, calculate the distances between the M majority class samples and the minority class sample according to the numerical features; sort the M majority class samples by distance to obtain a majority class sample sequence; select q majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples that are adjacent to each other in the majority class sample sequence; and determine the degree of difference S_m of the q majority class sample groups, and select the majority class sample group with the largest degree of difference as the sampling result corresponding to the minority class sample.
According to one possible embodiment, the apparatus is further configured to calculate the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

$$d_{ij}=\sqrt{\sum_{t=1}^{n_s}\left(x_{it}-y_{jt}\right)^2}$$

where the majority class sample X_i includes the n_s-dimensional numerical features X_i^s = (x_i1, x_i2, ..., x_i n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features Y_j^s = (y_j1, y_j2, ..., y_j n_s).
It should be noted that, the data processing apparatus in the embodiment of the present application may implement each process of the foregoing embodiment of the data processing method, and achieve the same effects and functions, which are not described herein again.
FIG. 4 shows a data processing apparatus for performing the data processing method shown in FIG. 1 according to an embodiment of the present application, the apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method described in the above embodiments.

According to some embodiments of the present application, a non-volatile computer storage medium is provided, storing computer-executable instructions which, when executed by a processor, perform the data processing method shown in the above embodiments.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for apparatus, devices and computer readable storage medium embodiments, the description thereof is simplified as it is substantially similar to the method embodiments, as relevant points may be found in part in the description of the method embodiments.
The apparatus, the device, and the computer readable storage medium provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the apparatus, the device, and the computer readable storage medium also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the apparatus, the device, and the computer readable storage medium are not repeated herein.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is merely for convenience of description and does not imply that features of those aspects cannot be usefully combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (20)

1. A method of data processing, comprising:
acquiring a training sample set in an image classification processing scene, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is larger than N;
determining m majority class samples that are discretely distributed around each minority class sample according to each dimensional feature of the minority class samples and the majority class samples, so as to downsample the majority class samples, wherein m is smaller than M and is a positive integer, and the dimensional features comprise numerical features having specific numerical values and/or descriptive features serving to distinguish classes;
training a classification model constructed in advance for the image classification processing scene according to the minority class samples and the downsampled majority class samples; and
processing data according to the classification model.
2. The method of claim 1, wherein determining m majority class samples that are discretely distributed around each minority class sample further comprises:
sampling, for any one minority class sample, a plurality of majority class sample groups from the M majority class samples by combination, wherein each majority class sample group comprises m majority class samples;
determining a discrete difference degree L between the m majority class samples contained in each sampled majority class sample group, based on the numerical features and/or descriptive features of said any one minority class sample and of the corresponding majority class sample group; and determining a sum D of the distances between said any one minority class sample and the m majority class samples included in each majority class sample group;
determining a difference degree S_m = L/D for each majority class sample group according to the distance sum D and the discrete difference degree L; and
determining one of the plurality of majority class sample groups according to the difference degree S_m, and taking that majority class sample group as the m majority class samples that are discretely distributed around said any one minority class sample.
3. The method according to claim 2, wherein determining the discrete difference degree L between the m majority class samples included in each majority class sample group comprises:
each dimensional feature of the majority class samples and the minority class samples includes numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, dividing the value interval of each of the n_s dimensions into a plurality of cells, and determining a discrete difference degree L_s between the m majority class samples contained in each majority class sample group based on the distribution of the sampled m majority class samples among the cells; and/or,
for the descriptive features of the n_f dimensions, determining, for each of the n_f dimensions, a discrete difference degree L_f between the m majority class samples contained in each majority class sample group based on the determined number of distinct elements of the m majority class samples; and
determining the discrete difference degree L between the m majority class samples contained in each majority class sample group according to the discrete difference degree L_s and/or the discrete difference degree L_f.
4. The method according to claim 3, wherein determining the discrete difference degree L_s between the m majority class samples included in each majority class sample group for the numerical features of the n_s dimensions further comprises:
dividing the value interval of each of the n_s dimensions into a plurality of cells according to the numerical features of the n_s dimensions of the M majority class samples;
determining the distribution of the m majority class samples contained in each majority class sample group among the cells;
determining the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution;
synthesizing the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the discrete difference degree L_s of the m majority class samples;
wherein the discrete difference degree L_s of the m majority class samples is determined from n_s, the dimension of the numerical features; k_t, the number of cells into which each of the n_s dimensions is divided; and the number of the m majority class samples falling within each divided cell.
5. The method according to claim 3, wherein determining the discrete difference degree L_f between the m majority class samples contained in each majority class sample group for the descriptive features of the n_f dimensions further comprises:
determining, for each dimension of the n_f-dimensional descriptive features, the number of distinct elements among the m majority class samples contained in each majority class sample group;
determining the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements;
synthesizing the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the discrete difference degree L_f of the m majority class samples;
wherein the discrete difference degree L_f of the m majority class samples is determined from n_f, the dimension of the descriptive features, and card(F_t), the number of distinct elements in the set F_t, where F_t denotes the set of descriptive feature values of the m majority class samples in the same dimension t.
6. The method of claim 4, wherein determining the difference degree of each majority class sample group according to the distance sum and the discrete difference degree comprises:
determining the sum of the Euclidean distances between said any one minority class sample and the m majority class samples in each majority class sample group according to the numerical features; and
determining the difference degree of each majority class sample group as a weighted ratio of the discrete difference degree L to the sum of the Euclidean distances.
7. The method according to claim 2, wherein the method further comprises:
for each minority class sample, selecting any m majority class samples from all M majority class samples as a group of majority class samples, to obtain C(M, m) groups of majority class samples; and
determining the difference degree S_m of the C(M, m) majority class sample groups, and selecting the majority class sample group with the largest difference degree as the sampling result corresponding to said any one minority class sample.
8. The method of claim 2, wherein the majority class samples and the minority class samples comprise n_s-dimensional numerical features, the method further comprising:
for any one minority class sample, calculating the distances between the M majority class samples and said any one minority class sample according to the numerical features;
sorting the M majority class samples by distance to obtain a majority class sample sequence;
selecting q groups of majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples, and the m majority class samples are adjacent to each other in the majority class sample sequence; and
determining the difference degree S_m of the q majority class sample groups, and selecting the majority class sample group with the largest difference degree as the sampling result corresponding to said any one minority class sample.
9. The method according to claim 6 or 8, further comprising:
calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

d_ij = √((x_i1 − y_j1)² + (x_i2 − y_j2)² + … + (x_i,n_s − y_j,n_s)²)

wherein the majority class sample X_i includes the n_s-dimensional numerical features (x_i1, x_i2, …, x_i,n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features (y_j1, y_j2, …, y_j,n_s).
10. A data processing apparatus, comprising:
an acquisition module configured to acquire a training sample set in an image classification processing scene, wherein the training sample set comprises M majority class samples acquired from normal data and N minority class samples acquired from abnormal data, M and N are positive integers, and M is larger than N;
a downsampling module configured to determine m majority class samples that are discretely distributed around each minority class sample according to each dimensional feature of the minority class samples and the majority class samples, so as to downsample the majority class samples, wherein m is smaller than M and is a positive integer, and the dimensional features comprise numerical features having specific numerical values and/or descriptive features serving to distinguish classes;
a training module configured to train a classification model constructed in advance for the image classification processing scene according to the minority class samples and the downsampled majority class samples; and
a processing module configured to process data according to the classification model.
11. The apparatus of claim 10, wherein the downsampling module is further configured to:
sample, for any one minority class sample, a plurality of majority class sample groups from the M majority class samples by combination, wherein each majority class sample group comprises m majority class samples;
determine a discrete difference degree L between the m majority class samples contained in each sampled majority class sample group, based on the numerical features and/or descriptive features of said any one minority class sample and of the corresponding majority class sample group; and determine a sum D of the distances between said any one minority class sample and the m majority class samples included in each majority class sample group;
determine a difference degree S_m = L/D for each majority class sample group according to the distance sum D and the discrete difference degree L; and
determine one of the plurality of majority class sample groups according to the difference degree S_m, and take that majority class sample group as the m majority class samples that are discretely distributed around said any one minority class sample.
12. The apparatus of claim 11, wherein the downsampling module is further configured to:
the majority class samples and the minority class samples include numerical features of n_s dimensions and/or descriptive features of n_f dimensions;
for the numerical features of the n_s dimensions, divide the value interval of each of the n_s dimensions into a plurality of cells, and determine a discrete difference degree L_s between the m majority class samples contained in each majority class sample group based on the distribution of the sampled m majority class samples among the cells; and/or,
for the descriptive features of the n_f dimensions, determine, for each of the n_f dimensions, a discrete difference degree L_f between the m majority class samples contained in each majority class sample group based on the determined number of distinct elements of the m majority class samples; and
determine the discrete difference degree L between the m majority class samples contained in each majority class sample group according to the discrete difference degree L_s and/or the discrete difference degree L_f.
13. The apparatus of claim 12, wherein the downsampling module is further configured to:
divide the value interval of each of the n_s dimensions into a plurality of cells according to the numerical features of the n_s dimensions of the M majority class samples;
determine the distribution of the m majority class samples contained in each majority class sample group among the cells;
determine the degree of dispersion of the m majority class samples in each of the n_s dimensions according to the distribution;
synthesize the degrees of dispersion of the m majority class samples over the n_s dimensions to obtain the discrete difference degree L_s of the m majority class samples;
wherein the discrete difference degree L_s of the m majority class samples is determined from n_s, the dimension of the numerical features; k_t, the number of cells into which each of the n_s dimensions is divided; and the number of the m majority class samples falling within each divided cell.
14. The apparatus of claim 12, wherein the downsampling module is further configured to:
determine, for each dimension of the n_f-dimensional descriptive features, the number of distinct elements among the m majority class samples contained in each majority class sample group;
determine the degree of dispersion of the m majority class samples in each of the n_f dimensions according to the number of distinct elements;
synthesize the degrees of dispersion of the m majority class samples over the n_f dimensions to obtain the discrete difference degree L_f of the m majority class samples;
wherein the discrete difference degree L_f of the m majority class samples is determined from n_f, the dimension of the descriptive features, and card(F_t), the number of distinct elements in the set F_t, where F_t denotes the set of descriptive feature values of the m majority class samples in the same dimension t.
15. The apparatus of claim 11, wherein determining the difference degree of each majority class sample group according to the distance sum and the discrete difference degree comprises:
determining the sum of the Euclidean distances between said any one minority class sample and the m majority class samples in each majority class sample group according to the numerical features; and
determining the difference degree of each majority class sample group as a weighted ratio of the discrete difference degree L to the sum of the Euclidean distances.
16. The apparatus of claim 11, wherein the apparatus further comprises:
for each minority class sample, selecting any m majority class samples from all M majority class samples as a group of majority class samples, to obtain C(M, m) groups of majority class samples; and
determining the difference degree S_m of the C(M, m) majority class sample groups, and selecting the majority class sample group with the largest difference degree as the sampling result corresponding to said any one minority class sample.
17. The apparatus of claim 11, wherein the majority class samples and the minority class samples comprise n_s-dimensional numerical features, the apparatus further comprising:
for any one minority class sample, calculating the distances between the M majority class samples and said any one minority class sample according to the numerical features;
sorting the M majority class samples by distance to obtain a majority class sample sequence;
selecting q groups of majority class sample groups from the majority class sample sequence, wherein each majority class sample group comprises m majority class samples, and the m majority class samples are adjacent to each other in the majority class sample sequence; and
determining the difference degree S_m of the q majority class sample groups, and selecting the majority class sample group with the largest difference degree as the sampling result corresponding to said any one minority class sample.
18. The apparatus according to claim 15 or 17, further comprising:
calculating the Euclidean distance d_ij between any one majority class sample X_i and any one minority class sample Y_j using the following formula:

d_ij = √((x_i1 − y_j1)² + (x_i2 − y_j2)² + … + (x_i,n_s − y_j,n_s)²)

wherein the majority class sample X_i includes the n_s-dimensional numerical features (x_i1, x_i2, …, x_i,n_s) and the minority class sample Y_j includes the n_s-dimensional numerical features (y_j1, y_j2, …, y_j,n_s).
19. A data processing apparatus, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A computer readable storage medium storing a program which, when executed by a multi-core processor, causes the multi-core processor to perform the method of any of claims 1-9.
CN202010743665.7A 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium Active CN112001425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010743665.7A CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010743665.7A CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112001425A CN112001425A (en) 2020-11-27
CN112001425B (en) 2024-05-03

Family

ID=73464171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010743665.7A Active CN112001425B (en) 2020-07-29 2020-07-29 Data processing method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112001425B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066540B (en) * 2021-03-19 2023-04-11 新疆大学 Method for preprocessing non-equilibrium fault sample of oil-immersed transformer

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645249A (en) * 2013-11-27 2014-03-19 国网黑龙江省电力有限公司 Online fault detection method for reduced set-based downsampling unbalance SVM (Support Vector Machine) transformer
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109978009A (en) * 2019-02-27 2019-07-05 广州杰赛科技股份有限公司 Behavior classification method, device and storage medium based on wearable intelligent equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9037518B2 (en) * 2012-07-30 2015-05-19 Hewlett-Packard Development Company, L.P. Classifying unclassified samples
US9224104B2 (en) * 2013-09-24 2015-12-29 International Business Machines Corporation Generating data from imbalanced training data sets
US10878336B2 (en) * 2016-06-24 2020-12-29 Intel Corporation Technologies for detection of minority events


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wei-Chao Lin et al., "Clustering-based undersampling in class-imbalanced data," Information Sciences, Oct. 31, 2017, pp. 17-26. *
Yu Yanli, Jiang Kaizhong, Wang Ke, Sheng Jingwen, "Undersampling algorithm for imbalanced data based on improved K-means clustering," Software Guide, Jun. 15, 2020, No. 06, pp. 211-215. *

Also Published As

Publication number Publication date
CN112001425A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
Bucks et al. Do borrowers know their mortgage terms?
EP1361526A1 (en) Electronic data processing system and method of using an electronic processing system for automatically determining a risk indicator value
CN106384282A (en) Method and device for building decision-making model
CN109389494A (en) Borrow or lend money fraud detection model training method, debt-credit fraud detection method and device
Ollech et al. A random forest-based approach to identifying the most informative seasonality tests
CN114328461A (en) Big data analysis-based enterprise innovation and growth capacity evaluation method and system
CN110796539A (en) Credit investigation evaluation method and device
CN109508807A (en) Lottery user liveness prediction technique, system and terminal device, storage medium
CN113362157A (en) Abnormal node identification method, model training method, device and storage medium
CN112001425B (en) Data processing method, device and computer readable storage medium
CN114240101A (en) Risk identification model verification method, device and equipment
CN115965175A (en) Skyline evaluation-oriented rating method and system
CN112365352A (en) Anti-cash-out method and device based on graph neural network
CN116800831A (en) Service data pushing method, device, storage medium and processor
CN116977091A (en) Method and device for determining individual investment portfolio, electronic equipment and readable storage medium
CN113177733B (en) Middle and small micro enterprise data modeling method and system based on convolutional neural network
CN106296405A (en) A kind of And Methods of Computer Date Processing for dynamically configuration endowment assets and device thereof
CN115860924A (en) Supply chain financial credit risk early warning method and related equipment
CN113610629A (en) Method and device for screening client data features from large-scale feature set
CN114328668A (en) Method and device for generating deposit risk control strategy, terminal and storage medium
Mehta et al. Analysis of Stocks by the Use of Clustering and Classification Algorithms
CN114119201B (en) Enterprise credit investigation method, equipment and medium
CN112232944B (en) Method and device for creating scoring card and electronic equipment
Luo et al. Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment
CN111429215B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant