CN113052198A - Data processing method, device, equipment and storage medium


Publication number
CN113052198A
Authority
CN
China
Prior art keywords
sample
samples
neighbor
boundary
random
Prior art date
Legal status
Pending
Application number
CN201911384128.1A
Other languages
Chinese (zh)
Inventor
张玉
张泽
詹灵月
余韦
梁恩磊
杨猛
彭依校
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Information Technology Co Ltd
Priority to CN201911384128.1A
Publication of CN113052198A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

The embodiment of the invention discloses a data processing method, apparatus, device, and storage medium, wherein the method comprises the following steps: acquiring non-numerical and/or discrete field data in a first sample set; reducing the dimensionality of the field data to obtain a boundary sample set related to the first sample set; determining at least one neighbor sample of a random sample according to the probability value of each sample in the first sample set; and generating a second sample set comprising minority-class samples based on the at least one neighbor sample and the boundary sample set. This can solve the problem of improving the classification accuracy on minority-class sample sets while keeping the overall classification accuracy stable.

Description

Data processing method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method, a data processing device, terminal equipment and a storage medium.
Background
In practical applications, classification targets are often unbalanced data sets, for example: mining malicious arrearage users, recognizing harassing calls, and early warning of off-network (churning) users. In such applications, one class of samples is typically far smaller than the others, which biases the classifier toward the class with more samples during classification and recognition, so that minority-class features are neglected and classifier performance degrades. Such classification yields high accuracy but a low recall rate, and the overall usability of the recognition is poor.
Therefore, improving the classification accuracy of the minority-class sample set while keeping the overall classification accuracy stable has become an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present invention provide a data processing method and apparatus, a terminal device, and a storage medium, which can solve the problem of improving the accuracy of minority sample set classification on the premise of ensuring stable overall classification accuracy.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a data processing method, where the method may include:
acquiring non-numerical and/or discrete field data in a first sample set;
reducing the dimension of the field data to obtain a boundary sample set related to the first sample set;
determining at least one neighbor sample of the random samples according to the probability value of each sample in the first sample set;
a second set of samples comprising a minority class of samples is generated based on the at least one neighbor sample and the set of boundary samples.
In a possible embodiment, the step of performing dimension reduction on the field data to obtain the boundary sample set associated with the first sample set may specifically include:
encoding the field data through one-hot encoding to obtain encoded field data;
and performing dimensionality reduction on the encoded field data using Principal Component Analysis (PCA) to obtain the boundary sample set.
In another possible embodiment, the step of "determining the probability value of each sample" mentioned above may specifically include:
dividing the first sample set into a first training set and a first prediction set, wherein the first training set and the first prediction set each comprise a plurality of samples;
predicting the first prediction set through a gradient boosting decision tree GBDT algorithm to obtain the probability value of each sample in the first prediction set;
determining the first training set as a second prediction set and the first prediction set as a second training set;
predicting the second prediction set through a GBDT algorithm to obtain the probability value of each sample in the second prediction set;
and determining the probability value of each sample according to the probability value of each sample in the first prediction set and the probability value of each sample in the second prediction set.
In another possible embodiment, the step of "determining at least one neighboring sample of the random sample according to the probability value of each sample in the first sample set" referred to above may specifically include:
at least one neighbor sample of the random samples is determined by a K-neighbor algorithm based on the probability value of each sample in the first set of samples.
In yet another possible embodiment, the step of "determining at least one neighboring sample of the random sample by using a K-nearest neighbor algorithm according to the probability value of each sample in the first sample set" referred to above may specifically include:
obtaining the probability value of the random number selected according to the probability value of each sample in the first sample set;
determining the random number corresponding to the probability value meeting the preset condition as a random sample;
at least one neighbor sample of the random sample is determined from the random sample.
In yet another possible embodiment, the method referred to above may further include:
in case at least one neighboring sample belongs to a sample in the boundary sample set,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through the synthetic minority over-sampling technique (SMOTE) algorithm to obtain a first frequency value of each sample;
selecting a target frequency value meeting a first preset frequency condition from the plurality of first frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
Alternatively, in yet another possible embodiment, the method mentioned above may further include:
in the case where a sample not belonging to the boundary sample set is included in the at least one neighboring sample,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through a step-size mutation algorithm in a genetic algorithm to obtain a second frequency value of each sample;
selecting a target frequency value meeting a second preset frequency condition from the plurality of second frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
In another possible embodiment, if the second frequency values of each sample are the same, any one of the second frequency values is randomly selected as the target frequency value.
In yet another possible embodiment, the step of generating the second sample set including a few types of samples based on at least one neighboring sample and the boundary sample set may specifically include:
a second sample set comprising minority class samples is generated based on the target frequency value, the at least one neighbor sample, and the boundary sample set.
In yet another possible embodiment, the step of generating the second sample set including a few types of samples based on at least one neighboring sample and the boundary sample set may specifically include:
repeatedly determining at least one neighbor sample of a random sample according to the number of samples in the first sample set, and generating the second sample set based on the obtained neighbor samples and the boundary sample set; wherein
the second sample set is used for screening minority-class samples in any sample set, and the minority-class samples include: characteristic data of off-network users.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus may include:
the acquisition module is used for acquiring non-numerical field data and/or discrete field data in the first sample set;
the adjusting module is used for reducing the dimension of the field data to obtain a boundary sample set related to the first sample set;
the processing module is used for determining at least one neighbor sample of the random sample according to the probability value of each sample in the first sample set;
a generating module for generating a second sample set comprising a minority sample based on the at least one neighbor sample and the boundary sample set.
In a third aspect, an embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the data processing method shown in the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program for causing a computer to execute the data processing method according to the first aspect if the computer program is executed in the computer.
In the embodiment of the invention, performing dimensionality reduction on the non-numerical and/or discrete field data in the first sample set effectively solves the problem that existing oversampling methods are not applicable to character-type data; in addition, applying the dimension-reduction idea to the SMOTE algorithm reduces the risk of the curse of dimensionality and widens the range of application. Meanwhile, aiming at the problem that a high-dimensional space is so vast that points in it are essentially never adjacent to one another, the K-nearest-neighbor algorithm is improved by way of dimensionality reduction, thereby reducing errors. In addition, at least one neighbor sample of the random sample is determined from the probability value of each sample in the first sample set, and a second sample set including minority-class samples is generated based on the at least one neighbor sample and the boundary sample set. Thus, the improved K-nearest-neighbor algorithm is used to derive the non-numerical and/or discrete fields, effectively solving the problem that samples containing such fields cannot be oversampled with traditional methods.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a comparison of a data processing method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Currently, the processing of unbalanced data sets focuses mainly on the algorithm level and the data level.
Processing at the algorithm level mainly modifies the algorithm's bias on the data so that the decision surface leans toward the minority class, thereby improving the minority-class recognition rate; examples include ensemble learning and feature selection methods. The core idea of processing at the data level is to resample the data set, e.g., under-sampling and over-sampling techniques.
Here, the oversampling technique for synthesizing minority-class samples is typically the Synthetic Minority Over-sampling Technique (SMOTE). The SMOTE algorithm is an improvement on random oversampling, which adds minority samples by simple replication and thereby causes model overfitting, making the information learned by the model too specific to generalize. The core idea of the SMOTE algorithm is to analyze the minority-class samples, artificially synthesize new samples from them, and add these to the data set. The basic approach is to perform linear interpolation between neighboring minority-class samples to form new minority-class samples.
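For illustration, the linear interpolation at the core of SMOTE can be sketched in Python as follows (a minimal sketch; the array representation and random generator are assumptions, not part of any disclosed method):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng=None):
    """Synthesize one new minority-class sample on the line segment
    between minority sample x and one of its minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng()
    gap = rng.random()               # uniform random number in (0, 1)
    return x + gap * (neighbor - x)  # linear interpolation between the two
```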
There are two main schemes based on improving the SMOTE algorithm. The first adds consideration of boundary samples and isolated points on top of the traditional SMOTE algorithm, so that an approximately balanced data set can be obtained. The main method is as follows: search the k-neighbors of all minority samples in the sample space of the unbalanced data set, and divide the minority samples into three subsets according to the proportion of majority to minority samples within the k-neighborhood: a safe sample set (within the k-neighborhood of a minority sample, majority samples are fewer than minority samples), a dangerous sample set (majority samples are not fewer than minority samples, and the number of minority samples is not 0), and an isolated sample set (the k-neighborhood of the current minority sample contains only majority samples). For each minority sample in the dangerous sample set, the SMOTE algorithm is applied over the total minority sample space.
The second scheme is a classification method based on an improved Density-Based Spatial Clustering of Applications with Noise (DBSCAN)-SMOTE algorithm. First, boundary samples are identified in the data sample set and divided into majority-class boundary samples and minority-class boundary samples. The improved DBSCAN-based clustering algorithm is applied to boundary samples in the majority-class sample space (the algorithm can both generate minority-class clusters and oversample within them); a Particle Swarm Optimization (PSO) algorithm is then used to optimize the oversampling rates of boundary samples and safe samples within the clusters, and minority-class boundary samples are oversampled at different rates through the SMOTE algorithm.
However, when processing an unbalanced data set, both of the above two schemes have certain disadvantages, which are specifically as follows:
(1) The above schemes all use the K-nearest-neighbor algorithm, which is based on distance calculation. Its effect deteriorates steadily as the feature dimension increases, because the high-dimensional space is so vast that points in it are essentially never adjacent to one another. Therefore, for samples with many dimensions, the K-nearest-neighbor algorithm cannot accurately determine the nearest neighbors.
(2) The above schemes require the sample data to be continuous, but in practical applications a large amount of discrete and character data exists, so many data sets cannot be balanced using the above methods.
(3) In the above schemes, the multiplying factor N must be a positive integer, which makes the number of newly synthesized samples an integral multiple of the number of minority-class samples; the number of generated samples therefore cannot be accurately controlled, which may affect classifier performance.
(4) The above boundary-selection schemes have certain limitations. In the first scheme, boundary samples are selected according to the categories of the K neighbors; when the high-dimensional space is too vast, points in it are essentially never close to one another, so this scheme carries a large error in distinguishing boundaries. In the second scheme, when the density of spatial clusters is uneven and cluster distances differ greatly, DBSCAN's clustering quality is poor; the clustering effect depends on the choice of the K-nearest-neighbor distance formula, the Euclidean distance is commonly used in practice, and for high-dimensional data the curse of dimensionality arises.
(5) In the process of synthesizing a new sample, only the information of some neighbors is utilized, which has certain limitations. If the K nearest-neighbor samples are scattered, the newly generated samples may fall among the majority-class samples and are likely to be noisy data, which can reduce the quality of the data set.
In view of the above disadvantages, the embodiment of the present invention provides an improved SMOTE algorithm, denoted G-SMOTE, by combining GBDT with a mutation operator from a genetic algorithm. G-SMOTE selects target samples in combination with GBDT and derives new samples from the minority-class target samples using the mutation operator. The data processing method provided by the embodiment of the invention makes full use of the data surrounding the minority samples at the edge, improves the quality of the newly synthesized samples, and achieves fine control over the synthesis quality of minority samples. The scheme also better expands the decision space of the minority class, effectively avoids the tendency of existing algorithms to blur the boundary between positive and negative samples, and makes the effect of synthesizing new samples superior.
The data processing method provided by the embodiment of the invention is explained in detail below.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention.
As shown in fig. 1, the data processing method may specifically include steps 110 to 140, which are specifically as follows:
Step 110, acquiring non-numerical and/or discrete field data in the first sample set.
Step 120, performing dimensionality reduction on the field data to obtain a boundary sample set related to the first sample set.
Specifically, the field data are encoded through one-hot encoding to obtain encoded field data, and Principal Component Analysis (PCA) is used to reduce the dimensionality of the encoded field data to obtain the boundary sample set.
For example, one-hot encoding is performed on the non-numerical fields in the first sample set S to obtain S', and PCA is used to reduce the dimensionality of the encoded field data to obtain S''.
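A minimal sketch of this preprocessing step using scikit-learn (version 1.2+ assumed for the `sparse_output` flag; the column split and the retained variance ratio are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

def encode_and_reduce(S: pd.DataFrame, categorical_cols):
    """One-hot encode the non-numerical/discrete fields of S to get S',
    then reduce S' with PCA to get S'' (names follow the example above)."""
    enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    S_prime = enc.fit_transform(S[categorical_cols])  # S' (encoded fields)
    pca = PCA(n_components=0.95)  # keep 95% of variance -- an assumption
    S_double_prime = pca.fit_transform(S_prime)       # S'' (reduced)
    return S_double_prime
```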
Here, the embodiment of the present invention improves the SMOTE algorithm by using one-hot coding, and improves the K-nearest neighbor algorithm in a dimension reduction manner to reduce errors, aiming at the problem that a high-dimensional space is too large and points in the high-dimensional space cannot be adjacent to each other at all.
In addition, the problem that the original SMOTE algorithm cannot be applied to non-numerical fields is effectively solved; the discrete fields are derived with an improved KNN algorithm, selecting the value with the highest frequency among a variable's neighbor values as the value of the derived sample's field, which effectively solves the problem that samples containing non-numerical and discrete fields cannot be oversampled with traditional methods.
Step 130, determining at least one neighbor sample of the random sample according to the probability value of each sample in the first sample set.
For example, following the example in step 120, the minority-class boundary sample set S''_l is recorded as a population T_n. The probability P_i of individual i being selected can be obtained as equation (1):

$$P_i = \frac{F_i}{\sum_{j=1}^{n} F_j} \qquad (1)$$

where the probability of boundary sample i is denoted F_i, as determined by equation (7), and i = 1, 2, 3, ..., n indexes the individuals in the population. The cumulative probability q_i of each individual is shown by equation (2):

$$q_i = \sum_{j=1}^{i} P_j \qquad (2)$$

Here, a specific manner of determining the random sample a may be as follows:
Randomly generate a random number r (0 < r < 1). If r ≤ q_1, sample X_1 is selected; if q_{a-1} < r ≤ q_a (2 ≤ a ≤ N), sample a is selected as the random sample.
In this way, the K neighbors of the random sample a can be calculated, and the K neighbor samples of a are denoted y_1, y_2, y_3, ..., y_k.
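The roulette-wheel selection of equations (1) and (2) and the subsequent K-neighbor lookup might be sketched as follows (F is the vector of boundary-sample probabilities; the scikit-learn neighbor search and the variable names are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_random_sample(F, rng=None):
    """Roulette-wheel selection: P_i = F_i / sum_j F_j (eq. 1),
    q_i = cumulative sum of P (eq. 2); return the first a with r <= q_a."""
    if rng is None:
        rng = np.random.default_rng()
    P = np.asarray(F) / np.sum(F)        # equation (1)
    q = np.cumsum(P)                     # equation (2)
    r = rng.random()                     # random number in (0, 1)
    return int(np.searchsorted(q, r))    # index of the selected individual

def k_neighbors(X, a, k):
    """Indices of the k nearest neighbors of sample a within X."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[a:a + 1])
    return idx[0][1:]                    # drop sample a itself
```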
Here, in a possible example, before this step is performed, the method may further include determining the probability value of each sample in the first sample set; a specific implementation is as follows:
dividing the first sample set into a first training set and a first prediction set, wherein the first training set and the first prediction set each comprise a plurality of samples; predicting the first prediction set through the gradient boosting decision tree (GBDT) algorithm to obtain the probability value of each sample in the first prediction set; determining the first training set as a second prediction set and the first prediction set as a second training set; predicting the second prediction set through the GBDT algorithm to obtain the probability value of each sample in the second prediction set; and determining the probability value of each sample according to the probability value of each sample in the first prediction set and the probability value of each sample in the second prediction set.
For example, the first sample set S is divided into two parts, denoted S1 and S2. Taking S1 as the first training set and S2 as the first prediction set, S2 is predicted using GBDT to obtain the probabilities of the S2 samples. Specifically, the base learner is initialized by minimizing equation (3):

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{m} L(y_i, c) \qquad (3)$$

where m is the total number of samples in the first sample set S (m is a positive integer greater than or equal to 1), c is the predicted value corresponding to any tree in the GBDT, and y_i characterizes the true value of the sample (in practical applications, it can characterize whether the user represented by the sample is in an off-network or on-network state).

Based on this, the negative gradient at iteration t (t = 1, 2, 3, ..., T) is calculated by equation (4), as shown below:

$$r_{ti} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{t-1}(x)} \qquad (4)$$

where x and x_i are the features of the samples. Using (x_i, r_{ti}), i = 1, 2, 3, ..., m, the t-th CART regression tree is fitted; its corresponding leaf-node regions are R_{tj}, j = 1, 2, 3, ..., J, where J is the number of leaf nodes of the regression tree.

Next, for each leaf-node region j = 1, 2, 3, ..., J, the best-fit value c_{tj} is calculated as shown in equation (5):

$$c_{tj} = \arg\min_{c} \sum_{x_i \in R_{tj}} L\big(y_i, f_{t-1}(x_i) + c\big) \qquad (5)$$

Then, based on the best-fit values c_{tj}, the strong learner is updated by equation (6) to predict the probabilities of the samples in S2:

$$f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj} \, I(x \in R_{tj}) \qquad (6)$$

where I is an indicator function. After all T iterations, the final strong learner is obtained, as shown in equation (7), and is used to predict the probability of each sample in S2:

$$f(x) = f_0(x) + \sum_{t=1}^{T} \sum_{j=1}^{J} c_{tj} \, I(x \in R_{tj}) \qquad (7)$$
Thus, the method described above is repeated with the roles of S1 and S2 swapped: S2 serves as the second training set and S1 as the second prediction set, and the second prediction set is predicted by GBDT to obtain the probability of each sample in it. In this way, the probability of each sample in the entire data set S is obtained. The probabilities of the minority-class boundary sample set in step 120 can then be screened out according to the per-sample probabilities; the probability of boundary sample i is denoted F_i in the embodiment of the present invention.
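A sketch of this two-fold probability estimate, with scikit-learn's GradientBoostingClassifier standing in for the GBDT learner of equations (3)-(7) (the 50/50 split and binary {0, 1} labels are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def gbdt_sample_probabilities(X, y, seed=0):
    """Estimate a probability for every sample in S by swapping the roles
    of the two halves S1 and S2: train on one half, predict the other."""
    idx = np.arange(len(X))
    i1, i2 = train_test_split(idx, test_size=0.5,
                              random_state=seed, stratify=y)
    proba = np.empty(len(X))
    for train, pred in ((i1, i2), (i2, i1)):   # S1 -> S2, then S2 -> S1
        model = GradientBoostingClassifier().fit(X[train], y[train])
        proba[pred] = model.predict_proba(X[pred])[:, 1]  # positive class
    return proba
```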
Based on this, step 130 may specifically include determining at least one neighboring sample of the random samples by a K-neighbor algorithm according to the probability value of each sample in the first set of samples.
Further, obtaining a probability value of the random number being selected according to the probability value of each sample in the first sample set; determining the random number corresponding to the probability value meeting the preset condition as a random sample; at least one neighbor sample of the random sample is determined from the random sample.
Step 140, generating a second sample set comprising minority-class samples based on the at least one neighbor sample and the boundary sample set.
Repeatedly determining at least one neighbor sample of the random samples according to the number of the samples in the first sample set, and generating a second sample set based on the obtained neighbor samples and the boundary sample set; the second sample set is used for screening a few types of samples in any sample set; wherein, the few classes of samples include: characteristic data of off-network users.
For example, steps 130 and 140 are repeated according to the required number of samples N, repeatedly selecting target samples from the minority class, so that N may even exceed the number of minority-class samples. The original minority samples (i.e., the boundary sample set) and the synthesized samples are then combined into a new minority sample set (i.e., the second sample set).
Here, in a possible embodiment, before generating the second sample set, the method may further include determining the frequency value of the second sample set. The embodiment of the present invention determines this frequency value in two different cases, as follows:
scenario 1, in case at least one neighboring sample belongs to a sample in the set of boundary samples,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through the synthetic minority over-sampling technique (SMOTE) algorithm to obtain a first frequency value of each sample;
selecting a target frequency value meeting a first preset frequency condition from the plurality of first frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
For example, based on the examples in step 120 and step 130, if all K neighbor samples are minority-class samples, then for continuous fields the continuous data at random sample a and its K neighbor samples are interpolated based on the original SMOTE algorithm; equation (8) for calculating the new sample of the second sample set is as follows:

$$x_{new} = x_a + \mathrm{rand}(0, 1) \times (y_j - x_a) \qquad (8)$$

where rand(0, 1) is a random number in (0, 1) and y_j is one of the K neighbor samples of a. For discrete fields, the frequency counts of the values taken by the discrete field across the K neighbor samples of random sample a, i.e., the first frequency values, are calculated, and the value with the highest frequency count is selected as the value of the new sample's discrete field, i.e., the target frequency value. If the frequency counts are the same, a value is randomly selected as the target frequency value of the field (i.e., for the second sample set).
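Scenario 1 might be sketched as follows (the field layout is an assumption; ties in `Counter.most_common` are broken by insertion order rather than truly at random, which is close enough for a sketch):

```python
import numpy as np
from collections import Counter

def synthesize_all_minority(x_a_cont, nbr_cont, nbr_disc, rng=None):
    """Scenario 1: all K neighbors are minority-class samples.
    Continuous fields follow equation (8); each discrete field takes
    the most frequent value among the K neighbors."""
    if rng is None:
        rng = np.random.default_rng()
    y_j = nbr_cont[rng.integers(len(nbr_cont))]    # pick one neighbor
    new_cont = x_a_cont + rng.random() * (y_j - x_a_cont)  # equation (8)
    new_disc = [Counter(col).most_common(1)[0][0]  # modal value per field
                for col in zip(*nbr_disc)]
    return new_cont, new_disc
```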
Scenario 2, in the case where a sample not belonging to the boundary sample set is included in the at least one neighboring sample,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through a step-size mutation algorithm in a genetic algorithm to obtain a second frequency value of each sample;
selecting a target frequency value meeting a second preset frequency condition from the plurality of second frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
Here, if the second frequency value of each sample is the same, any one of the second frequency values is randomly selected as the target frequency value.
For example, based on the examples in step 120 and step 130, if the K neighbor samples are not all minority-class samples, the continuous fields generate a new second sample set based on the step-size mutation algorithm in the genetic algorithm; the new sample x_new is implemented by equations (9) and (10), as in equation (9):

$$x_{new} = x_a + 0.5 \times L \times d \qquad (9)$$

where d is the step length and L is a mutation coefficient given by equation (10). For discrete fields, the frequency of each value of the discrete field among the K neighbors of random sample a is calculated, and the value with the highest frequency is selected as the value of the new sample's discrete field, i.e., the target frequency value. If the frequency counts are the same, a value is randomly selected as the target frequency value of the field (i.e., for the second sample set).
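Because equation (10) defining L is not reproduced above, the sketch below substitutes a breeder-GA-style mutation coefficient for L; that substitution, like the variable names, is an assumption rather than the patented formula:

```python
import numpy as np

def step_mutation(x_a_cont, d, rng=None, m=16):
    """Scenario 2 (sketch): x_new = x_a + 0.5 * L * d (equation (9)).
    L is modeled here as a breeder-GA coefficient sum_k alpha_k * 2**(-k)
    with each alpha_k set with probability 1/m -- an assumed stand-in
    for the unavailable equation (10)."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.random(m) < (1.0 / m)            # sparse random bits
    L = np.sum(alpha * 2.0 ** -np.arange(m))     # assumed form of L
    return x_a_cont + 0.5 * L * d                # equation (9)
```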
Based on this, step 140 may specifically include: a second sample set comprising minority class samples is generated based on the target frequency value, the at least one neighbor sample, and the boundary sample set.
Therefore, the embodiment of the invention calculates the probabilities of the minority-class samples using GBDT, selects effective samples to derive new samples according to the probability values in combination with the genetic algorithm, and selectively utilizes the information of different sample types according to the K neighbors. This prevents newly generated samples from falling among the majority class, improves the quality of the samples generated for the second sample set, avoids the limitation of existing schemes that derive target samples directly from the K neighbors, and effectively solves the problem of sample-edge fuzzification. In addition, the method overcomes the defect that the number of new samples synthesized by existing sampling schemes must be an integral multiple of the number of minority-class samples: the scheme can generate any number of minority-class samples according to the sample proportion, so the number of generated samples can be precisely controlled and classifier performance improved.
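Putting the pieces together, the overall G-SMOTE loop of steps 110-140 might look like the driver below (the function names refer to the sketches above; the data layout, the fixed step d, and the modal tie-breaking are all assumptions):

```python
import numpy as np

def g_smote(X_cont, X_disc, minority_idx, F, N, k=5, rng=None):
    """Generate N synthetic minority samples (sketch). X_cont/X_disc hold
    the continuous and discrete fields of the full set; minority_idx are
    the minority boundary samples whose GBDT probabilities are F."""
    if rng is None:
        rng = np.random.default_rng()
    is_minority = np.zeros(len(X_cont), dtype=bool)
    is_minority[minority_idx] = True
    new_cont, new_disc = [], []
    while len(new_cont) < N:                         # repeat steps 130-140
        a = minority_idx[select_random_sample(F, rng)]   # roulette wheel
        nbrs = k_neighbors(X_cont, a, k)
        if is_minority[nbrs].all():                  # scenario 1
            c, dv = synthesize_all_minority(
                X_cont[a], X_cont[nbrs], X_disc[nbrs].tolist(), rng)
        else:                                        # scenario 2
            c = step_mutation(X_cont[a], d=1.0, rng=rng)  # step d assumed
            dv = [max(set(col), key=list(col).count)      # modal value
                  for col in zip(*X_disc[nbrs].tolist())]
        new_cont.append(c)
        new_disc.append(dv)
    return np.vstack(new_cont), new_disc   # then merged with the first set
```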
In addition, to provide data support for the method, the embodiment of the present invention compares the prior art with the provided method, as follows:
First, the comparative application scenario is introduced: to predict home-broadband churn users, a model is built using data from month 1, and the effect of each scheme is tested using data from month 2. There are 988018 samples in total for month 1, of which 38070 are churned users, a positive-to-negative sample ratio of 1:24.9; the data are severely unbalanced.
The data processing methods specifically adopted comprise the following three modes:
Mode 1: no processing is done on the month-1 data.
Mode 2: the month-1 data are processed with the existing SMOTE algorithm, generating 56924 positive samples in total.
Mode 3: the month-1 data are processed according to the embodiment of the present invention, generating 56924 positive samples in total. The specific process is as in steps 110-140 above; the operation is repeated 56924 times, and the second sample set is merged with the first sample set.
Then, modeling evaluation is performed for the three modes: a model is trained on the data processed in each mode and used to predict the month-2 data. Comparison and verification show that the effect of this scheme is better than both the SMOTE algorithm and no processing; the specific data can be seen in FIG. 2.
Therefore, compared with the first mode and the second mode, the embodiment of the invention has the following technical advantages:
(1) Wide applicability. The embodiment of the invention uses the KNN algorithm to derive the values of samples' discrete fields, effectively solving the problem that existing schemes cannot oversample discrete data; one-hot encoding is used to derive the non-numerical fields, solving the problem that existing oversampling methods are not applicable to character data; and the dimension-reduction idea is applied to the SMOTE algorithm, reducing the risk of the curse of dimensionality and widening the range of application.
(2) Accurate identification. Aiming at the problem of unbalanced sample data in classification and the weak identification features of minority-class samples, the proposed improved oversampling method selectively utilizes sample information of different types, improves the quality of new samples, effectively solves the difficulty of identifying minority classes under sample imbalance, and further improves the accuracy of model identification.
(3) Arbitrary number of samples. Aiming at the problem that the number of samples synthesized by existing schemes must be an integral multiple of the number of minority samples, this scheme can generate any number of minority-class samples according to the sample proportion, precisely controlling the number of generated samples and improving classifier performance.
Therefore, based on the above method, an embodiment of the present invention further provides a data processing apparatus, which is specifically described with reference to fig. 3.
Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 3, the data processing apparatus 30 may specifically include:
an obtaining module 301, configured to obtain field data of a non-numerical type and/or a discrete type in a first sample set;
an adjusting module 302, configured to perform dimension reduction on the field data to obtain a boundary sample set related to the first sample set;
a processing module 303, configured to determine at least one neighbor sample of the random sample according to the probability value of each sample in the first sample set;
a generating module 304 for generating a second sample set comprising a minority class of samples based on the at least one neighbor sample and the boundary sample set.
The adjusting module 302 in the embodiment of the present invention may be specifically configured to encode the field data by one-hot encoding to obtain encoded field data, and to perform dimensionality reduction on the encoded field data using Principal Component Analysis (PCA) to obtain the boundary sample set.
In a possible embodiment, the processing module 303 in the embodiment of the present invention may further determine the probability value of each sample, wherein the first sample set is divided into a first training set and a first prediction set, and the first training set and the first prediction set each comprise a plurality of samples;
predicting the first prediction set through a gradient boosting decision tree GBDT algorithm to obtain the probability value of each sample in the first prediction set;
determining the first training set as a second prediction set and the first prediction set as a second training set;
predicting the second prediction set through a GBDT algorithm to obtain the probability value of each sample in the second prediction set;
and determining the probability value of each sample according to the probability value of each sample in the first prediction set and the probability value of each sample in the second prediction set.
Further, the processing module 303 in this embodiment of the present invention may be specifically configured to determine at least one neighbor sample of the random samples through a K-neighbor algorithm according to the probability value of each sample in the first sample set.
Based on this, the processing module 303 may be specifically configured to obtain a probability value of the selected random number according to the probability value of each sample in the first sample set;
determining the random number corresponding to the probability value meeting the preset condition as a random sample;
at least one neighbor sample of the random sample is determined from the random sample.
Further, the data processing apparatus 30 may further include: a module 305 is determined.
In one possible embodiment, the determining module 305 may be configured to, in case at least one neighboring sample belongs to a sample in the boundary sample set,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through the synthetic minority over-sampling technique (SMOTE) algorithm to obtain a first frequency value of each sample;
selecting a target frequency value meeting a first preset frequency condition from the plurality of first frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
In another possible embodiment, the determining module 305 may be further configured to, in case that a sample not belonging to the boundary sample set is included in the at least one neighboring sample,
performing difference calculation on continuous data of the random sample and the at least one neighbor sample through a step-size mutation algorithm in a genetic algorithm to obtain a second frequency value of each sample;
selecting a target frequency value meeting a second preset frequency condition from the plurality of second frequency values;
the target frequency value is determined as the frequency value of the second set of samples.
And if the second frequency value of each sample is the same, randomly selecting any one second frequency value as the target frequency value.
In a possible embodiment, the generating module 304 in the embodiment of the present invention may be specifically configured to generate the second sample set including the minority sample according to the target frequency value, the at least one neighboring sample, and the boundary sample set.
In another possible embodiment, the generating module 304 in this embodiment of the present invention may be specifically configured to repeatedly determine at least one neighbor sample of a random sample according to the number of samples in the first sample set, and generate the second sample set based on the obtained neighbor samples and the boundary sample set; wherein
the second sample set is used for screening minority-class samples in any sample set, and the minority-class samples comprise feature data of off-network users.
In the embodiment of the invention, performing dimensionality reduction on the non-numerical and/or discrete field data in the first sample set effectively solves the problem that existing oversampling methods are not applicable to character-type data; in addition, applying the dimension-reduction idea to the SMOTE algorithm reduces the risk of the curse of dimensionality and widens the range of application. Meanwhile, aiming at the problem that a high-dimensional space is so vast that points in it are essentially never adjacent to one another, the K-nearest-neighbor algorithm is improved by way of dimensionality reduction, thereby reducing errors. In addition, at least one neighbor sample of the random sample is determined from the probability value of each sample in the first sample set, and a second sample set including minority-class samples is generated based on the at least one neighbor sample and the boundary sample set. Thus, the improved K-nearest-neighbor algorithm is used to derive the non-numerical and/or discrete fields, effectively solving the problem that samples containing such fields cannot be oversampled with traditional methods.
Fig. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
The terminal device 400 includes but is not limited to: radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, processor 410, and power supply 411. Those skilled in the art will appreciate that the terminal device configuration shown in fig. 4 does not constitute a limitation of the terminal device, and that the terminal device may include more or fewer components than shown, or combine certain components, or a different arrangement of components. In the embodiment of the present invention, the terminal device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 401 may be used for receiving and sending signals during a message sending/receiving process or a call process; specifically, it receives downlink resources from a base station and then delivers them to the processor 410 for processing, and it also transmits uplink resources to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, and the like. Further, the radio frequency unit 401 can also communicate with a network and other devices through a wireless communication system.
The terminal device provides wireless broadband internet access to the user through the network module 402, such as helping the user send and receive e-mails, browse web pages, and access streaming media.
The audio output unit 403 may convert an audio resource received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into an audio signal and output as sound. Also, the audio output unit 403 may also provide audio output related to a specific function performed by the terminal apparatus 400 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 403 includes a speaker, a buzzer, a receiver, and the like.
The input unit 404 is used to receive audio or video signals. The input unit 404 may include a Graphics Processing Unit (GPU) 4041 and a microphone 4042; the graphics processor 4041 processes image resources of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 406. The image frames processed by the graphics processor 4041 may be stored in the memory 409 (or other storage medium) or transmitted via the radio frequency unit 401 or the network module 402. The microphone 4042 may receive sound and process it into an audio resource. In the telephone call mode, the processed audio resource may be converted into a format transmittable to a mobile communication base station via the radio frequency unit 401 and output.
The terminal device 400 further comprises at least one sensor 405, such as light sensors, motion sensors and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 4061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 4061 and/or the backlight when the terminal apparatus 400 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the terminal device posture (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration identification related functions (such as pedometer, tapping), and the like; the sensors 405 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be described in detail herein.
The display unit 406 is used to display information input by the user or information provided to the user. The Display unit 406 may include a Display panel 4061, and the Display panel 4061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 407 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the terminal device. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. Touch panel 4071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations by a user on or near touch panel 4071 using a finger, a stylus, or any suitable object or attachment). The touch panel 4071 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 410, receives a command from the processor 410, and executes the command. In addition, the touch panel 4071 can be implemented by using various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 4071, the user input unit 407 may include other input devices 4072. Specifically, the other input devices 4072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a track ball, a mouse, and a joystick, which are not described herein again.
Further, the touch panel 4071 can be overlaid on the display panel 4061, and when the touch panel 4071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 410 to determine the type of the touch event, and then the processor 410 provides a corresponding visual output on the display panel 4061 according to the type of the touch event. Although in fig. 4, the touch panel 4071 and the display panel 4061 are two independent components to implement the input and output functions of the terminal device, in some embodiments, the touch panel 4071 and the display panel 4061 may be integrated to implement the input and output functions of the terminal device, which is not limited herein.
The interface unit 408 is an interface for connecting an external device to the terminal apparatus 400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless resource port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 408 may be used to receive input (e.g., resource information, power, etc.) from an external device and transmit the received input to one or more elements within the terminal apparatus 400 or may be used to transmit resources between the terminal apparatus 400 and an external device.
The memory 409 may be used to store software programs as well as various resources. The memory 409 may mainly include a storage program area and a storage resource area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage resource area may store resources (such as audio resources, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 409 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 410 is a control center of the terminal device, connects various parts of the entire terminal device by using various interfaces and lines, and performs various functions and processing resources of the terminal device by running or executing software programs and/or modules stored in the memory 409 and calling resources stored in the memory 409, thereby performing overall monitoring of the terminal device. Processor 410 may include one or more processing units; preferably, the processor 410 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The terminal device 400 may further include a power supply 411 (such as a battery) for supplying power to each component, and preferably, the power supply 411 may be logically connected to the processor 410 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
In addition, the terminal device 400 includes some functional modules that are not shown, and are not described in detail herein.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the steps of the data processing method of an embodiment of the present invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A data processing method, comprising:
acquiring non-numerical and/or discrete field data in a first sample set;
reducing the dimension of the field data to obtain a boundary sample set related to the first sample set;
determining at least one neighbor sample of random samples according to the probability value of each sample in the first sample set;
generating a second set of samples comprising a minority class of samples based on the at least one neighbor sample and the set of boundary samples.
2. The method of claim 1, wherein the dimension reduction of the field data to obtain a boundary sample set related to the first sample set comprises:
encoding the field data through one-hot encoding to obtain encoded field data;
and performing dimensionality reduction on the encoded field data using Principal Component Analysis (PCA) to obtain the boundary sample set.
3. The method of claim 1 or 2, wherein determining the probability value for each sample comprises:
dividing the first sample set into a first training set and a first prediction set, wherein the first training set and the first prediction set each comprise a plurality of samples;
predicting the first prediction set by a Gradient Boosting Decision Tree (GBDT) algorithm to obtain a probability value of each sample in the first prediction set;
determining the first set of samples as a second prediction set and the first prediction set as the second set of samples;
predicting the second prediction set through the GBDT algorithm to obtain a probability value of each sample in the second prediction set;
and determining the probability value of each sample according to the probability value of each sample in the first prediction set and the probability value of each sample in the second prediction set.
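A minimal sketch of the two-way scoring in claim 3, under the swap reading adopted above: train a GBDT on one half, score the other half, then exchange roles so that every sample receives an out-of-fold probability value. The function name, the 50/50 stratified split, and the assumption that labels are binary with 1 as the minority class are illustrative choices, not the patent's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def gbdt_probability_values(X, y, seed=0):
    """Two-way GBDT scoring: each half is predicted by a model trained on the other."""
    idx_a, idx_b = train_test_split(np.arange(len(y)), test_size=0.5,
                                    random_state=seed, stratify=y)
    proba = np.empty(len(y))
    for train_idx, pred_idx in ((idx_a, idx_b), (idx_b, idx_a)):
        model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        proba[pred_idx] = model.predict_proba(X[pred_idx])[:, 1]  # P(label 1), assumed minority
    return proba
```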
4. The method of claim 3, wherein determining at least one neighbor sample of a random sample according to the probability value of each sample in the first sample set comprises:
determining at least one neighbor sample of the random sample by a K-nearest neighbor (KNN) algorithm according to the probability value of each sample in the first sample set.
5. The method of claim 4, wherein determining at least one neighbor sample of the random sample by the K-nearest neighbor algorithm according to the probability value of each sample in the first sample set comprises:
obtaining the probability value of a random number selected according to the probability value of each sample in the first sample set;
determining, as the random sample, the random number whose probability value satisfies a preset condition;
and determining at least one neighbor sample of the random sample.
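One possible reading of claims 4 and 5, sketched below: draw a base sample with probability proportional to its GBDT score, restrict the draw to candidates whose score satisfies a preset threshold, and fetch the candidate's K nearest neighbors. The threshold value and the proportional-draw rule are assumptions, since the claims leave the "preset condition" open; the function name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pick_random_sample_and_neighbors(X, proba, threshold=0.5, k=5, rng=None):
    """Draw one base sample passing the preset condition, weighted by its
    probability value, and return its k nearest neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    eligible = np.flatnonzero(proba >= threshold)        # the assumed "preset condition"
    weights = proba[eligible] / proba[eligible].sum()
    i = rng.choice(eligible, p=weights)                  # probability-weighted draw
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)      # refit per call; fine for a sketch
    _, idx = nn.kneighbors(X[i:i + 1])
    return i, idx[0, 1:]                                 # base index, neighbor indices
```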
6. The method of claim 5, further comprising:
in a case where the at least one neighbor sample belongs to the boundary sample set,
performing difference calculation on continuous data in the random sample and the at least one neighbor sample by a Synthetic Minority Oversampling Technique (SMOTE) algorithm to obtain a first frequency value of each sample;
selecting a target frequency value satisfying a first preset frequency condition from the plurality of first frequency values;
and determining the target frequency value as a frequency value of the second sample set.
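The difference calculation named in claim 6 is the standard SMOTE interpolation, sketched minimally below. How the resulting candidates are scored into "first frequency values" and filtered by the preset frequency condition is not specified in the claim, so the sketch only produces the synthetic point itself.

```python
import numpy as np

def smote_interpolate(x, neighbor, rng=None):
    """Synthesize one minority sample between x and a boundary neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    gap = rng.random()                 # uniform gap in [0, 1)
    return x + gap * (neighbor - x)    # difference calculation on continuous fields
```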
7. The method of claim 5, further comprising:
in a case where the at least one neighbor sample includes a sample that does not belong to the boundary sample set,
performing difference calculation on continuous data in the random sample and the at least one neighbor sample by a step mutation algorithm in a genetic algorithm to obtain a second frequency value of each sample;
selecting a target frequency value satisfying a second preset frequency condition from the plurality of second frequency values;
and determining the target frequency value as a frequency value of the second sample set.
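For neighbors outside the boundary set, claim 7 substitutes a step-mutation operator from genetic algorithms. The sketch below perturbs roughly half of the continuous fields by a fixed fraction of the difference toward the neighbor; the step size, the per-field mutation probability, and the function name are assumptions, since the claim fixes none of them.

```python
import numpy as np

def step_mutate(x, neighbor, step=0.1, rng=None):
    """Perturb continuous fields of x toward a non-boundary neighbor in discrete steps."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < 0.5            # mutate roughly half the fields
    return x + mask * step * (neighbor - x)     # fixed-fraction step toward the neighbor
```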
8. The method of claim 7, wherein if the second frequency values of the samples are identical, any one of the second frequency values is randomly selected as the target frequency value.
9. The method of claim 6 or 7, wherein generating the second sample set comprising minority class samples based on the at least one neighbor sample and the boundary sample set comprises:
generating the second sample set comprising minority class samples according to the target frequency value, the at least one neighbor sample, and the boundary sample set.
10. The method of claim 1, wherein generating a second sample set comprising minority class samples based on the at least one neighbor sample and the boundary sample set comprises:
repeatedly determining at least one neighbor sample of a random sample according to the number of samples in the first sample set, and generating the second sample set based on the obtained neighbor samples and the boundary sample set;
wherein the second sample set is used for screening minority class samples in any sample set, and the minority class samples comprise feature data of off-network users.
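Putting the pieces together, the repetition in claim 10 could be driven by a loop such as the following. It reuses the illustrative functions sketched above (pick_random_sample_and_neighbors, smote_interpolate, step_mutate; the boundary mask and probability values come from boundary_sample_set and gbdt_probability_values) and assumes binary labels with 1 as the minority class; none of these names or choices are the patent's own implementation.

```python
import numpy as np

def generate_second_sample_set(X, y, proba, boundary, k=5, seed=0):
    """Repeat the draw/synthesize step until the minority class matches the majority count."""
    rng = np.random.default_rng(seed)
    n_needed = int((y == 0).sum() - (y == 1).sum())   # assumes label 1 is the minority class
    synthetic = []
    for _ in range(n_needed):
        i, nb_idx = pick_random_sample_and_neighbors(X, proba, k=k, rng=rng)
        j = rng.choice(nb_idx)                        # one neighbor per synthetic sample
        if boundary[nb_idx].all():                    # all neighbors in the boundary set: claim 6
            synthetic.append(smote_interpolate(X[i], X[j], rng=rng))
        else:                                         # a neighbor outside the set: claim 7
            synthetic.append(step_mutate(X[i], X[j], rng=rng))
    return np.vstack(synthetic)
```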
11. A data processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire non-numerical and/or discrete field data in a first sample set;
an adjusting module, configured to reduce the dimension of the field data to obtain a boundary sample set related to the first sample set;
a processing module, configured to determine at least one neighbor sample of a random sample according to the probability value of each sample in the first sample set;
and a generating module, configured to generate a second sample set comprising minority class samples based on the at least one neighbor sample and the boundary sample set.
12. A terminal device, characterized in that it comprises a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the data processing method according to any one of claims 1 to 10.
13. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a computer, causes the computer to carry out the data processing method according to any one of claims 1 to 10.
CN201911384128.1A 2019-12-28 2019-12-28 Data processing method, device, equipment and storage medium Pending CN113052198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911384128.1A CN113052198A (en) 2019-12-28 2019-12-28 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113052198A true CN113052198A (en) 2021-06-29

Family

ID=76507412

Country Status (1)

Country Link
CN (1) CN113052198A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm
CN107341497A (en) * 2016-11-11 2017-11-10 东北大学 The unbalanced weighting data streams Ensemble classifier Forecasting Methodology of sampling is risen with reference to selectivity
CN109977977A (en) * 2017-12-28 2019-07-05 中移信息技术有限公司 A kind of method and corresponding intrument identifying potential user
CN108596199A (en) * 2017-12-29 2018-09-28 北京交通大学 Unbalanced data classification method based on EasyEnsemble algorithms and SMOTE algorithms
CN109325844A (en) * 2018-06-25 2019-02-12 南京工业大学 Net under multidimensional data borrows borrower's credit assessment method
CN109086412A (en) * 2018-08-03 2018-12-25 北京邮电大学 A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN110266672A (en) * 2019-06-06 2019-09-20 华东理工大学 Network inbreak detection method based on comentropy and confidence level down-sampling
CN110348486A (en) * 2019-06-13 2019-10-18 中国科学院计算机网络信息中心 Based on sampling and feature brief non-equilibrium data collection conversion method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG Huamin (封化民): "Research on network intrusion detection method based on SMOTE and GBDT", Application Research of Computers, vol. 34, no. 12, pp. 3745-3748 *
SHANG Xu (尚旭); XIE Linsen (谢林森): "A distance-boundary synthetic minority class oversampling technique", Journal of Lishui University, no. 02, pp. 7-13 *
ZHANG Chenning (张宸宁); LI Guocheng (李国成): "Imbalanced data classification based on BL-SMOTE and random forest", Journal of Beijing Information Science & Technology University (Natural Science Edition), no. 02, pp. 26-31 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222300A (en) * 2022-02-23 2022-03-22 南京理工大学 Method and equipment for detecting local area network intrusion of vehicle-mounted controller
CN115238837A (en) * 2022-09-23 2022-10-25 荣耀终端有限公司 Data processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination